Part 1. Preliminary data preparation
AirQualityUCI
Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality
import pandas as pd
df = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';')
df.head(3)
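As a side note, read_csv can parse the comma decimal separators directly via the decimal parameter. This is only a sketch of an alternative load, assuming the same file; the tutorial below keeps the original load and converts the text columns by hand.
# alternative load (sketch): decimal=',' parses '2,6' as 2.6 at read time
df_alt = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';', decimal=',')
df_alt.head(3)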
Data Set Information:
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground truth hourly averaged concentrations for CO, Non-Methanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.
Attribute Information:
Date (DD/MM/YYYY)
Time (HH.MM.SS)
True hourly averaged concentration CO in mg/m^3 (reference analyzer)
PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
True hourly averaged overall Non-Methanic Hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
True hourly averaged NOx concentration in ppb (reference analyzer)
PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
Temperature in °C
Relative Humidity (%)
AH Absolute Humidity
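To confirm that the columns in the loaded file match this list, the column names can be printed:
print(df.columns.tolist())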
Tutorial: Supplementing data for further analysis
Data completeness check
df.isnull().sum()
There are a lot of missing values. In addition, we know from the dataset description that the value -200 means no data. We will deal with this in a moment. First, let's check the basic statistics of the variables in the dataset.
df.agg(['min', 'max', 'mean', 'median'])
df.shape
We delete two empty columns.
del df['Unnamed: 15']
del df['Unnamed: 16']
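An equivalent, more general one-liner (a sketch; the explicit del calls above already did the job here) is to drop every column that is entirely empty:
# drops any column whose values are all NaN
df = df.dropna(axis=1, how='all')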
Step 1: Preliminary analysis of data gaps
One more look at how many NaN cells there are.
df.isnull().sum()
Now let's look at the rows containing these empty cells.
df[df['NMHC(GT)'].isnull()]
These rows are completely empty. The device was probably cut off from the power supply, so no sensor was recording.
df = df.dropna(how='all')
df.isnull().sum()
We now look for the value -200, which marks missing data. The -200 values are stored in several different ways (as a number and as text, with different decimal formats), so the replacement has to be done in several passes.
import numpy as np
df = df.replace(-200, np.nan)
df = df.replace('-200', np.nan)
df = df.replace('-200.0', np.nan)
df = df.replace('-200,0', np.nan)
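The four calls above can also be written as a single replace by passing a list of the sentinel values; this is just a more compact equivalent:
df = df.replace([-200, '-200', '-200.0', '-200,0'], np.nan)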
The value of -200 has been changed to NaN and we will see how many empty records there are now.
df.isnull().sum()
Chart of missing data structure.
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
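A quick numeric complement to the heatmap is the share of missing values per column, for example:
# percentage of missing values per column, largest first
(df.isnull().mean() * 100).sort_values(ascending=False).round(1)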
The NMHC(GT) variable is the most incomplete, so we exclude it from further analysis.
del df['NMHC(GT)']
We display the records with missing data using the isna() function.
df1 = df[df.isna().any(axis=1)]
df1
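The number of affected rows can also be counted directly:
print('rows with at least one missing value:', df.isna().any(axis=1).sum())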
Step 2: Check the level of direct correlation to complete the data
CO(GT) is missing data every few measurements, so we need to check which variables it correlates with. First, I check the data types, because computing correlations requires numeric columns.
df.dtypes
# df['CO(GT)'].astype(float)
ValueError: could not convert string to float: '2,6'
It turns out that converting the text to numeric format is not straightforward: the problem is the commas used as decimal separators. We replace the commas with dots.
df['CO(GT)'] = df['CO(GT)'].str.replace(',','.')
df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',','.')
df['T'] = df['T'].str.replace(',','.')
df['RH'] = df['RH'].str.replace(',','.')
df['AH'] = df['AH'].str.replace(',','.')
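A more general sketch, assuming the comma-decimal columns are exactly the remaining object-dtype columns other than Date and Time, would do the same replacement in a loop:
# hypothetical loop version of the five replacements above
text_cols = df.select_dtypes(include='object').columns.difference(['Date', 'Time'])
for col in text_cols:
    df[col] = df[col].str.replace(',', '.', regex=False)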
We change the data type from object to float.
df[['CO(GT)','C6H6(GT)', 'T','RH','AH']] = df[['CO(GT)','C6H6(GT)', 'T','RH','AH']].astype(float)
df.dtypes
We can now check the level of direct correlation.
df.corr(numeric_only=True)
sns.set(style="ticks")
corr = df.corr(numeric_only=True)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(22, 10))
cmap = sns.diverging_palette(180, 50, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.3, center=0.1, annot=True,
            square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
plt.show()
Step 3. Filling the gaps in a variable based on other variables correlated with it
Filling gaps in the CO(GT) variable.
I check which variable CO(GT) is most strongly correlated with and fill the gaps based on that variable; whatever cannot be filled this way gets the last or next available value.
df.corr(numeric_only=True)
df.dtypes
print('missing value in CO(GT): ',df['CO(GT)'].isnull().sum())
Correlation of CO(GT) with the other variables.
CORREL = df.corr(numeric_only=True)
CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
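The most strongly correlated variable can also be picked out programmatically, for example:
# name of the column with the highest absolute correlation to CO(GT)
print(CORREL['CO(GT)'].drop('CO(GT)').abs().idxmax())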
The strongest correlation with CO(GT) is for C6H6(GT), which is fairly complete. Based on this variable, I fill in the gaps in CO(GT).
df['CO(GT)'] = df.groupby('C6H6(GT)')['CO(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['CO(GT)'].isnull().sum())
df['CO(GT)'] = df.groupby('PT08.S1(CO)')['CO(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['CO(GT)'].isnull().sum())
Now I do a simple fill with the last valid value (forward fill).
df['CO(GT)'] = df['CO(GT)'].ffill()
print('missing value: ',df['CO(GT)'].isnull().sum())
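The same fill pattern (group by a strongly correlated column, forward- and backward-fill within each group, then a plain forward fill for anything left over) repeats for the remaining variables, so it can be captured in a small helper. This is only a sketch; the steps below keep the explicit calls.
def fill_from_correlated(data, target, by):
    # fill gaps in `target` within groups of the correlated column `by`
    data[target] = data.groupby(by)[target].transform(lambda x: x.ffill().bfill())
    # whatever is still missing gets the last valid value
    data[target] = data[target].ffill()
    return data

# hypothetical usage, equivalent to the CO(GT) steps above:
# df = fill_from_correlated(df, 'CO(GT)', 'C6H6(GT)')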
Filling gaps in the variable 'C6H6(GT)'
print('missing value: ',df['C6H6(GT)'].isnull().sum())
df['C6H6(GT)'] = df.groupby('CO(GT)')['C6H6(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['C6H6(GT)'].isnull().sum())
Filling gaps in the variable 'NOx(GT)'
print('missing value: ', df['NOx(GT)'].isnull().sum())
CORREL['NOx(GT)'].to_frame().sort_values('NOx(GT)')
df['NOx(GT)'] = df.groupby('CO(GT)')['NOx(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['NOx(GT)'].isnull().sum())
Filling gaps in the variable 'NO2(GT)'
print('missing value: ',df['NO2(GT)'].isnull().sum())
CORREL['NO2(GT)'].to_frame().sort_values('NO2(GT)')
df['NO2(GT)'] = df.groupby('PT08.S5(O3)')['NO2(GT)'].transform(lambda x: x.ffill().bfill())
df['NO2(GT)'] = df.groupby('CO(GT)')['NO2(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['NO2(GT)'].isnull().sum())
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='YlGnBu')
plt.show()
Finally, I fill in the records where the entire measuring device was not working; in the plot above these appear as solid horizontal lines.
df.shape
df = df.ffill()
df.shape
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='Reds')
plt.show()
df.isnull().sum()
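As a final sanity check (a simple sketch), we can assert that no missing values remain anywhere in the frame:
assert df.isnull().sum().sum() == 0, 'there are still missing values'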
The data have been completed! In the next parts of this tutorial, we will build a linear regression model in TensorFlow.
Let’s save the completed file to disk
df.to_csv('c:/TF/AirQ_filled.csv')
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)
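Note that to_csv writes the DataFrame index as an extra column by default, which comes back as an 'Unnamed: 0' column when the file is re-read. If that is not wanted, the index can be skipped (a hypothetical variant):
df.to_csv('c:/TF/AirQ_filled.csv', index=False)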
