Part 1. Preliminary data preparation
AirQualityUCI
Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality
import pandas as pd
df = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';')
df.head(3)
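As a side note, read_csv can parse the comma decimal separators directly via the decimal parameter. This is only a sketch of an alternative load, assuming the same file; the tutorial below keeps the original load and converts the text columns by hand.
# alternative load (sketch): decimal=',' parses '2,6' as 2.6 at read time
df_alt = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';', decimal=',')
df_alt.head(3)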
Data Set Information:
The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level, within an Italian city. Data were recorded from March 2004 to February 2005 (one year), representing the longest freely available recordings of on-field deployed air quality chemical sensor device responses. Ground truth hourly averaged concentrations for CO, Non-Methanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) were provided by a co-located reference certified analyzer. Evidence of cross-sensitivities as well as both concept and sensor drifts is present, as described in De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008 (citation required), eventually affecting the sensors' concentration estimation capabilities. Missing values are tagged with the value -200.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.
Attribute Information:
Date (DD/MM/YYYY)
Time (HH.MM.SS)
True hourly averaged concentration CO in mg/m^3 (reference analyzer)
PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
True hourly averaged overall Non-Methanic Hydrocarbons (NMHC) concentration in microg/m^3 (reference analyzer)
True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
True hourly averaged NOx concentration in ppb (reference analyzer)
PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
Temperature in °C
Relative Humidity (%)
AH Absolute Humidity
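To confirm that the columns in the loaded file match this list, the column names can be printed:
print(df.columns.tolist())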
Tutorial: Supplementing data for further analysis
Data completeness check
df.isnull().sum()
There are a lot of missing values. In addition, we know from the dataset description that the value -200 means no data. We will deal with this in a moment. First, let's check the basic statistics of the variables in the dataset.
df.agg(['min', 'max', 'mean', 'median'])
df.shape
We delete two empty columns.
del df['Unnamed: 15']
del df['Unnamed: 16']
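An equivalent, more general one-liner (a sketch; the explicit del calls above already did the job here) is to drop every column that is entirely empty:
# drops any column whose values are all NaN
df = df.dropna(axis=1, how='all')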
Step 1: Preliminary analysis of data gaps
One more look at how many NaN cells there are.
df.isnull().sum()
Now let's look at the rows containing these empty cells.
df[df['NMHC(GT)'].isnull()]
These rows are completely empty. The device was probably cut off from the power supply, so no sensor was recording.
df = df.dropna(how='all')
df.isnull().sum()
We now look for the value -200, which marks missing data. The -200 values are stored in several different ways (as a number and as text, with different decimal formats), so the replacement has to be done in several passes.
import numpy as np
df = df.replace(-200, np.nan)
df = df.replace('-200', np.nan)
df = df.replace('-200.0', np.nan)
df = df.replace('-200,0', np.nan)
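The four calls above can also be written as a single replace by passing a list of the sentinel values; this is just a more compact equivalent:
df = df.replace([-200, '-200', '-200.0', '-200,0'], np.nan)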
The value of -200 has been changed to NaN and we will see how many empty records there are now.
df.isnull().sum()
Chart of missing data structure.
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
plt.show()
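A quick numeric complement to the heatmap is the share of missing values per column, for example:
# percentage of missing values per column, largest first
(df.isnull().mean() * 100).sort_values(ascending=False).round(1)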
The NMHC(GT) variable is the most incomplete, so we exclude it from further analysis.
del df['NMHC(GT)']
We display the records with missing data using the isna() function.
df1 = df[df.isna().any(axis=1)]
df1
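The number of affected rows can also be counted directly:
print('rows with at least one missing value:', df.isna().any(axis=1).sum())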
Step 2: Check the level of direct correlation to complete the data
CO(GT) is missing data every few measurements, so we need to check which variables it correlates with. First, I check the data types, because computing correlations requires numeric columns.
df.dtypes
# df['CO(GT)'].astype(float)
ValueError: could not convert string to float: '2,6'
It turns out that converting the text to numeric format is not straightforward: the problem is the commas used as decimal separators. We replace the commas with dots.
df['CO(GT)'] = df['CO(GT)'].str.replace(',','.')
df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',','.')
df['T'] = df['T'].str.replace(',','.')
df['RH'] = df['RH'].str.replace(',','.')
df['AH'] = df['AH'].str.replace(',','.')
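A more general sketch, assuming the comma-decimal columns are exactly the remaining object-dtype columns other than Date and Time, would do the same replacement in a loop:
# hypothetical loop version of the five replacements above
text_cols = df.select_dtypes(include='object').columns.difference(['Date', 'Time'])
for col in text_cols:
    df[col] = df[col].str.replace(',', '.', regex=False)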
We change the data type from object to float.
df[['CO(GT)','C6H6(GT)', 'T','RH','AH']] = df[['CO(GT)','C6H6(GT)', 'T','RH','AH']].astype(float)
df.dtypes
We can now check the level of direct correlation.
df.corr(numeric_only=True)
sns.set(style="ticks")
corr = df.corr(numeric_only=True)
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(22, 10))
cmap = sns.diverging_palette(180, 50, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.3, center=0.1, annot=True,
            square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
plt.show()
Step 3. Filling the gaps in a variable based on other variables correlated with it
Filling gaps in the CO(GT) variable.
I check which variable CO(GT) is most strongly correlated with and fill the gaps based on that variable; whatever cannot be filled this way gets the last or next available value.
df.corr(numeric_only=True)
df.dtypes
print('missing value in CO(GT): ',df['CO(GT)'].isnull().sum())
Correlation of CO(GT) with the other variables.
CORREL = df.corr(numeric_only=True)
CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
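The most strongly correlated variable can also be picked out programmatically, for example:
# name of the column with the highest absolute correlation to CO(GT)
print(CORREL['CO(GT)'].drop('CO(GT)').abs().idxmax())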
The strongest correlation with CO(GT) is for C6H6(GT), which is fairly complete. Based on this variable, I fill in the gaps in CO(GT).
df['CO(GT)'] = df.groupby('C6H6(GT)')['CO(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['CO(GT)'].isnull().sum())
df['CO(GT)'] = df.groupby('PT08.S1(CO)')['CO(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['CO(GT)'].isnull().sum())
Now I do a simple fill with the last valid value (forward fill).
df['CO(GT)'] = df['CO(GT)'].ffill()
print('missing value: ',df['CO(GT)'].isnull().sum())
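The same fill pattern (group by a strongly correlated column, forward- and backward-fill within each group, then a plain forward fill for anything left over) repeats for the remaining variables, so it can be captured in a small helper. This is only a sketch; the steps below keep the explicit calls.
def fill_from_correlated(data, target, by):
    # fill gaps in `target` within groups of the correlated column `by`
    data[target] = data.groupby(by)[target].transform(lambda x: x.ffill().bfill())
    # whatever is still missing gets the last valid value
    data[target] = data[target].ffill()
    return data

# hypothetical usage, equivalent to the CO(GT) steps above:
# df = fill_from_correlated(df, 'CO(GT)', 'C6H6(GT)')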
Filling gaps in the variable 'C6H6(GT)'
print('missing value: ',df['C6H6(GT)'].isnull().sum())
df['C6H6(GT)'] = df.groupby('CO(GT)')['C6H6(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['C6H6(GT)'].isnull().sum())
Filling gaps in the variable 'NOx(GT)'
print('missing value: ', df['NOx(GT)'].isnull().sum())
CORREL['NOx(GT)'].to_frame().sort_values('NOx(GT)')
df['NOx(GT)'] = df.groupby('CO(GT)')['NOx(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['NOx(GT)'].isnull().sum())
Filling gaps in the variable 'NO2(GT)'
print('missing value: ',df['NO2(GT)'].isnull().sum())
CORREL['NO2(GT)'].to_frame().sort_values('NO2(GT)')
df['NO2(GT)'] = df.groupby('PT08.S5(O3)')['NO2(GT)'].transform(lambda x: x.ffill().bfill())
df['NO2(GT)'] = df.groupby('CO(GT)')['NO2(GT)'].transform(lambda x: x.ffill().bfill())
print('missing value: ',df['NO2(GT)'].isnull().sum())
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='YlGnBu')
plt.show()
Finally, I fill in the records where the entire measuring device was not working; in the plot above these appear as solid horizontal lines.
df.shape
df = df.ffill()
df.shape
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='Reds')
plt.show()
df.isnull().sum()
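As a final sanity check (a simple sketch), we can assert that no missing values remain anywhere in the frame:
assert df.isnull().sum().sum() == 0, 'there are still missing values'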
The data have been completed! In the next parts of this tutorial, we will build a linear regression model in TensorFlow.
Let’s save the completed file to disk
df.to_csv('c:/TF/AirQ_filled.csv')
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)
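Note that to_csv writes the DataFrame index as an extra column by default, which comes back as an 'Unnamed: 0' column when the file is re-read. If that is not wanted, the index can be skipped (a hypothetical variant):
df.to_csv('c:/TF/AirQ_filled.csv', index=False)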
