Part 2. Simple multifactorial linear regression

In the previous part of this tutorial, we cleaned the data file from the measuring station. A new, completed measurement data file has been created, which we will now open.

We will now continue to prepare data for further analysis. One of the most important variables describing in linear regression is time. Most artificial and natural phenomena operate in hourly, daily and monthly cycles.

import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)

Step 1. Launching the time variable

We check what the date format is

df[['Date','Time']].dtypes

Date    object
Time    object
dtype: object

There is no date format in dataframe. Link columns containing time.

df['DATE'] = df['Date']+' '+df['Time']
df['DATE'].head()

0    10/03/2004 18.00.00
1    10/03/2004 19.00.00
2    10/03/2004 20.00.00
3    10/03/2004 21.00.00
4    10/03/2004 22.00.00
Name: DATE, dtype: object

We create a new column containing the date and time. Then we convert the object format to the date format.

df['DATE'] = pd.to_datetime(df.DATE, format='df.dtypes

Unnamed: 0                int64
Date                     object
Time                     object
CO(GT)                  float64
PT08.S1(CO)             float64
C6H6(GT)                float64
PT08.S2(NMHC)           float64
NOx(GT)                 float64
PT08.S3(NOx)            float64
NO2(GT)                 float64
PT08.S4(NO2)            float64
PT08.S5(O3)             float64
T                       float64
RH                      float64
AH                      float64
DATE             datetime64[ns]
dtype: object

Step 2. We add more columns based on the time variable

In industry, the day of the week is very important, so in such models it is worth adding a column with the number of the day.

df['Month'] = df['DATE'].dt.month
df['Weekday'] = df['DATE'].dt.weekday
df['Weekday_name'] = df['DATE'].dt.weekday_name
df['Hours'] = df['DATE'].dt.hour

df[['DATE','Month','Weekday','Weekday_name','Hours']].sample(3)

Graphical analysis of pollution according to time variables

df.pivot_table(index='Weekday_name', values='CO(GT)', aggfunc='mean').plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x12f20bbdb38>

df.pivot_table(index='Month', values='CO(GT)', aggfunc='mean').plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x12f20f02320>

<matplotlib.axes._subplots.AxesSubplot at 0x12f210e77f0>

Text(0, 0.5, 'Continuous independent variables')

<Figure size 720x576 with 0 Axes>

Optimal shift for RH:  12

0.39204313671898056

Optimal shift for AH:  12

0.043756364102677595

Optimal shift for T:  12

-0.22446569561762522

df.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean').plot(kind='bar')

<matplotlib.axes._subplots.AxesSubplot at 0x12f210e77f0>

Text(0, 0.5, 'Continuous independent variables')

<Figure size 720x576 with 0 Axes>

Optimal shift for RH:  12

0.39204313671898056

Optimal shift for AH:  12

0.043756364102677595

Optimal shift for T:  12

-0.22446569561762522

C:ProgramDataAnaconda3envsOLD_TFlibsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

Step 3. Correlation analysis¶

we set the result variable as:

CO(GT) – actual hourly average CO concentration in mg / m^3 (reference analyzer)

del df['Unnamed: 0']

CORREL = df.corr()
PKP = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PKP

import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
PKP.plot(kind='barh', color='red')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')

Text(0, 0.5, 'Continuous independent variables')

<Figure size 720x576 with 0 Axes>

Optimal shift for RH:  12

0.39204313671898056

Optimal shift for AH:  12

0.043756364102677595

Optimal shift for T:  12

-0.22446569561762522

C:ProgramDataAnaconda3envsOLD_TFlibsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Variables based on time are not very well correlated with the described variable: CO (GT).
The temptation arises to use better, better correlated independent variables for the model. The problem is that these variables can be part of the result. So if there is pollution then all substances are in the air.

Our task is to examine how weather and time affect the level of pollution. We’ll cover this task in the next part of the tutorial.

Step 4. We are now checking shift

for independent variables with low direct correlation.
How does weather affect CO2 levels?

def cross_corr(x, y, lag=0):
    return x.corr(y.shift(lag))

def shift_Factor(x,y,R):
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    
    # R factor is the number of the shifts who should be checked by the function
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)
    
    # We find optimal correlation shift
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()
    
    return SF

Optimal shift for RH:  12

0.39204313671898056

Optimal shift for AH:  12

0.043756364102677595

Optimal shift for T:  12

-0.22446569561762522

C:ProgramDataAnaconda3envsOLD_TFlibsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

x = df.RH       # independent variable
y = df['CO(GT)']    # dependent variable
R = 20           # number of shifts who will be checked

SKO = shift_Factor(x,y,R)
print('Optimal shift for RH: ',SKO)

Optimal shift for RH:  12

0.39204313671898056

Optimal shift for AH:  12

0.043756364102677595

Optimal shift for T:  12

-0.22446569561762522

C:ProgramDataAnaconda3envsOLD_TFlibsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

cross_corr(x, y, lag=SKO)

0.39204313671898056

Variable AH – Absolute humidity

We check a variable with very low correlation with the resulting CO (GT) variable

x = df.AH       # independent variable
SKP = shift_Factor(x,y,R)
print('Optimal shift for AH: ',SKP)

Optimal shift for AH:  12

0.043756364102677595

Optimal shift for T:  12

-0.22446569561762522

C:ProgramDataAnaconda3envsOLD_TFlibsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

Mean Squared Error:      0.15437562015505324

cross_corr(x, y, lag=SKP)

0.043756364102677595

Absolute humidity AH does not correlate with the variable CO (GT) so we eliminate it from the model

Variable: Temperature in ° C¶

We check a variable with very low correlation with the resulting CO (GT) variable.

x = df['T']      # independent variable
PKP = shift_Factor(x,y,R)
print('Optimal shift for T: ',PKP)

Optimal shift for T:  12

-0.22446569561762522

C:ProgramDataAnaconda3envsOLD_TFlibsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

Mean Squared Error:      0.15437562015505324

cross_corr(x, y, lag=PKP)

-0.22446569561762522

We are now creating a new DataFrame with a 12-hour shift

It turns out that temperature and humidity only correlate after 12 hours from the time the CO contamination changes.
Data shift creation function.

def df_shif(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return  pd.DataFrame(data=new)

Our goal is to create a multiple regression model:

- Independent variables are: Temperature (T) and Relative humidity RH 
- The dependent variable is the level of CO (GT)

df2 = df[['DATE', 'CO(GT)','RH', 'T']]

Adds a date and time to record temperature and humidity.

df2['weather_time'] = df2['DATE']

C:ProgramDataAnaconda3envsOLD_TFlibsite-packagesipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

Mean Squared Error:      0.15437562015505324

df2.head(3)

df3 = df_shif(df2, 'weather_time', lag=12)
df3.rename(columns={'weather_time':'Shift_weather_time'}, inplace=True) 
df3.head(13)

df4 = df_shif(df3, 'RH', lag=12)
df4.rename(columns={'RH':'Shift_RH'}, inplace=True)

df5 = df_shif(df4, 'T', lag=12)
df5.rename(columns={'T':'Shift_T'}, inplace=True)

Deletes rows with incomplete data.

df5 = df5.dropna(how ='any')

df5.head()

The table can be understood as meaning that a specific temperature at 6:00 gives a specific concentration of carbon monoxide at 18:00.

Graphical analysis of relationships – Humidity and temperature to carbon monoxide

It looks rather poor

import matplotlib.pyplot as plt

df5.plot(x='Shift_T', y='CO(GT)', style='o')  
plt.title('Shift_T vs CO(GT)')  
plt.xlabel('Shift_T')  
plt.ylabel('CO(GT)')  
plt.show()

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

Mean Squared Error:      0.15437562015505324

df5.plot(x='Shift_RH', y='CO(GT)', style='o')  
plt.title('Shift_RH vs CO(GT)')  
plt.xlabel('Shift_RH')  
plt.ylabel('CO(GT)')  
plt.show()

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

Mean Squared Error:      0.15437562015505324

Step 5. Building a multiple linear regression model in Sklearn

Declares X, y variables into the model.

X = df5[['Shift_RH', 'Shift_T']].values
y = df5['CO(GT)'].values

I divide the collection into training variables and test variables.

from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

regressor = LinearRegression()  
regressor.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of variables from the model with real variables.

dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)

from sklearn import metrics

dfKK.head(50).plot()

<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

Mean Squared Error:      0.15437562015505324

from sklearn import metrics

print('Mean Absolute Error:    ', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:     ', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756

Mean Squared Error:      0.15437562015505324

print('Mean Squared Error:     ', metrics.r2_score(y_test, y_pred))

Mean Squared Error:      0.15437562015505324

Carbon monoxide contamination cannot be predicted based on humidity and temperature.
In the next part, we will continue the analysis and preparation of data for linear regression.

	Unnamed: 0	Date	Time	CO(GT)	PT08.S1(CO)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH
0	0	10/03/2004	18.00.00	2.6	1360.0	11.9	1046.0	166.0	1056.0	113.0	1692.0	1268.0	13.6	48.9	0.7578
1	1	10/03/2004	19.00.00	2.0	1292.0	9.4	955.0	103.0	1174.0	92.0	1559.0	972.0	13.3	47.7	0.7255
2	2	10/03/2004	20.00.00	2.2	1402.0	9.0	939.0	131.0	1140.0	114.0	1555.0	1074.0	11.9	54.0	0.7502

	DATE	Month	Weekday	Weekday_name	Hours
6109	2004-11-20 07:00:00	11	5	Saturday	7
3537	2004-08-05 03:00:00	8	3	Thursday	3
8053	2005-02-09 07:00:00	2	2	Wednesday	7

	DATE	CO(GT)	RH	T	weather_time
0	2004-03-10 18:00:00	2.6	48.9	13.6	2004-03-10 18:00:00
1	2004-03-10 19:00:00	2.0	47.7	13.3	2004-03-10 19:00:00
2	2004-03-10 20:00:00	2.2	54.0	11.9	2004-03-10 20:00:00

	DATE	CO(GT)	RH	T	Shift_weather_time
0	NaT	NaN	NaN	NaN	2004-03-10 18:00:00
1	NaT	NaN	NaN	NaN	2004-03-10 19:00:00
2	NaT	NaN	NaN	NaN	2004-03-10 20:00:00
3	NaT	NaN	NaN	NaN	2004-03-10 21:00:00
4	NaT	NaN	NaN	NaN	2004-03-10 22:00:00
5	NaT	NaN	NaN	NaN	2004-03-10 23:00:00
6	NaT	NaN	NaN	NaN	2004-03-11 00:00:00
7	NaT	NaN	NaN	NaN	2004-03-11 01:00:00
8	NaT	NaN	NaN	NaN	2004-03-11 02:00:00
9	NaT	NaN	NaN	NaN	2004-03-11 03:00:00
10	NaT	NaN	NaN	NaN	2004-03-11 04:00:00
11	NaT	NaN	NaN	NaN	2004-03-11 05:00:00
12	2004-03-10 18:00:00	2.6	48.9	13.6	2004-03-11 06:00:00

	DATE	CO(GT)	Shift_RH	Shift_T	Shift_weather_time
36	2004-03-10 18:00:00	2.6	58.1	10.5	2004-03-11 06:00:00
37	2004-03-10 19:00:00	2.0	59.6	10.2	2004-03-11 07:00:00
38	2004-03-10 20:00:00	2.2	57.4	10.8	2004-03-11 08:00:00
39	2004-03-10 21:00:00	2.2	60.6	10.5	2004-03-11 09:00:00
40	2004-03-10 22:00:00	1.6	58.4	10.8	2004-03-11 10:00:00

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński

Tutorial: Linear Regression – Time variables and shifts. Use of offset in variable correlation (#2/271120191334)