Tutorial: Linear Regression – Time variables and shifts. Use of offset in variable correlation (#2/271120191334)


Part 2. Simple multiple linear regression

In the previous part of this tutorial, we cleaned the data file from the measuring station. A new, gap-filled measurement data file was created, which we will now open.

We will now continue preparing the data for further analysis. One of the most important explanatory variables in linear regression is time: most man-made and natural phenomena follow hourly, daily and monthly cycles.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

Step 1. Creating the time variable

We check what format the date and time columns have.

In [2]:
df[['Date','Time']].dtypes
Out[2]:
Date    object
Time    object
dtype: object

There is no datetime column in the DataFrame yet, so we concatenate the columns containing the date and the time.

In [3]:
df['DATE'] = df['Date']+' '+df['Time']
df['DATE'].head()
Out[3]:
0    10/03/2004 18.00.00
1    10/03/2004 19.00.00
2    10/03/2004 20.00.00
3    10/03/2004 21.00.00
4    10/03/2004 22.00.00
Name: DATE, dtype: object

We have created a new column containing both the date and the time. Now we convert it from the object format to the datetime format.

In [4]:
df['DATE'] = pd.to_datetime(df.DATE, format='%d/%m/%Y %H.%M.%S')
df.dtypes
Out[4]:
Unnamed: 0                int64
Date                     object
Time                     object
CO(GT)                  float64
PT08.S1(CO)             float64
C6H6(GT)                float64
PT08.S2(NMHC)           float64
NOx(GT)                 float64
PT08.S3(NOx)            float64
NO2(GT)                 float64
PT08.S4(NO2)            float64
PT08.S5(O3)             float64
T                       float64
RH                      float64
AH                      float64
DATE             datetime64[ns]
dtype: object

Step 2. We add more columns based on the time variable

In industry the day of the week is very important, so in such models it is worth adding a column with the weekday number.

In [5]:
df['Month'] = df['DATE'].dt.month
df['Weekday'] = df['DATE'].dt.weekday
df['Weekday_name'] = df['DATE'].dt.weekday_name   # deprecated in newer pandas; use df['DATE'].dt.day_name() instead
df['Hours'] = df['DATE'].dt.hour
In [6]:
df[['DATE','Month','Weekday','Weekday_name','Hours']].sample(3)
Out[6]:
DATE Month Weekday Weekday_name Hours
6109 2004-11-20 07:00:00 11 5 Saturday 7
3537 2004-08-05 03:00:00 8 3 Thursday 3
8053 2005-02-09 07:00:00 2 2 Wednesday 7

Graphical analysis of pollution according to time variables

In [7]:
df.pivot_table(index='Weekday_name', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20bbdb38>
In [8]:
df.pivot_table(index='Month', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20f02320>
In [9]:
df.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f210e77f0>

Step 3. Correlation analysis

We set the dependent (result) variable as:

CO(GT) – actual hourly averaged CO concentration in mg/m^3 (reference analyzer)

In [10]:
del df['Unnamed: 0']
In [11]:
CORREL = df.corr()
PKP = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PKP
Out[11]:
CO(GT)
PT08.S3(NOx) -0.715683
Weekday -0.140231
RH 0.020122
AH 0.025227
T 0.025639
Month 0.112291
Hours 0.344071
PT08.S4(NO2) 0.631854
NO2(GT) 0.682774
NOx(GT) 0.773677
PT08.S5(O3) 0.858762
PT08.S1(CO) 0.886114
PT08.S2(NMHC) 0.918386
C6H6(GT) 0.932584
CO(GT) 1.000000
In [12]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))   # note: PKP.plot() below opens its own figure, which is why an empty Figure also appears in the output
PKP.plot(kind='barh', color='red')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[12]:
Text(0, 0.5, 'Continuous independent variables')
<Figure size 720x576 with 0 Axes>

Variables based on time are not very well correlated with the dependent variable CO(GT).
There is a temptation to use the better-correlated independent variables in the model. The problem is that these variables are themselves part of the outcome: when there is pollution, all of these substances are in the air together.

Our task is to examine how weather and time affect the level of pollution. We’ll cover this task in the next part of the tutorial.

Step 4. We now check the shift for independent variables with a low direct correlation

How does the weather affect the CO level?

Variable RH – Relative humidity (%)

We check a variable with a very low direct correlation with the dependent CO(GT) variable.

In [13]:
def cross_corr(x, y, lag=0):
    # correlation of x with y shifted down by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x, y, R):
    # R is the number of shifts that should be checked by the function
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]

    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)

    # we pick the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()

    return SF
In [14]:
x = df.RH       # independent variable
y = df['CO(GT)']    # dependent variable
R = 20           # number of shifts that will be checked
In [15]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for RH: ',SKO)
Optimal shift for RH:  12
In [16]:
cross_corr(x, y, lag=SKO)
Out[16]:
0.39204313671898056

Variable AH – Absolute humidity

We check another variable with a very low direct correlation with the dependent CO(GT) variable.

In [17]:
x = df.AH       # independent variable
SKP = shift_Factor(x,y,R)
print('Optimal shift for AH: ',SKP)
Optimal shift for AH:  12
In [18]:
cross_corr(x, y, lag=SKP)
Out[18]:
0.043756364102677595

Even after the shift, absolute humidity AH barely correlates with CO(GT), so we eliminate it from the model.

Variable T – Temperature in °C

We check another variable with a very low direct correlation with the dependent CO(GT) variable.

In [19]:
x = df['T']      # independent variable
PKP = shift_Factor(x,y,R)
print('Optimal shift for T: ',PKP)
Optimal shift for T:  12
In [20]:
cross_corr(x, y, lag=PKP)
Out[20]:
-0.22446569561762522
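Before committing to the 12-hour figure, it can help to look at the whole cross-correlation profile rather than only its maximum. A small optional sketch (not part of the original notebook), reusing cross_corr, y and R defined above:

# Optional: correlation of temperature with CO(GT) as a function of the lag
profile = [cross_corr(df['T'], y, lag=i) for i in range(R)]
pd.Series(profile, index=range(R)).plot(marker='o', title='corr(T, CO(GT) shifted by lag)')
plt.xlabel('Lag in hours')
plt.ylabel('Correlation')
plt.show()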

We now create a new DataFrame with a 12-hour shift

It turns out that temperature and humidity only correlate with CO at a 12-hour offset from the moment the CO contamination changes.
Below is the function that creates the shifted data.

In [21]:
def df_shif(df, target=None, lag=0):
    # returns a copy of df in which every column except `target` is shifted down by `lag` rows
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)

Our goal is to create a multiple regression model:

- The independent variables are temperature (T) and relative humidity (RH)
- The dependent variable is the level of CO(GT)
In [22]:
df2 = df[['DATE', 'CO(GT)','RH', 'T']]   # note: adding .copy() here would avoid the SettingWithCopyWarning shown below

We add a column with the date and time at which the temperature and humidity were recorded.

In [23]:
df2['weather_time'] = df2['DATE']
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [24]:
df2.head(3)
Out[24]:
DATE CO(GT) RH T weather_time
0 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-10 18:00:00
1 2004-03-10 19:00:00 2.0 47.7 13.3 2004-03-10 19:00:00
2 2004-03-10 20:00:00 2.2 54.0 11.9 2004-03-10 20:00:00
In [25]:
df3 = df_shif(df2, 'weather_time', lag=12)
df3.rename(columns={'weather_time':'Shift_weather_time'}, inplace=True) 
df3.head(13)
Out[25]:
DATE CO(GT) RH T Shift_weather_time
0 NaT NaN NaN NaN 2004-03-10 18:00:00
1 NaT NaN NaN NaN 2004-03-10 19:00:00
2 NaT NaN NaN NaN 2004-03-10 20:00:00
3 NaT NaN NaN NaN 2004-03-10 21:00:00
4 NaT NaN NaN NaN 2004-03-10 22:00:00
5 NaT NaN NaN NaN 2004-03-10 23:00:00
6 NaT NaN NaN NaN 2004-03-11 00:00:00
7 NaT NaN NaN NaN 2004-03-11 01:00:00
8 NaT NaN NaN NaN 2004-03-11 02:00:00
9 NaT NaN NaN NaN 2004-03-11 03:00:00
10 NaT NaN NaN NaN 2004-03-11 04:00:00
11 NaT NaN NaN NaN 2004-03-11 05:00:00
12 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-11 06:00:00
In [26]:
df4 = df_shif(df3, 'RH', lag=12)
df4.rename(columns={'RH':'Shift_RH'}, inplace=True) 
In [27]:
df5 = df_shif(df4, 'T', lag=12)
df5.rename(columns={'T':'Shift_T'}, inplace=True) 

We delete rows with incomplete data.

In [28]:
df5 = df5.dropna(how ='any')
In [29]:
df5.head()
Out[29]:
DATE CO(GT) Shift_RH Shift_T Shift_weather_time
36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00
37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00
38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00
39 2004-03-10 21:00:00 2.2 60.6 10.5 2004-03-11 09:00:00
40 2004-03-10 22:00:00 1.6 58.4 10.8 2004-03-11 10:00:00

The table can be read as pairing the CO concentration measured at 18:00 with the temperature and humidity recorded twelve hours later, at 6:00 the following morning.
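A quick way to confirm this pairing (a sketch, not part of the original notebook) is to check that the weather timestamp is always exactly 12 hours after the CO timestamp:

# Every remaining row should pair a CO reading with weather recorded 12 hours later
print(((df5['Shift_weather_time'] - df5['DATE']) == pd.Timedelta(hours=12)).all())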

Graphical analysis of the relationships – humidity and temperature versus carbon monoxide

The relationships look rather weak.

In [30]:
import matplotlib.pyplot as plt

df5.plot(x='Shift_T', y='CO(GT)', style='o')  
plt.title('Shift_T vs CO(GT)')  
plt.xlabel('Shift_T')  
plt.ylabel('CO(GT)')  
plt.show()
In [31]:
df5.plot(x='Shift_RH', y='CO(GT)', style='o')  
plt.title('Shift_RH vs CO(GT)')  
plt.xlabel('Shift_RH')  
plt.ylabel('CO(GT)')  
plt.show()

Step 5. Building a multiple linear regression model in Sklearn

We declare the X and y variables for the model.

In [32]:
X = df5[['Shift_RH', 'Shift_T']].values
y = df5['CO(GT)'].values

We split the data set into training and test subsets.

In [33]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

We build the regression model.

In [34]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[34]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [35]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of the model's predictions with the actual values.

In [36]:
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
Out[36]:
CO(GT) Actual CO(GT)_Predicted
0 0.5 1.63
1 1.9 1.91
2 3.4 2.40
3 1.2 1.45
4 2.4 2.40
In [37]:
from sklearn import metrics

dfKK.head(50).plot()
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>
In [38]:
from sklearn import metrics

print('Mean Absolute Error:    ', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:     ', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756
In [39]:
print('R2 score:               ', metrics.r2_score(y_test, y_pred))
R2 score:                0.15437562015505324
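Not shown in the original notebook, but the fitted parameters can also be inspected directly on the scikit-learn estimator:

# Coefficients for Shift_RH and Shift_T, plus the intercept of the fitted model
print('Coefficients:', regressor.coef_)
print('Intercept:   ', regressor.intercept_)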

With an R2 of about 0.15, carbon monoxide contamination cannot be usefully predicted from humidity and temperature alone.
In the next part, we will continue the analysis and the preparation of data for linear regression.

Example of the use of shift for linear regression in Python. How to find optimal correlation shift?


What is the correlation shift?

In supervised machine learning we have two directions: classification and regression. Regression needs continuous data, so from time to time we are forced to transform discrete data into continuous values.

More importantly, we have to find the linear correlation between the independent variables and the dependent variable that represents the result.

How to find correlation?

In the natural environment everything is correlated with everything else. Rain causes the level of a lake to rise. Hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.

But to observe it, a simple correlation can be insufficient.

The problem is the shift. Rain contributes to a rise of the water level in a river, but this rise appears only after a couple of hours. Sun makes the water level fall after a couple of days. Frankly speaking, most correlations in any environment come with a longer or shorter delay.
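A minimal synthetic sketch of this effect (made-up data, not part of the tutorial): a series and a delayed response to it look almost uncorrelated until the delay is undone.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cause = pd.Series(rng.normal(size=200))
effect = cause.shift(5) + rng.normal(scale=0.1, size=200)   # responds 5 steps later, plus a little noise

print(cause.corr(effect))            # weak: the delay hides the relationship
print(cause.corr(effect.shift(-5)))  # strong: aligning the series reveals it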

How to find correlation shift?

In [1]:
from scipy import signal, fftpack
import pandas as pd
import numpy

Let’s build this dataframe.

In [2]:
AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
       190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
       355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
       385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
       220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
       385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
       415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
       235, 310, 250, 325, 265, 340, 280, 355]

BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
       150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
       249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
       150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
       168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
       267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
       159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
       177, 222, 186, 231, 195, 240, 204, 249]

CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148,  70, 100,  76, 106,
        82, 112,  88, 118,  94, 124, 100, 130, 106, 136, 112, 142, 118,
       148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
       160, 190, 166, 196, 172,  70, 100,  76, 106,  82, 112,  88, 118,
        94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
       160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
       172, 202, 178,  70, 100,  76, 106,  82, 112,  88, 118,  94, 124,
       100, 130, 106, 136, 112, 142, 118, 148]

DDD = [ 437,  453,  764,  346,  239,  420,  600,  456,  636,  492,  672,
        528,  708,  564,  744,  600,  780,  636,  816,  672,  852,  708,
        888,  744,  924,  780,  960,  816,  996,  852, 1032,  888, 1068,
        924, 1104,  960, 1140,  996, 1176, 1032,  420,  600,  456,  636,
        492,  672,  528,  708,  564,  744,  600,  780,  636,  816,  672,
        852,  708,  888,  744,  924,  780,  960,  816,  996,  852, 1032,
        888, 1068,  924, 1104,  960, 1140,  996, 1176, 1032, 1212, 1068,
        420,  600,  456,  636,  492,  672,  528,  708,  564,  744,  600,
        780,  636,  816,  672,  852,  708,  888,  744,  924,  780,  960]

RESULT = [ 35,  50,  38,  53,  41,  56,  44,  59,  47,  62,  50,  65,  53,
        68,  56,  71,  59,  74,  62,  77,  65,  80,  68,  83,  71,  86,
        74,  89,  77,  92,  80,  95,  83,  98,  86,  35,  50,  38,  53,
        41,  56,  44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,
        74,  62,  77,  65,  80,  68,  83,  71,  86,  74,  89,  77,  92,
        80,  95,  83,  98,  86, 101,  89,  35,  50,  38,  53,  41,  56,
        44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,  74,  62,
        77,  65,  80,  68,  83,  71,  86,  74]


df = pd.DataFrame({'AAA': AAA, 'BBB': BBB,'CCC':CCC,'DDD':DDD, 'RESULT':RESULT})

df.head()
Out[2]:
AAA BBB CCC DDD RESULT
0 295 123 124 437 35
1 370 221 154 453 50
2 310 113 130 764 38
3 385 105 160 346 53
4 325 150 136 239 41

The phenomena described in the DataFrame are perfectly correlated, but we do not know it yet. First we use the ordinary method of searching for correlation.

In [3]:
corr = df.corr()
corr
Out[3]:
AAA BBB CCC DDD RESULT
AAA 1.000000 0.072278 0.715892 0.206945 -0.261955
BBB 0.072278 1.000000 0.244349 0.748050 0.383326
CCC 0.715892 0.244349 1.000000 0.389072 -0.169164
DDD 0.206945 0.748050 0.389072 1.000000 0.248511
RESULT -0.261955 0.383326 -0.169164 0.248511 1.000000
In [4]:
corr['RESULT']
Out[4]:
AAA      -0.261955
BBB       0.383326
CCC      -0.169164
DDD       0.248511
RESULT    1.000000
Name: RESULT, dtype: float64

Is that all? Is that the entire correlation available for linear regression? How do we find the correlation delay?

Function to find optimal correlation shift

I made a special function to detect the shift value that maximises the linear correlation between the dependent and independent variables.

In [5]:
def cross_corr(x, y, lag=0):
    # correlation of x with y shifted down by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x, y, R):
    # R is the number of shifts that should be checked by the function
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]

    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)

    # we pick the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()

    return SF

We declare the variables passed to the function.

In [6]:
x = df.AAA       # independent variable
y = df.RESULT    # dependent variable
R = 20           # number of shifts that will be checked

Shift for variable AAA

We are looking for the optimal correlation shift of variable AAA.

In [13]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)
Optimal shift for AAA:  11

We find that at a shift of 11 rows the correlation between the independent variable AAA and the RESULT variable is strongest (in absolute value). What is the level of this correlation?

In [8]:
cross_corr(x, y, lag=SKO)
Out[8]:
0.9999999999999996

We create a new DataFrame with the optimal shift.

In [9]:
def df_shif(df, target=None, lag=0):
    # returns a copy of df in which every column except `target` is shifted down by `lag` rows
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return pd.DataFrame(data=new)
In [10]:
df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True) 
df2.head(13)
Out[10]:
SHIFTED AAA BBB CCC DDD RESULT
0 295 NaN NaN NaN NaN
1 370 NaN NaN NaN NaN
2 310 NaN NaN NaN NaN
3 385 NaN NaN NaN NaN
4 325 NaN NaN NaN NaN
5 400 NaN NaN NaN NaN
6 340 NaN NaN NaN NaN
7 415 NaN NaN NaN NaN
8 355 NaN NaN NaN NaN
9 430 NaN NaN NaN NaN
10 370 NaN NaN NaN NaN
11 175 123.0 124.0 437.0 35.0
12 250 221.0 154.0 453.0 50.0

Now we repeat this procedure for the rest of the independent variables (a compact loop version is sketched just below).
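The same procedure could also be wrapped in a loop. This is only a sketch reusing shift_Factor from above, not what the notebook actually runs:

# Sketch: align every independent variable with RESULT by its own optimal lag
shifted = pd.DataFrame({'RESULT': df['RESULT']})
for col in ['AAA', 'BBB', 'CCC', 'DDD']:
    lag = shift_Factor(df[col], df['RESULT'], R)
    # shift(-lag) pulls the later predictor values back so that row t holds the value that best matches RESULT at t
    shifted['SHIFTED ' + col] = df[col].shift(-lag)
shifted = shifted.dropna(how='any')
print(shifted.corr()['RESULT'])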

Shift for variable BBB

In [11]:
BBB = df.BBB       # independent variable
In [14]:
SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)
Optimal shift for BBB:  3
In [16]:
df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)

Shift for variable CCC

In [18]:
CCC = df.CCC
In [26]:
SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)
Optimal shift for CCC:  9
In [22]:
df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)

Shift for variable DDD

In [23]:
DDD = df.DDD
In [27]:
PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)
Optimal shift for DDD:  5
In [31]:
df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)

Correlation after making the shifts

I drop the rows of the DataFrame where NaN values appear and calculate the correlation.

In [33]:
df5 = df5.dropna(how='any')
df5.head(3)
Out[33]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
28 175.0 105.0 70.0 420.0 35.0
29 250.0 150.0 100.0 600.0 50.0
30 190.0 114.0 76.0 456.0 38.0
In [34]:
corr = df5.corr()
corr
Out[34]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
SHIFTED AAA 1.0 1.0 1.0 1.0 1.0
SHIFTED BBB 1.0 1.0 1.0 1.0 1.0
SHIFTED CCC 1.0 1.0 1.0 1.0 1.0
SHIFTED DDD 1.0 1.0 1.0 1.0 1.0
RESULT 1.0 1.0 1.0 1.0 1.0
In [35]:
corr['RESULT']
Out[35]:
SHIFTED AAA    1.0
SHIFTED BBB    1.0
SHIFTED CCC    1.0
SHIFTED DDD    1.0
RESULT         1.0
Name: RESULT, dtype: float64

As we can see, the independent variables are perfectly correlated with the result variable. This relationship was hidden because of the shifts between the series.
I hope I have convinced you that researchers should adopt the rule of checking shifts when building models.
