Part 2. Simple multifactorial linear regression
In the previous part of this tutorial, we cleaned the data file from the measuring station. A new, completed measurement data file has been created, which we will now open.
We will now continue preparing the data for further analysis. One of the most important explanatory variables in linear regression is time. Most artificial and natural phenomena follow hourly, daily and monthly cycles.
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Step 1. Creating the time variable
We check what format the date columns have.
df[['Date','Time']].dtypes
There is no date format in the dataframe, so we join the columns containing the date and the time.
df['DATE'] = df['Date']+' '+df['Time']
df['DATE'].head()
We have created a new column containing the date and time. Now we convert it from the object format to the date format.
df['DATE'] = pd.to_datetime(df['DATE'])  # no explicit format is given here, so pandas infers it from the strings
df.dtypes
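A minimal sketch of joining and parsing the two columns on toy data follows. The values and the format string `'%d/%m/%Y %H:%M:%S'` are assumptions made for this illustration; the layout in the station file may differ, and the format has to match it.

```python
import pandas as pd

# Toy data: the values and the day/month/year layout are assumptions for illustration.
toy = pd.DataFrame({'Date': ['10/03/2004', '11/03/2004'],
                    'Time': ['18:00:00', '06:00:00']})

# Join the two text columns, then parse them with an explicit format.
toy['DATE'] = pd.to_datetime(toy['Date'] + ' ' + toy['Time'],
                             format='%d/%m/%Y %H:%M:%S')
print(toy['DATE'].dtype)
```

Passing an explicit format avoids ambiguity between day-first and month-first dates.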
Step 2. We add more columns based on the time variable
In industry, the day of the week is very important, so in such models it is worth adding a column with the number of the day.
df['Month'] = df['DATE'].dt.month
df['Weekday'] = df['DATE'].dt.weekday
df['Weekday_name'] = df['DATE'].dt.day_name()  # dt.weekday_name was removed in pandas 1.0
df['Hours'] = df['DATE'].dt.hour
df[['DATE','Month','Weekday','Weekday_name','Hours']].sample(3)
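To make the numbering convention visible, here is a small sketch on two invented timestamps. It assumes nothing about the station file; it only shows that `dt.weekday` counts from Monday = 0 to Sunday = 6.

```python
import pandas as pd

# Two invented timestamps: a Wednesday evening and a Sunday morning.
stamps = pd.Series(pd.to_datetime(['2004-03-10 18:00:00', '2004-03-14 06:00:00']))

features = pd.DataFrame({
    'Month': stamps.dt.month,
    'Weekday': stamps.dt.weekday,          # Monday = 0 ... Sunday = 6
    'Weekday_name': stamps.dt.day_name(),  # English day names by default
    'Hours': stamps.dt.hour,
})
print(features)
```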
Graphical analysis of pollution according to time variables
df.pivot_table(index='Weekday_name', values='CO(GT)', aggfunc='mean').plot(kind='bar')
df.pivot_table(index='Month', values='CO(GT)', aggfunc='mean').plot(kind='bar')
df.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean').plot(kind='bar')
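Each of the charts above plots a group mean. As a rough sketch on invented numbers, the same table can be obtained with `groupby`, which some readers may find easier to read than `pivot_table`.

```python
import pandas as pd

# Invented readings for two hours of the day.
toy = pd.DataFrame({'Hours': [6, 6, 18, 18],
                    'CO(GT)': [1.0, 2.0, 3.0, 5.0]})

# Mean CO per hour, computed in two equivalent ways.
by_pivot = toy.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean')
by_group = toy.groupby('Hours')['CO(GT)'].mean()
print(by_pivot)
print(by_group)
```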
Step 3. Correlation analysis
We set the result variable as:
CO(GT) – actual hourly average CO concentration in mg / m^3 (reference analyzer)
del df['Unnamed: 0']
CORREL = df.corr(numeric_only=True)  # skip text and date columns (required in pandas 2.0 and later)
PKP = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PKP
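The ranking step can be sketched on a toy frame. The column names mirror the tutorial, but the values are invented so that the expected correlations are obvious. `numeric_only=True` is used here so that the text column is ignored.

```python
import pandas as pd

toy = pd.DataFrame({
    'CO(GT)': [1.0, 2.0, 3.0, 4.0],
    'T': [2.0, 4.0, 6.0, 8.0],                     # rises with CO, so r = 1
    'RH': [8.0, 6.0, 4.0, 2.0],                    # falls with CO, so r = -1
    'Weekday_name': ['Mon', 'Tue', 'Wed', 'Thu'],  # text column, left out
})

# Correlation of every numeric column with CO(GT), sorted from lowest to highest.
ranked = toy.corr(numeric_only=True)['CO(GT)'].sort_values()
print(ranked)
```

A strongly negative value is as informative as a strongly positive one, which is why the absolute value matters when variables are compared.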
import matplotlib.pyplot as plt
PKP.plot(kind='barh', color='red', figsize=(10,8))  # DataFrame.plot opens its own figure, so the size is set here
plt.title('Correlation with the resulting variable: CO(GT)', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Time-based variables are not very well correlated with the dependent variable CO(GT).
It is tempting to use other, better correlated independent variables for the model. The problem is that these variables may themselves be part of the result: when the air is polluted, all of these substances are present in it together.
Our task is to examine how weather and time affect the level of pollution. We’ll cover this task in the next part of the tutorial.
Step 4. We now check the shift for independent variables with low direct correlation
How does weather affect CO levels?
def cross_corr(x, y, lag=0):
    # Correlation between x and y shifted by `lag` rows
    return x.corr(y.shift(lag))

def shift_Factor(x, y, R):
    # R is the number of shifts (0 to R-1) checked by the function
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)
    # Find the shift with the largest absolute correlation
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()
    return SF
x = df.RH # independent variable
y = df['CO(GT)'] # dependent variable
R = 20 # number of shifts that will be checked
SKO = shift_Factor(x,y,R)
print('Optimal shift for RH: ',SKO)
cross_corr(x, y, lag=SKO)
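To see what the shift search recovers, the sketch below runs it on synthetic data with a known offset of 5 rows. The helper `best_shift` is a hypothetical, compact variant of `shift_Factor` written for this example; on ties it returns the first lag rather than the last. The random series is invented and has nothing to do with the station data.

```python
import numpy as np
import pandas as pd

def cross_corr(x, y, lag=0):
    # Correlation between x[t] and y[t - lag]
    return x.corr(y.shift(lag))

def best_shift(x, y, n_shifts):
    # Hypothetical helper: lag in 0..n_shifts-1 with the largest |correlation|
    corrs = pd.Series([cross_corr(x, y, lag=i) for i in range(n_shifts)])
    return int(corrs.abs().idxmax())

rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = x.shift(-5)   # y runs 5 rows ahead of x, so y.shift(5) lines up with x

lag = best_shift(x, y, 20)
print(lag, cross_corr(x, y, lag))
```

Note the direction: `y.shift(lag)` moves `y` down by `lag` rows, so a positive result means that `x` at a given row matches `y` from `lag` rows earlier.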
Variable AH – Absolute humidity
We check a variable with very low correlation with the resulting CO (GT) variable
x = df.AH # independent variable
SKP = shift_Factor(x,y,R)
print('Optimal shift for AH: ',SKP)
cross_corr(x, y, lag=SKP)
Absolute humidity (AH) does not correlate with the variable CO(GT), so we eliminate it from the model.
Variable: Temperature in °C
We check a variable with very low correlation with the resulting CO (GT) variable.
x = df['T'] # independent variable
PKP = shift_Factor(x,y,R)
print('Optimal shift for T: ',PKP)
cross_corr(x, y, lag=PKP)
We are now creating a new DataFrame with a 12-hour shift
It turns out that temperature and humidity correlate with CO only about 12 hours after the CO contamination changes.
The function below creates the shifted data.
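def df_shif(df, target=None, lag=0):
    # Return a copy of df in which every column except `target` is shifted by `lag` rows
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]                 # the target column stays in place
        else:
            new[h] = df[h].shift(periods=lag)   # all other columns move down
    return pd.DataFrame(data=new)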
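The behaviour of the function is easiest to check on a tiny frame. The sketch below repeats the same logic in a self-contained form and applies it once and then twice. The column names and values are invented.

```python
import pandas as pd

def df_shif(df, target=None, lag=0):
    # Same logic as the function above: keep `target`, shift everything else.
    if not lag and not target:
        return df
    new = {}
    for h in df.columns:
        new[h] = df[h] if h == target else df[h].shift(periods=lag)
    return pd.DataFrame(data=new)

toy = pd.DataFrame({'CO': [1, 2, 3, 4], 'T': [10, 20, 30, 40]})

once = df_shif(toy, 'T', lag=1)    # CO moves down one row, T stays
twice = df_shif(once, 'T', lag=1)  # CO moves down one more row
print(once)
print(twice)
```

Every column other than `target` is moved, so each further call shifts the non-target columns again. It is worth inspecting the first rows after chained calls to confirm that the rows pair up as intended.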
Our goal is to create a multiple regression model:
- Independent variables are: Temperature (T) and Relative humidity RH
- The dependent variable is the level of CO (GT)
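df2 = df[['DATE', 'CO(GT)','RH', 'T']].copy()  # copy so that a new column can be added without a warning
We add a date-and-time column that records when the temperature and humidity were measured.
df2['weather_time'] = df2['DATE']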
df2.head(3)
df3 = df_shif(df2, 'weather_time', lag=12)
df3.rename(columns={'weather_time':'Shift_weather_time'}, inplace=True)
df3.head(13)
df4 = df_shif(df3, 'RH', lag=12)
df4.rename(columns={'RH':'Shift_RH'}, inplace=True)
df5 = df_shif(df4, 'T', lag=12)
df5.rename(columns={'T':'Shift_T'}, inplace=True)
We delete rows with incomplete data.
df5 = df5.dropna(how ='any')
df5.head()
The table can be read as follows: a specific temperature at 6:00 corresponds to a specific concentration of carbon monoxide at 18:00.
Graphical analysis of relationships – Humidity and temperature to carbon monoxide
The relationships look rather weak.
import matplotlib.pyplot as plt
df5.plot(x='Shift_T', y='CO(GT)', style='o')
plt.title('Shift_T vs CO(GT)')
plt.xlabel('Shift_T')
plt.ylabel('CO(GT)')
plt.show()
df5.plot(x='Shift_RH', y='CO(GT)', style='o')
plt.title('Shift_RH vs CO(GT)')
plt.xlabel('Shift_RH')
plt.ylabel('CO(GT)')
plt.show()
Step 5. Building a multiple linear regression model in Sklearn
We declare the X and y variables for the model.
X = df5[['Shift_RH', 'Shift_T']].values
y = df5['CO(GT)'].values
We split the data set into a training set and a test set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
We build the regression model.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
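After fitting, the model's coefficients can be inspected through the `coef_` and `intercept_` attributes. The sketch below uses synthetic data that follows y = 2·x1 − 1·x2 + 0.5 exactly, so the recovered values are known in advance. The data and the name `model` are assumptions for this example only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from y = 2*x1 - 1*x2 + 0.5 with no noise.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y_toy = 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1] + 0.5

model = LinearRegression().fit(X_toy, y_toy)
print('coefficients:', model.coef_)   # one value per column of X_toy
print('intercept:', model.intercept_)
```

For the tutorial's model, the two coefficients would correspond to `Shift_RH` and `Shift_T`, in the order of the columns passed to `fit`.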
import numpy as np
y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
Comparison of the values predicted by the model with the actual values.
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
from sklearn import metrics
dfKK.head(50).plot()
print('Mean Absolute Error: ', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error: ', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R-squared: ', metrics.r2_score(y_test, y_pred))
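The metrics printed above follow simple formulas. As a sketch, they can be computed by hand with NumPy on four invented values; the arrays `y_true` and `y_hat` are made up for this illustration.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # invented actual values
y_hat = np.array([1.5, 2.0, 2.5, 5.0])    # invented predictions

err = y_true - y_hat
mae = np.mean(np.abs(err))                 # mean absolute error
mse = np.mean(err ** 2)                    # mean squared error
rmse = np.sqrt(mse)                        # root mean squared error

ss_res = np.sum(err ** 2)                          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
r2 = 1.0 - ss_res / ss_tot                         # coefficient of determination

print(mae, mse, rmse, r2)
```

R-squared compares the model with a baseline that always predicts the mean. A value near 0 means the model is no better than that baseline, and a negative value means it is worse.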
With this model, carbon monoxide contamination cannot be predicted reliably from humidity and temperature alone.
In the next part, we will continue the analysis and preparation of data for linear regression.