In the previous part, we tried to build a model that explains the level of carbon monoxide pollution from temperature and pressure. Despite the use of shift, the model proved to be ineffective. We will return to that model later. In the meantime, we will build a linear regression model in Sklearn and TensorFlow.
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled2.csv')
df.head(3)
del df['Unnamed: 0']
Checking the correlation with the result variable
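We will take carbon monoxide pollution, the CO(GT) column, as the result variable.
# Correlate numeric columns only; text and date columns are skipped
CORREL = df.select_dtypes(include='number').corr().sort_values('CO(GT)')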
PCK = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PCK
import matplotlib.pyplot as plt
# Pass figsize to plot(); a separate plt.figure() call would only open an unused empty figure
PCK.plot(kind='barh', color='#6d9eeb', figsize=(10,8))
plt.title('Correlation with the resulting variable: CO', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Part one: Linear regression in Sklearn
We declare the X and y variables for the model. The set of explanatory variables must not include columns in text or date format.
df.columns
X = df[['PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
,'Month','Weekday','Hours']].values
y = df['CO(GT)'].values
I split the dataset into a training set and a test set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
I am building a regression model.
regressor = LinearRegression()
regressor.fit(X_train, y_train)
import numpy as np
y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
Comparing the values predicted by the model with the actual values.
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
dfKK.to_csv('c:/TF/AAr2.csv')
from sklearn import metrics
dfKK.head(50).plot()
print('R Square: ', metrics.r2_score(y_test, y_pred))
The R Square value shows that the model fits the empirical data almost perfectly.
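R Square and the mean squared error are different metrics and are easy to mix up. As a minimal sketch, the snippet below prints both side by side with sklearn.metrics. The arrays y_true_demo and y_hat_demo are made-up values for illustration only and are not taken from the air quality data.

```python
import numpy as np
from sklearn import metrics

# Hypothetical values for illustration only, not taken from AirQ_filled2.csv
y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
y_hat_demo = np.array([1.1, 1.9, 3.2, 3.8])

# Mean squared error: average of the squared differences, in squared units of y
mse_demo = metrics.mean_squared_error(y_true_demo, y_hat_demo)

# R Square: share of the variation in y explained by the predictions, unitless
r2_demo = metrics.r2_score(y_true_demo, y_hat_demo)

print('Mean Squared Error: ', mse_demo)
print('R Square: ', r2_demo)
```

The mean squared error depends on the scale of the target, while R Square does not, so the two numbers should be labelled separately when both are reported.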
How to calculate the R Square parameter?
In the Sklearn package, the R Square parameter is easy to obtain with a single function call. This is not always the case; in the TensorFlow library, for example, calculating this indicator takes more work. Therefore, we will now calculate the R Square index by hand.
It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.
We will now calculate it step by step for our linear regression example.
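For reference, the quantities computed in the five points that follow can be written out as below. This is the standard definition of the coefficient of determination. The symbols $y_i$ (actual value), $\hat{y}_i$ (predicted value), $\bar{y}$ (mean of the actual values) and $n$ (number of test rows) are notation introduced here for illustration.

```latex
\begin{aligned}
\mathrm{SSE} &= \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \\
\mathrm{SST} &= \sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2 \\
R^2 &= \frac{\mathrm{SST} - \mathrm{SSE}}{\mathrm{SST}} = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
\end{aligned}
```

In the code, the per-row terms are stored in the 'SSE' and 'SST' columns of dfKK, and the sums are taken in Point 4.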
Point 1. We calculate the squared error
The squared error is the difference between the actual value and the estimate, raised to the second power.
dfKK.head(5)
dfKK['SSE'] = (dfKK['CO(GT) Actual'] - dfKK['CO(GT)_Predicted'])**2
dfKK.head(3)
Point 2. We calculate the average empirical value of y
dfKK['ave_y'] = dfKK['CO(GT) Actual'].mean()
dfKK.head(3)
Point 3. We calculate the squared difference between each empirical value of y and the average of the empirical values of y
dfKK['SST'] = (dfKK['CO(GT) Actual'] - dfKK['ave_y'])**2
dfKK.head(3)
Point 4. We calculate the difference between the sum of SST and the sum of SSE
Sum_SST = dfKK['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = dfKK['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Point 5. We calculate the R Square parameter
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
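The five points above can also be collected into one small helper. This is a minimal sketch in plain Python with no library dependencies. The function name r_squared and the sample numbers are hypothetical, chosen for illustration, and are not part of the air quality data.

```python
def r_squared(actual, predicted):
    """Return R Square computed as (SST - SSE) / SST."""
    actual = list(actual)
    predicted = list(predicted)
    if not actual or len(actual) != len(predicted):
        raise ValueError('actual and predicted must be non-empty and of equal length')

    # Point 1: sum of squared errors between actual and predicted values
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))

    # Point 2: average of the actual values
    mean_y = sum(actual) / len(actual)

    # Point 3: sum of squared differences between actual values and their average
    sst = sum((a - mean_y) ** 2 for a in actual)
    if sst == 0:
        raise ValueError('R Square is undefined when all actual values are equal')

    # Points 4 and 5: explained part divided by total variation
    return (sst - sse) / sst


# Hypothetical sample values for illustration only
sample_actual = [1.0, 2.0, 3.0, 4.0]
sample_predicted = [1.1, 1.9, 3.2, 3.8]
print('R Square parameter: ', r_squared(sample_actual, sample_predicted))
```

With the DataFrame built earlier, the same helper could be called as r_squared(dfKK['CO(GT) Actual'], dfKK['CO(GT)_Predicted']), and the result should match the value printed in Point 5.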