Improving the linear regression model using synthetic variables Part 1

EN031220191208
 

Parking Birmingham occupancy analysis

Source of data: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham

In [1]:
import pandas as pd

df = pd.read_csv('c:/TF/ParkingBirmingham.csv')
df.head(3)
Out[1]:
  SystemCodeNumber Capacity Occupancy LastUpdated
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42
 

Which parking lots do we have in Birmingham?

In [2]:
df['SystemCodeNumber'].value_counts()
Out[2]:
BHMNCPHST01         1312
Broad Street        1312
BHMBCCMKT01         1312
BHMEURBRD01         1312
Shopping            1312
Others-CCCPS119a    1312
Others-CCCPS105a    1312
BHMNCPNST01         1312
Others-CCCPS98      1312
BHMMBMMBX01         1312
Others-CCCPS135a    1312
Others-CCCPS202     1312
BHMBCCTHL01         1312
Others-CCCPS8       1312
BHMBCCSNH01         1294
Others-CCCPS133     1294
BHMNCPLDH01         1292
BHMNCPPLS01         1291
BHMBCCPST01         1276
BHMEURBRD02         1276
NIA South           1204
NIA Car Parks       1204
BHMNCPRAN01         1186
BHMBRCBRG02         1186
BHMBRCBRG03         1186
Bull Ring           1186
BHMBRCBRG01         1186
BHMNCPNHS01         1038
NIA North            162
BHMBRTARC01           88
Name: SystemCodeNumber, dtype: int64
 

Checking the completeness of the data

In [3]:
df.isnull().sum()
Out[3]:
SystemCodeNumber    0
Capacity            0
Occupancy           0
LastUpdated         0
dtype: int64
 

Checking data type

In [4]:
df.dtypes
Out[4]:
SystemCodeNumber    object
Capacity             int64
Occupancy            int64
LastUpdated         object
dtype: object
 

We need a date-related variable, because time probably explains parking occupancy best.

In [5]:
df.LastUpdated = pd.to_datetime(df.LastUpdated)
df.dtypes
Out[5]:
SystemCodeNumber            object
Capacity                     int64
Occupancy                    int64
LastUpdated         datetime64[ns]
dtype: object
 

I create date-related data.

In [6]:
df['month'] = df.LastUpdated.dt.month
df['hour'] = df.LastUpdated.dt.hour
df['weekday_name'] = df.LastUpdated.dt.weekday_name
df['weekday'] = df.LastUpdated.dt.weekday
In [7]:
df.head(4)
Out[7]:
  SystemCodeNumber Capacity Occupancy LastUpdated month hour weekday_name weekday
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42 10 7 Tuesday 1
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42 10 8 Tuesday 1
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42 10 8 Tuesday 1
3 BHMBCCMKT01 577 107 2016-10-04 09:32:46 10 9 Tuesday 1
 

I choose a parking lot for the analysis.

In [8]:
# BHMBCCMKT01
# BHMNCPNST01
# BHMBCCTHL01 (48)
# BHMMBMMBX01 (34)
# BHMBCCSNH01
df2 = df.loc[df['SystemCodeNumber']=='BHMBCCTHL01'] 
df2.shape
Out[8]:
(1312, 8)
 

Sklearn linear regression model

In [9]:
X = df2[['month', 'hour', 'weekday'] ].values
y = df2['Occupancy'].values
In [10]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[10]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [11]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [12]:
from sklearn import metrics

print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.47651714547905455
 

The R Square value for the sklearn linear regression model is a little better, but the model is still very weak.

Improving the linear regression model

Let’s start by checking the correlation with the dependent variable.

In [13]:
CORREL = df2.corr().sort_values('Occupancy')
CORREL['Occupancy'].to_frame().sort_values('Occupancy')
Out[13]:
  Occupancy
weekday -0.027324
month 0.314196
hour 0.662273
Occupancy 1.000000
Capacity NaN
 

One of the basic conditions for building a linear regression model is a linear relationship between the dependent and independent variables.
When we are in the city and want to park, the hour is very important: it is difficult to park at peak times, yet at the same hour on Saturday there are no cars. A given hour on Sunday is therefore not the same as the same hour on Wednesday.
So we create a synthetic variable from the combination of weekday and hour.

We create a synthetic variable: Hour on Weekday (HoW)

In [14]:
df2['combined_col'] = df2[['weekday_name', 'hour']].astype(str).apply('-'.join, axis=1)
df2['HoW'] = pd.Categorical(df2['combined_col']).codes
df2['HoW'].head(3)
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
(the same warning is raised again for the second assignment)
Out[14]:
3882    57
3883    58
3884    58
Name: HoW, dtype: int8
 

We’ve created a synthetic variable that is a column containing the coded combination of day of the week and time. We can easily read this code.
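A side note: the SettingWithCopyWarning above appears because df2 is a slice of df. A minimal sketch of the same transformation done on an explicit copy, which avoids the warning (assuming the df loaded in the cells above):

# Take an explicit copy of the slice so later assignments do not touch df
df2 = df.loc[df['SystemCodeNumber'] == 'BHMBCCTHL01'].copy()

# Hour on Weekday: one combined label per (weekday, hour) pair, then integer codes
df2['combined_col'] = df2['weekday_name'].astype(str) + '-' + df2['hour'].astype(str)
df2['HoW'] = pd.Categorical(df2['combined_col']).codes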

In [15]:
df2[['HoW', 'weekday_name', 'hour']].head(4)
Out[15]:
  HoW weekday_name hour
3882 57 Tuesday 7
3883 58 Tuesday 8
3884 58 Tuesday 8
3885 59 Tuesday 9
In [16]:
CORREL = df2.corr().sort_values('Occupancy')
CORREL['Occupancy'].to_frame().sort_values('Occupancy')
Out[16]:
  Occupancy
weekday -0.027324
HoW 0.059408
month 0.314196
hour 0.662273
Occupancy 1.000000
Capacity NaN
 

Now I am adding the new synthetic variable to the sklearn linear regression model.

In [17]:
X = df2[['month', 'hour', 'weekday','HoW'] ].values
y = df2['Occupancy'].values
In [18]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[18]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [19]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [20]:
from sklearn import metrics

print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.48074077157048334
 

As we can see, the introduction of a synthetic variable slightly improved the quality of the model.
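One likely reason the gain is so small is that HoW is an arbitrary integer code, so the linear model can fit only a single slope to it. A hedged sketch of one-hot encoding the weekday-hour combination instead (an extension, not part of the original analysis):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Replace the integer HoW code with one dummy column per weekday-hour combination
X = pd.concat([df2[['month']], pd.get_dummies(df2['combined_col'])], axis=1).values
y = df2['Occupancy'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
print('R Square:', metrics.r2_score(y_test, regressor.predict(X_test)))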


Polynomial regression model

The polynomial regression model is technically a special case of multiple linear regression. The definition and explanation below come from Wikipedia: https://en.wikipedia.org/wiki/Polynomial_regression; I quote them here to introduce the topic of polynomial regression.

The polynomial regression model can be used in any linear equation, both for linear regression and for linear classification.

The goal of regression analysis is to model the expected value of a dependent variable y in terms of the value of an independent variable (or vector of independent variables) x. In simple linear regression, the model is used, where ε is an unobserved random error with mean zero conditioned on a scalar variable x. In this model, for each unit increase in the value of x, the conditional expectation of y increases by β1 units.

$y = \beta_0 + \beta_1 x + \varepsilon$

In many settings, such a linear relationship may not hold. For example, if we are modeling the yield of a chemical synthesis in terms of the temperature at which the synthesis takes place, we may find that the yield improves by increasing amounts for each unit increase in temperature. In this case, we might propose a quadratic model of the form

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$

In general, we can model the expected value of y as an nth degree polynomial, yielding the general polynomial regression model:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \varepsilon$

Conveniently, these models are all linear from the point of view of estimation, since the regression function is linear in terms of the unknown parameters β0, β1, …. Therefore, for least squares analysis, the computational and inferential problems of polynomial regression can be completely addressed using the techniques of multiple regression. This is done by treating x, x2, … as being distinct independent variables in a multiple regression model.

Although polynomial regression is technically a special case of multiple linear regression, the interpretation of a fitted polynomial regression model requires a somewhat different perspective. It is often difficult to interpret the individual coefficients in a polynomial regression fit, since the underlying monomials can be highly correlated. For example, x and x2 have correlation around 0.97 when x is uniformly distributed on the interval (0, 1). Although the correlation can be reduced by using orthogonal polynomials, it is generally more informative to consider the fitted regression function as a whole.

Source: https://en.wikipedia.org/wiki/Polynomial_regression
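The quoted correlation between x and x² is easy to verify numerically; a minimal sketch (the exact number varies slightly with the random sample):

import numpy as np

np.random.seed(0)
x = np.random.uniform(0, 1, 100000)   # x uniformly distributed on (0, 1)
print(np.corrcoef(x, x**2)[0, 1])     # correlation of x and x^2, roughly 0.97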

Simply put, some phenomena are seasonal or curved over time. Such phenomena can still be described with a linear function, but the regression equation will be inefficient and the model imprecise.
The ordinary linear regression equation:

$y = \beta_0 + \beta_1 x + \varepsilon$

[plot: linear fit]

The use of a quadratic variable in a regression equation can greatly improve model properties:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$

[plot: quadratic fit]

Source of plots: https://www.guru99.com/linear-classifier-tensorflow.html

Polynomial regression model in practice

Recently, I described a case on my blog where I had to predict CO air pollution. The explanatory variables were temperature and humidity. Despite using a time shift, it was not possible to build an effective model (R Square was 0.15).

I saved the data from that study: AirQ_shift

In [10]:
import pandas as pd

df = pd.read_csv('c:/TF/AirQ_shift.csv')
df.head(3)
Out[10]:
Unnamed: 0 DATE CO(GT) Shift_RH Shift_T Shift_weather_time
0 36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00
1 37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00
2 38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00

Linear regression model:

$y = \beta_0 + \beta_1 x + \varepsilon$

Declare the X and y variables for the model.

In [13]:
X = df[['Shift_RH', 'Shift_T']].values
y = df['CO(GT)'].values

I split the data into a training set and a test set.

In [14]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

In [15]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [16]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [19]:
from sklearn import metrics

print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.15437562015505324

Polynomial regression model:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 z + \beta_4 z^2 + \varepsilon$

To build the polynomial regression model, you need to square the independent variables: temperature and relative humidity.
These squared variables will be treated as separate variables in the model.

So I create two new variables.

In [27]:
df['Shift_RH^2']=df['Shift_RH']**2
df['Shift_T^2']=df['Shift_T']**2
In [28]:
df.head(3)
Out[28]:
Unnamed: 0 DATE CO(GT) Shift_RH Shift_T Shift_weather_time Shift_RH^2 Shift_T^2
0 36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00 3375.61 110.25
1 37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00 3552.16 104.04
2 38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00 3294.76 116.64

We create the polynomial model:

Declare the new X and y variables for the model.
In [31]:
X = df[['Shift_RH', 'Shift_T', 'Shift_RH^2', 'Shift_T^2']].values
y = df['CO(GT)'].values
In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[32]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [33]:
y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [34]:
print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.15892318270410333

The change is negligible; econometric tools do not always improve the model significantly. What counts above all is knowing how to apply the technique.
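For completeness, sklearn can generate the polynomial terms automatically with PolynomialFeatures; a minimal sketch assuming the same df as above (degree=2 also adds the Shift_RH*Shift_T interaction, so the score may differ slightly):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = df[['Shift_RH', 'Shift_T']].values
y = df['CO(GT)'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# degree=2 generates Shift_RH^2, Shift_T^2 and the Shift_RH*Shift_T interaction term
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X_train, y_train)
print('R Square:', metrics.r2_score(y_test, model.predict(X_test)))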

Tutorial: Linear Regression – Sklearn regression model (#3/281120191333)

In the previous part, we tried to build a model explaining the level of carbon monoxide pollution based on temperature and humidity. Despite the use of a time shift, the model proved to be inefficient. We will return to this model later. Meanwhile, we will now build a linear regression model in Sklearn and Tensorflow.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled2.csv')
df.head(3)
Out[1]:
Unnamed: 0 Unnamed: 0.1 Unnamed: 0.1.1 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH DATE Month Weekday Weekday_name Hours
0 0 0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1692.0 1268.0 13.6 48.9 0.7578 2004-03-10 18:00:00 3 2 Wednesday 18
1 1 1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1559.0 972.0 13.3 47.7 0.7255 2004-03-10 19:00:00 3 2 Wednesday 19
2 2 2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1555.0 1074.0 11.9 54.0 0.7502 2004-03-10 20:00:00 3 2 Wednesday 20

3 rows × 22 columns

In [2]:
del df['Unnamed: 0']

Checking the correlation with the result variable

We will consider carbon monoxide CO pollution as the result variable

In [3]:
CORREL = df.corr().sort_values('CO(GT)')
PCK = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PCK
Out[3]:
CO(GT)
PT08.S3(NOx) -0.715683
Weekday -0.140231
RH 0.020122
AH 0.025227
T 0.025639
Unnamed: 0.1 0.042099
Unnamed: 0.1.1 0.042099
Month 0.112291
Hours 0.344071
PT08.S4(NO2) 0.631854
NO2(GT) 0.682774
NOx(GT) 0.773677
PT08.S5(O3) 0.858762
PT08.S1(CO) 0.886114
PT08.S2(NMHC) 0.918386
C6H6(GT) 0.932584
CO(GT) 1.000000
In [4]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
PCK.plot(kind='barh', color='#6d9eeb')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[4]:
Text(0, 0.5, 'Continuous independent variables')

Part one: Linear regression in Sklearn

Declare the X and y variables for the model. Do not include columns in text or date format in the set of explanatory variables.

In [5]:
df.columns
Out[5]:
Index(['Unnamed: 0.1', 'Unnamed: 0.1.1', 'Date', 'Time', 'CO(GT)',
       'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)',
       'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'DATE',
       'Month', 'Weekday', 'Weekday_name', 'Hours'],
      dtype='object')
In [6]:
X = df[['PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
        ,'Month','Weekday','Hours']].values
y = df['CO(GT)'].values

I split the data into a training set and a test set.

In [7]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

In [8]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[8]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [9]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of the model's predictions with the actual values.

In [10]:
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
Out[10]:
CO(GT) Actual CO(GT)_Predicted
0 1.3 1.18
1 0.6 0.76
2 0.4 0.45
3 0.5 0.44
4 0.8 1.08
In [11]:
dfKK.to_csv('c:/TF/AAr2.csv')
In [12]:
from sklearn import metrics

dfKK.head(50).plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x215167771d0>
In [13]:
print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.9202224478702138

The R Square parameter shows that the model fits the empirical data almost perfectly.

How to calculate the R Square parameter?

In the Sklearn package, the R-Square parameter is calculated easily and pleasantly. This is not always the case; for example, in the TensorFlow library the calculation of this indicator is difficult. Therefore, we will now calculate the R-Square index by hand.

R Square (the coefficient of determination) is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

We will now calculate this difficult formula step by step for our example of linear regression.

Point 1. We calculate the square error

The squared error is the difference between the actual value and the prediction, raised to the second power.

In [14]:
dfKK.head(5)
Out[14]:
CO(GT) Actual CO(GT)_Predicted
0 1.3 1.18
1 0.6 0.76
2 0.4 0.45
3 0.5 0.44
4 0.8 1.08
In [15]:
dfKK['SSE'] = (dfKK['CO(GT) Actual'] - dfKK['CO(GT)_Predicted'])**2
dfKK.head(3)
Out[15]:
CO(GT) Actual CO(GT)_Predicted SSE
0 1.3 1.18 0.0144
1 0.6 0.76 0.0256
2 0.4 0.45 0.0025

Point 2. We calculate the average empirical value of y

In [16]:
dfKK['ave_y'] = dfKK['CO(GT) Actual'].mean()
dfKK.head(3)
Out[16]:
CO(GT) Actual CO(GT)_Predicted SSE ave_y
0 1.3 1.18 0.0144 2.086004
1 0.6 0.76 0.0256 2.086004
2 0.4 0.45 0.0025 2.086004

Point 3. We calculate the squared difference between the empirical values of y and the average of the empirical values of y

In [17]:
dfKK['SST'] = (dfKK['CO(GT) Actual'] - dfKK['ave_y'])**2
dfKK.head(3)
Out[17]:
CO(GT) Actual CO(GT)_Predicted SSE ave_y SST
0 1.3 1.18 0.0144 2.086004 0.617803
1 0.6 0.76 0.0256 2.086004 2.208209
2 0.4 0.45 0.0025 2.086004 2.842610

Point 4. We calculate the difference between sum of SST and sum of SSE

In [18]:
Sum_SST = dfKK['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = dfKK['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3829.9533119658117
Sum_SSE : 305.5443

Point 5. We calculate the R Square parameter

In [20]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.9202224478702138
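The five points above can be wrapped in one small helper; a minimal sketch using NumPy (assuming the dfKK frame from above):

import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - sse / sst

print('R Square parameter: ', r_squared(dfKK['CO(GT) Actual'], dfKK['CO(GT)_Predicted']))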

Tutorial: Linear Regression – Tensorflow, calculation of R Square (#4/281120191525)

We continue to learn how to build multiple linear regression models. This time we will build a model using the Tensorflow library. As before, the data file: AirQ_filled2.csv comes from previous episodes of this cycle.

In [1]:
import tensorflow as tf
import pandas as pd

df = pd.read_csv('c:/TF/AirQ_filled2.csv', usecols=['CO(GT)','PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
        ,'Month','Weekday','Hours'])
df.head(3)

Out[1]:
CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Month Weekday Hours
0 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 3 2 18
1 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 3 2 19
2 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 3 2 20

Step 1: Convert Data

We convert the numeric variables into the correct Tensorflow format. Tensorflow provides a conversion method for continuous variables: tf.feature_column.numeric_column().

We separate the columns into independent variables and the dependent variable.

In [2]:
df.columns
Out[2]:
Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'Month', 'Weekday', 'Hours'],
      dtype='object')
In [3]:
df.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
In [4]:
df.dtypes
Out[4]:
CO_GT           float64
PT08.S1_CO      float64
C6H6_GT         float64
PT08.S2_NMHC    float64
NOx_GT          float64
PT08.S3_NOx     float64
NO2_GT          float64
PT08.S4_NO2     float64
PT08.S5_O3      float64
T               float64
RH              float64
AH              float64
Month             int64
Weekday           int64
Hours             int64
dtype: object
In [5]:
FEATURES = ['PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL = 'CO_GT'
In [6]:
PKS = [tf.feature_column.numeric_column(k) for k in FEATURES]
PKS
Out[6]:
[_NumericColumn(key='PT08.S1_CO', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='C6H6_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S2_NMHC', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NOx_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S3_NOx', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NO2_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S4_NO2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S5_O3', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='T', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='RH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='AH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Hours', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 2: Defining the estimator

Tensorflow will automatically create a directory called "Air" in your working directory. You must use this path to access Tensorboard. The estimator is given the feature columns for the independent variables.

In [7]:
estimator = tf.estimator.LinearRegressor(    
        feature_columns=PKS,   
        model_dir="Air")
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'Air', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000017E850F7CC0>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To instruct Tensorflow how to feed the model, you can use pandas_input_fn. This object needs 5 parameters:
- x: feature data
- y: label data
- batch_size: batch size, default 128
- num_epochs: number of epochs, default 1
- shuffle: whether or not to shuffle the data, default None

In [8]:
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True):    
         return tf.estimator.inputs.pandas_input_fn(       
         x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),       
         y = pd.Series(data_set[LABEL].values),       
         batch_size=n_batch,          
         num_epochs=num_epochs,       
         shuffle=shuffle)

Step 3: Model training

- To feed the model you can use the function created above: get_input_fn.
- Then you instruct the model to iterate 1000 times.
- Remember that you do not specify the number of epochs (num_epochs).
- It is better to set the number of epochs to none and define the number of iterations.

To test the model, we must divide the data set into a test set and a training set.

In [9]:
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
(7486, 15) (1871, 15)
In [10]:
estimator.train(input_fn=get_input_fn(df_train,                                       
                                           num_epochs=None,                                      
                                           n_batch = 128,                                      
                                           shuffle=False),                                      
                                           steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from Air\model.ckpt-10000
INFO:tensorflow:Saving checkpoints for 10001 into Air\model.ckpt.
INFO:tensorflow:loss = 27.90989, step = 10001
INFO:tensorflow:global_step/sec: 231.067
INFO:tensorflow:loss = 19.266008, step = 10101 (0.443 sec)
INFO:tensorflow:global_step/sec: 250.047
INFO:tensorflow:loss = 21.174185, step = 10201 (0.389 sec)
INFO:tensorflow:global_step/sec: 244.378
INFO:tensorflow:loss = 26.823406, step = 10301 (0.409 sec)
INFO:tensorflow:global_step/sec: 263.037
INFO:tensorflow:loss = 16.690845, step = 10401 (0.380 sec)
INFO:tensorflow:global_step/sec: 250.698
INFO:tensorflow:loss = 24.08421, step = 10501 (0.399 sec)
INFO:tensorflow:global_step/sec: 254.447
INFO:tensorflow:loss = 16.630123, step = 10601 (0.406 sec)
INFO:tensorflow:global_step/sec: 248.812
INFO:tensorflow:loss = 25.998842, step = 10701 (0.389 sec)
INFO:tensorflow:global_step/sec: 269.371
INFO:tensorflow:loss = 31.432064, step = 10801 (0.387 sec)
INFO:tensorflow:global_step/sec: 255.634
INFO:tensorflow:loss = 22.70269, step = 10901 (0.391 sec)
INFO:tensorflow:Saving checkpoints for 11000 into Air\model.ckpt.
INFO:tensorflow:Loss for final step: 24.21025.
Out[10]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x17e850f7828>

Step 4. Model evaluation

To evaluate the model on the test set, use the following code:

In [11]:
ev = estimator.evaluate(    
          input_fn=get_input_fn(df_test,                          
          num_epochs=1,                          
          n_batch = 356,                          
          shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Restoring parameters from Air\model.ckpt-11000
INFO:tensorflow:Finished evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Saving dict for global step 11000: average_loss = 0.18934268, global_step = 11000, loss = 59.04336

Print the loss using the code below:

In [12]:
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))	
Loss: 59.043362

Calculation of R Square parameter using Tensorflow

I make predictions on the test set

In [13]:
y = estimator.predict(    
         input_fn=get_input_fn(df_test,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [14]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 1871))
#print("Predictions: {}".format(str(predictions)))
INFO:tensorflow:Restoring parameters from Air\model.ckpt-11000
In [15]:
predictions
Out[15]:
[array([2.2904341], dtype=float32),
 array([1.4195127], dtype=float32),
 array([0.9917113], dtype=float32),
 array([1.4134599], dtype=float32),
 array([1.2086823], dtype=float32),
 array([1.4521222], dtype=float32),
 ...]

The model returned a sequence of predictions y. I now convert this sequence into an array.

In [16]:
import numpy as np

conc = np.vstack(predictions)
conc
Out[16]:
array([[2.2904341],
       [1.4195127],
       [0.9917113],
       ...,
       [1.2040666],
       [0.4435346],
       [3.111309 ]], dtype=float32)
In [48]:
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[48]:
dtype('float32')

Now I’m creating a list of real y values from the test set.

In [50]:
y = df_test['CO_GT'].values
y = y.astype('float32')
y.dtype
Out[50]:
dtype('float32')

Now I create a dataframe with y-real and y-predicted variables.

In [47]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[47]:
y         float64
y_pred    float64
dtype: object
In [63]:
def R_squared(y, y_pred):
    
  residual = tf.reduce_sum(tf.square(tf.subtract(y,y_pred)))
  total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
  r2 = tf.subtract(1.0, tf.div(residual, total))
  return r2

To use this function, both variables must have the same data type.

In [51]:
y.dtype
Out[51]:
dtype('float32')
In [52]:
kot.dtype
Out[52]:
dtype('float32')
In [65]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
In [66]:
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
In [67]:
r2 = tf.subtract(1.0, tf.div(residual, total))
In [68]:
r2
Out[68]:
<tf.Tensor 'Sub_27:0' shape=() dtype=float32>
In [77]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  0.90320766
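Note that the R_squared helper defined in In [63] gives the same number in a single call; a short sketch (TensorFlow 1.x session API assumed):

with tf.Session() as sess:
    # y and kot are the float32 arrays of actual and predicted values from above
    print('R Square parameter: ', sess.run(R_squared(y, kot)))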

Calculation of R Square parameter using Pandas

In [78]:
PZU.head(5)
Out[78]:
y y_pred
0 2.2 2.290434
1 1.2 1.419513
2 1.0 0.991711
3 1.5 1.413460
4 1.6 1.471673
In [80]:
PZU['SSE'] = (PZU['y'] - PZU['y_pred'])**2
PZU.head(3)
Out[80]:
y y_pred SSE
0 2.2 2.290434 0.008178
1 1.2 1.419513 0.048186
2 1.0 0.991711 0.000069

Point 2. We calculate the average empirical value of y

In [81]:
PZU['ave_y'] = PZU['y'].mean()
PZU.head(3)
Out[81]:
y y_pred SSE ave_y
0 2.2 2.290434 0.008178 2.061304
1 1.2 1.419513 0.048186 2.061304
2 1.0 0.991711 0.000069 2.061304

Point 3. We calculate the squared difference between the empirical values of y and the average of the empirical values of y

In [83]:
PZU['SST'] = (PZU['y'] - PZU['ave_y'])**2
PZU.head(3)
Out[83]:
y y_pred SSE ave_y SST
0 2.2 2.290434 0.008178 2.061304 0.019237
1 1.2 1.419513 0.048186 2.061304 0.741845
2 1.0 0.991711 0.000069 2.061304 1.126366

Point 4. We calculate the difference between sum of SST and sum of SSE

In [84]:
Sum_SST = PZU['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = PZU['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3659.9984179583107
Sum_SSE : 354.26016629427124

Point 5. We calculate the R Square parameter

In [85]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.903207562998923
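As a cross-check, sklearn gives the same value directly (assuming the PZU frame from above):

from sklearn import metrics

print('R Square parameter: ', metrics.r2_score(PZU['y'], PZU['y_pred']))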

Tutorial: Linear Regression – preliminary data preparation (#1/271120191024)

AirQualityUCI. Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality
 

Part 1. Preliminary data preparation

 
In [1]:
import pandas as pd
df = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';')
df.head(3)
Out[1]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2,6 1360.0 150.0 11,9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13,6 48,9 0,7578 NaN NaN
1 10/03/2004 19.00.00 2 1292.0 112.0 9,4 955.0 103.0 1174.0 92.0 1559.0 972.0 13,3 47,7 0,7255 NaN NaN
2 10/03/2004 20.00.00 2,2 1402.0 88.0 9,0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11,9 54,0 0,7502 NaN NaN
 

Data Set Information:


The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level,within an Italian city. Data were recorded from March 2004 to February 2005 (one year)representing the longest freely available recordings of on field deployed air quality chemical sensor devices responses. Ground Truth hourly averaged concentrations for CO, Non Metanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) and were provided by a co-located reference certified analyzer. Evidences of cross-sensitivities as well as both concept and sensor drifts are present as described in De Vito et al., Sens. And Act. B, Vol. 129,2,2008 (citation required) eventually affecting sensors concentration estimation capabilities. Missing values are tagged with -200 value.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.


Attribute Information:

Date (DD/MM/YYYY)
Time (HH.MM.SS)
True hourly averaged concentration CO in mg/m^3 (reference analyzer)
PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
True hourly averaged NOx concentration in ppb (reference analyzer)
PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
Temperature in °C
Relative Humidity (%)
AH Absolute Humidity


Tutorial: Supplementing data for further analysis

 

Step 1. Data completeness check

In [2]:
df.isnull().sum()
Out[2]:
Date              114
Time              114
CO(GT)            114
PT08.S1(CO)       114
NMHC(GT)          114
C6H6(GT)          114
PT08.S2(NMHC)     114
NOx(GT)           114
PT08.S3(NOx)      114
NO2(GT)           114
PT08.S4(NO2)      114
PT08.S5(O3)       114
T                 114
RH                114
AH                114
Unnamed: 15      9471
Unnamed: 16      9471
dtype: int64
 

There are a lot of missing values. In addition, we learned that the value -200 means no data. We’ll deal with this in a moment. We will now check the statistics of variables in the database.

In [3]:
df.agg(['min', 'max', 'mean', 'median'])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\numpy\lib\nanfunctions.py:1112: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]:
  PT08.S1(CO) NMHC(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) Unnamed: 15 Unnamed: 16
min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 NaN NaN
max 2040.000000 1189.000000 2214.000000 1479.000000 2683.000000 340.000000 2775.000000 2523.000000 NaN NaN
mean 1048.990061 -159.090093 894.595276 168.616971 794.990168 58.148873 1391.479641 975.072032 NaN NaN
median 1053.000000 -200.000000 895.000000 141.000000 794.000000 96.000000 1446.000000 942.000000 NaN NaN
In [4]:
df.shape
Out[4]:
(9471, 17)
 

We delete two empty columns.

In [5]:
del df['Unnamed: 15']
del df['Unnamed: 16']
 

Step 1: Preliminary analysis of data gaps

One more look at how many NaN cells there are.

In [6]:
df.isnull().sum()
Out[6]:
Date             114
Time             114
CO(GT)           114
PT08.S1(CO)      114
NMHC(GT)         114
C6H6(GT)         114
PT08.S2(NMHC)    114
NOx(GT)          114
PT08.S3(NOx)     114
NO2(GT)          114
PT08.S4(NO2)     114
PT08.S5(O3)      114
T                114
RH               114
AH               114
dtype: int64
 

Now I will try to see these empty cells.

In [7]:
df[df['NMHC(GT)'].isnull()]
Out[7]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9357 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9358 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9359 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9360 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9361 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9466 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9467 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9468 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9469 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9470 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

114 rows × 15 columns

 

These are completely empty records. The device was probably cut off from the power supply, so no sensor was working.

In [8]:
df = df.dropna(how='all')
df.isnull().sum()
Out[8]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

We are looking for values of -200, because they mean there is no data. The -200 values are stored in different formats, so I have to carry out the replacement in several ways.

In [9]:
import numpy as np

df = df.replace(-200,np.NaN)
df = df.replace('-200',np.NaN)
df = df.replace('-200.0',np.NaN)
df = df.replace('-200,0',np.NaN)
 

The value of -200 has been changed to NaN and we will see how many empty records there are now.

In [10]:
df.isnull().sum()
Out[10]:
Date                0
Time                0
CO(GT)           1683
PT08.S1(CO)       366
NMHC(GT)         8443
C6H6(GT)          366
PT08.S2(NMHC)     366
NOx(GT)          1639
PT08.S3(NOx)      366
NO2(GT)          1642
PT08.S4(NO2)      366
PT08.S5(O3)       366
T                 366
RH                366
AH                366
dtype: int64
 

Chart of missing data structure.

In [11]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show
Out[11]:
<function matplotlib.pyplot.show(*args, **kw)>
 

The NMHC(GT) variable is the most incomplete, so we eliminate it from the analysis

In [12]:
del df['NMHC(GT)']
 

We display records with missing data – the isna() function

In [13]:
df1 = df[df.isna().any(axis=1)]
df1
Out[13]:
  Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9 11/03/2004 03.00.00 0,6 1010.0 1,7 561.0 NaN 1705.0 NaN 1235.0 501.0 10,3 60,2 0,7517
10 11/03/2004 04.00.00 NaN 1011.0 1,3 527.0 21.0 1818.0 34.0 1197.0 445.0 10,1 60,5 0,7465
33 12/03/2004 03.00.00 0,8 889.0 1,9 574.0 NaN 1680.0 NaN 1187.0 512.0 7,0 62,3 0,6261
34 12/03/2004 04.00.00 NaN 831.0 1,1 506.0 21.0 1893.0 32.0 1134.0 384.0 6,1 65,9 0,6248
39 12/03/2004 09.00.00 NaN 1545.0 22,1 1353.0 NaN 767.0 NaN 2058.0 1588.0 9,2 56,2 0,6561
9058 23/03/2005 04.00.00 NaN 993.0 2,3 604.0 85.0 848.0 65.0 1160.0 762.0 14,5 66,4 1,0919
9130 26/03/2005 04.00.00 NaN 1122.0 6,0 811.0 181.0 641.0 92.0 1336.0 1122.0 16,2 71,2 1,3013
9202 29/03/2005 04.00.00 NaN 883.0 1,3 530.0 63.0 997.0 46.0 1102.0 617.0 13,7 68,2 1,0611
9274 01/04/2005 04.00.00 NaN 818.0 0,8 473.0 47.0 1257.0 41.0 898.0 323.0 13,7 48,8 0,7606
9346 04/04/2005 04.00.00 NaN 864.0 0,8 478.0 52.0 1116.0 43.0 958.0 489.0 11,8 56,0 0,7743

2416 rows × 14 columns

 

Step 2: Check the level of direct correlation to complete the data

CO(GT) has no data every few measurements, so we have to check which variables it correlates with. First I check the data types so that the correlation can be computed.

In [14]:
df.dtypes
Out[14]:
Date              object
Time              object
CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
dtype: object
In [15]:
# df['CO(GT)'].astype(float)
 

ValueError: could not convert string to float: '2,6'

It turns out that it is not so easy to convert text to number format – the problem is the decimal commas. We replace the commas with dots.

In [16]:
df['CO(GT)'] = df['CO(GT)'].str.replace(',','.')
df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',','.')
df['T'] = df['T'].str.replace(',','.')
df['RH'] = df['RH'].str.replace(',','.')
df['AH'] = df['AH'].str.replace(',','.')
 

We change the format from object to float

In [17]:
df[['CO(GT)','C6H6(GT)', 'T','RH','AH']] = df[['CO(GT)','C6H6(GT)', 'T','RH','AH']].astype(float)
In [18]:
df.dtypes
Out[18]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
 

We can now check the level of direct correlation.

In [19]:
df.corr()
Out[19]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [20]:
sns.set(style="ticks")

corr = df.corr()

mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(22, 10))
cmap = sns.diverging_palette(180, 50, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.3, center=0.1,annot=True,
            square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8b43390>
 

Step 3. Filling the gaps in variables based on other variables correlated with it

Filling gaps in the CO (GT) variable.

I check which variable it is most strongly correlated with and fill the gaps based on that variable; where that is not possible, I fill with the previous or next value.
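To make this strategy concrete, here is a toy sketch (hypothetical values) of the groupby forward/backward fill used in the cells below:

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'C6H6(GT)': [1.7, 1.7, 1.3, 1.3],          # strongly correlated, mostly complete variable
    'CO(GT)':   [0.6, np.nan, np.nan, 0.5],    # variable with gaps
})
# Within each group of identical C6H6(GT) readings, copy the known CO(GT)
# value forward and backward onto the missing rows of that group.
# group_keys=False keeps the original index so the result can be assigned back.
toy['CO(GT)'] = toy.groupby('C6H6(GT)', group_keys=False)['CO(GT)'].apply(lambda x: x.ffill().bfill())
print(toy)   # the NaN rows become 0.6 and 0.5 respectively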

In [21]:
df.corr()
Out[21]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [22]:
df.dtypes
Out[22]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
In [23]:
print('missing value in CO(GT): ',df['CO(GT)'].isnull().sum())
missing value in CO(GT):  1683
 

CO (GT) correlation with other variables.

In [24]:
CORREL = df.corr()
CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
Out[24]:
  CO(GT)
PT08.S3(NOx) -0.703446
T 0.022109
AH 0.048556
RH 0.048890
PT08.S4(NO2) 0.630703
NO2(GT) 0.683343
NOx(GT) 0.795028
PT08.S5(O3) 0.854182
PT08.S1(CO) 0.879288
PT08.S2(NMHC) 0.915514
C6H6(GT) 0.931078
CO(GT) 1.000000
 

The strongest correlation with CO(GT) is with C6H6(GT), which is fairly complete. Based on this variable, I fill in the missing values in CO(GT).

In [25]:
df['CO(GT)'] = df.groupby('C6H6(GT)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [26]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  383
In [27]:
df['CO(GT)'] = df.groupby('PT08.S1(CO)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [28]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  370
 

Now I do a simple forward fill – the last valid value is carried forward.

In [29]:
df['CO(GT)'].fillna(method='ffill', inplace=True)   
In [30]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'C6H6 (GT)’

In [31]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  366
In [32]:
df['C6H6(GT)'] = df.groupby('CO(GT)')['C6H6(GT)'].apply(lambda x: x.ffill().bfill())
In [33]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NOx(GT)’

In [34]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  1639
In [35]:
CORREL['NOx(GT)'].to_frame().sort_values('NOx(GT)')
Out[35]:
  NOx(GT)
PT08.S3(NOx) -0.655707
T -0.269683
AH -0.149323
RH 0.221032
PT08.S4(NO2) 0.233731
PT08.S2(NMHC) 0.704435
PT08.S1(CO) 0.713654
C6H6(GT) 0.718839
NO2(GT) 0.763111
PT08.S5(O3) 0.787046
CO(GT) 0.795028
NOx(GT) 1.000000
In [36]:
df['NOx(GT)'] = df.groupby('CO(GT)')['NOx(GT)'].apply(lambda x: x.ffill().bfill())
In [37]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NO2(GT)'

In [38]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  1642
In [39]:
CORREL['NO2(GT)'].to_frame().sort_values('NO2(GT)')
Out[39]:
  NO2(GT)
PT08.S3(NOx) -0.652083
AH -0.335022
T -0.186533
RH -0.091759
PT08.S4(NO2) 0.157360
C6H6(GT) 0.614474
PT08.S1(CO) 0.641529
PT08.S2(NMHC) 0.646245
CO(GT) 0.683343
PT08.S5(O3) 0.708128
NOx(GT) 0.763111
NO2(GT) 1.000000
In [40]:
df['NO2(GT)'] = df.groupby('PT08.S5(O3)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [41]:
df['NO2(GT)'] = df.groupby('CO(GT)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [42]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  0
In [43]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8fea080>
 

I fill in the records where the entire measuring device was not working.

In the heatmap these can be seen as solid lines.

In [44]:
df.shape
Out[44]:
(9357, 14)
In [45]:
df.fillna(method='ffill', inplace=True)
In [46]:
df.shape
Out[46]:
(9357, 14)
In [47]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='Reds')
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad95756d8>
In [48]:
df.isnull().sum()
Out[48]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

The data have been completed! In the next parts of this tutorial, we will carry out the process of building a linear regression model in TensorFlow.

Let’s save the completed file to disk

df.to_csv('c:/TF/AirQ_filled.csv')
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)


Tutorial: Linear Regression – Time variables and shifts. Use of offset in variable correlation (#2/271120191334)

Part 2. Simple multifactorial linear regression

In the previous part of this tutorial, we cleaned the data file from the measuring station. A new, completed measurement data file has been created, which we will now open.

We will now continue to prepare the data for further analysis. One of the most important explanatory variables in linear regression is time. Most man-made and natural phenomena follow hourly, daily and monthly cycles.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

Step 1. Creating the time variable

We check what the date format is

In [2]:
df[['Date','Time']].dtypes
Out[2]:
Date    object
Time    object
dtype: object

There is no date format in the dataframe, so we join the columns containing the date and the time.

In [3]:
df['DATE'] = df['Date']+' '+df['Time']
df['DATE'].head()
Out[3]:
0    10/03/2004 18.00.00
1    10/03/2004 19.00.00
2    10/03/2004 20.00.00
3    10/03/2004 21.00.00
4    10/03/2004 22.00.00
Name: DATE, dtype: object

We create a new column containing the date and time. Then we convert the object format to the date format.

In [4]:
df['DATE'] = pd.to_datetime(df.DATE, format='%d/%m/%Y %H.%M.%S')   # Date is DD/MM/YYYY, Time is HH.MM.SS
df.dtypes
Out[4]:
Unnamed: 0                int64
Date                     object
Time                     object
CO(GT)                  float64
PT08.S1(CO)             float64
C6H6(GT)                float64
PT08.S2(NMHC)           float64
NOx(GT)                 float64
PT08.S3(NOx)            float64
NO2(GT)                 float64
PT08.S4(NO2)            float64
PT08.S5(O3)             float64
T                       float64
RH                      float64
AH                      float64
DATE             datetime64[ns]
dtype: object

Step 2. We add more columns based on the time variable

In industry, the day of the week is very important, so in such models it is worth adding a column with the number of the day.

In [5]:
df['Month'] = df['DATE'].dt.month
df['Weekday'] = df['DATE'].dt.weekday
df['Weekday_name'] = df['DATE'].dt.weekday_name
df['Hours'] = df['DATE'].dt.hour
In [6]:
df[['DATE','Month','Weekday','Weekday_name','Hours']].sample(3)
Out[6]:
DATE Month Weekday Weekday_name Hours
6109 2004-11-20 07:00:00 11 5 Saturday 7
3537 2004-08-05 03:00:00 8 3 Thursday 3
8053 2005-02-09 07:00:00 2 2 Wednesday 7

Graphical analysis of pollution according to time variables

In [7]:
df.pivot_table(index='Weekday_name', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20bbdb38>
In [8]:
df.pivot_table(index='Month', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20f02320>
In [9]:
df.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f210e77f0>

Step 3. Correlation analysis

We set the result variable as:

CO(GT) – actual hourly average CO concentration in mg / m^3 (reference analyzer)

In [10]:
del df['Unnamed: 0']
In [11]:
CORREL = df.corr()
PKP = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PKP
Out[11]:
CO(GT)
PT08.S3(NOx) -0.715683
Weekday -0.140231
RH 0.020122
AH 0.025227
T 0.025639
Month 0.112291
Hours 0.344071
PT08.S4(NO2) 0.631854
NO2(GT) 0.682774
NOx(GT) 0.773677
PT08.S5(O3) 0.858762
PT08.S1(CO) 0.886114
PT08.S2(NMHC) 0.918386
C6H6(GT) 0.932584
CO(GT) 1.000000
In [12]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
PKP.plot(kind='barh', color='red')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[12]:
Text(0, 0.5, 'Continuous independent variables')
<Figure size 720x576 with 0 Axes>

The time-based variables are not strongly correlated with the dependent variable CO(GT).
It is tempting to use the better-correlated independent variables in the model. The problem is that these variables are effectively part of the result: when pollution occurs, all of these substances are present in the air at once.

Our task is to examine how weather and time affect the level of pollution. We’ll cover this task in the next part of the tutorial.

Step 4. We now check the shift for independent variables with a low direct correlation

How does weather affect CO levels?

Variable RH – Relative humidity (%)

We check a variable that has a very low direct correlation with the resulting CO(GT) variable.

In [13]:
def cross_corr(x, y, lag=0):
    return x.corr(y.shift(lag))

def shift_Factor(x,y,R):
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    
    # R is the number of shifts that should be checked by the function
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)
    
    # We find optimal correlation shift
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()
    
    return SF
In [14]:
x = df.RH       # independent variable
y = df['CO(GT)']    # dependent variable
R = 20           # number of shifts that will be checked
In [15]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for RH: ',SKO)
Optimal shift for RH:  12
In [16]:
cross_corr(x, y, lag=SKO)
Out[16]:
0.39204313671898056

Variable AH – Absolute humidity

We check a variable that has a very low direct correlation with the resulting CO(GT) variable.

In [17]:
x = df.AH       # independent variable
SKP = shift_Factor(x,y,R)
print('Optimal shift for AH: ',SKP)
Optimal shift for AH:  12
In [18]:
cross_corr(x, y, lag=SKP)
Out[18]:
0.043756364102677595

Absolute humidity (AH) does not correlate with the CO(GT) variable even after the shift, so we eliminate it from the model.

Variable T – Temperature in °C

We check a variable that has a very low direct correlation with the resulting CO(GT) variable.

In [19]:
x = df['T']      # independent variable
PKP = shift_Factor(x,y,R)
print('Optimal shift for T: ',PKP)
Optimal shift for T:  12
In [20]:
cross_corr(x, y, lag=PKP)
Out[20]:
-0.22446569561762522

We are now creating a new DataFrame with a 12-hour shift

It turns out that the strongest correlation between temperature or humidity and the CO concentration appears at a 12-hour offset.
Below is the data shift creation function.

In [21]:
def df_shif(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return  pd.DataFrame(data=new)
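
As a design note, the same behaviour can also be written with a vectorised pandas shift over the non-target columns. This is only a minimal sketch with a hypothetical name (df_shift_alt); the notebook itself keeps using df_shif defined above.

def df_shift_alt(frame, target, lag):
    # Same idea as df_shif: shift every column except `target` down by `lag` rows.
    out = frame.copy()
    other = [c for c in out.columns if c != target]
    out[other] = out[other].shift(periods=lag)
    return out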

Our goal is to create a multiple regression model:

- The independent variables are: Temperature (T) and Relative humidity RH (%)
- The dependent variable is the level of CO(GT)
In [22]:
df2 = df[['DATE', 'CO(GT)','RH', 'T']]

We add a copy of the date and time that records when the temperature and humidity were measured.

In [23]:
df2['weather_time'] = df2['DATE']
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [24]:
df2.head(3)
Out[24]:
DATE CO(GT) RH T weather_time
0 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-10 18:00:00
1 2004-03-10 19:00:00 2.0 47.7 13.3 2004-03-10 19:00:00
2 2004-03-10 20:00:00 2.2 54.0 11.9 2004-03-10 20:00:00
In [25]:
df3 = df_shif(df2, 'weather_time', lag=12)
df3.rename(columns={'weather_time':'Shift_weather_time'}, inplace=True) 
df3.head(13)
Out[25]:
DATE CO(GT) RH T Shift_weather_time
0 NaT NaN NaN NaN 2004-03-10 18:00:00
1 NaT NaN NaN NaN 2004-03-10 19:00:00
2 NaT NaN NaN NaN 2004-03-10 20:00:00
3 NaT NaN NaN NaN 2004-03-10 21:00:00
4 NaT NaN NaN NaN 2004-03-10 22:00:00
5 NaT NaN NaN NaN 2004-03-10 23:00:00
6 NaT NaN NaN NaN 2004-03-11 00:00:00
7 NaT NaN NaN NaN 2004-03-11 01:00:00
8 NaT NaN NaN NaN 2004-03-11 02:00:00
9 NaT NaN NaN NaN 2004-03-11 03:00:00
10 NaT NaN NaN NaN 2004-03-11 04:00:00
11 NaT NaN NaN NaN 2004-03-11 05:00:00
12 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-11 06:00:00
In [26]:
df4 = df_shif(df3, 'RH', lag=12)
df4.rename(columns={'RH':'Shift_RH'}, inplace=True) 
In [27]:
df5 = df_shif(df4, 'T', lag=12)
df5.rename(columns={'T':'Shift_T'}, inplace=True) 

Deletes rows with incomplete data.

In [28]:
df5 = df5.dropna(how ='any')
In [29]:
df5.head()
Out[29]:
DATE CO(GT) Shift_RH Shift_T Shift_weather_time
36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00
37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00
38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00
39 2004-03-10 21:00:00 2.2 60.6 10.5 2004-03-11 09:00:00
40 2004-03-10 22:00:00 1.6 58.4 10.8 2004-03-11 10:00:00

The table can be read as follows: the temperature and humidity recorded at 6:00 are paired with the carbon monoxide concentration recorded at 18:00.

Graphical analysis of the relationships – humidity and temperature versus carbon monoxide

The relationships look rather weak.

In [30]:
import matplotlib.pyplot as plt

df5.plot(x='Shift_T', y='CO(GT)', style='o')  
plt.title('Shift_T vs CO(GT)')  
plt.xlabel('Shift_T')  
plt.ylabel('CO(GT)')  
plt.show()
In [31]:
df5.plot(x='Shift_RH', y='CO(GT)', style='o')  
plt.title('Shift_RH vs CO(GT)')  
plt.xlabel('Shift_RH')  
plt.ylabel('CO(GT)')  
plt.show()

Step 5. Building a multiple linear regression model in Sklearn

We declare the X and y variables for the model.

In [32]:
X = df5[['Shift_RH', 'Shift_T']].values
y = df5['CO(GT)'].values

I split the data set into a training part and a test part.

In [33]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

In [34]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[34]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [35]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of the model predictions with the real values.

In [36]:
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
Out[36]:
CO(GT) Actual CO(GT)_Predicted
0 0.5 1.63
1 1.9 1.91
2 3.4 2.40
3 1.2 1.45
4 2.4 2.40
In [37]:
from sklearn import metrics

dfKK.head(50).plot()
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>
In [38]:
from sklearn import metrics

print('Mean Absolute Error:    ', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:     ', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756
In [39]:
print('R2 score:               ', metrics.r2_score(y_test, y_pred))
R2 score:                0.15437562015505324
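
Before drawing conclusions it can also help to inspect the fitted coefficients and the score on the training split. This is only a minimal sketch using the regressor fitted above; its output is not reproduced here.

print('Coefficients (Shift_RH, Shift_T):', regressor.coef_)
print('Intercept:                       ', regressor.intercept_)
print('R2 on the training split:        ', regressor.score(X_train, y_train))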

With an R2 of only about 0.15, carbon monoxide contamination cannot be usefully predicted from humidity and temperature alone.
In the next part, we will continue the analysis and preparation of data for linear regression.

Example of the use of shift for linear regression in Python. How to find optimal correlation shift? https://sigmaquality.pl/uncategorized/shift-for-linear-regression/ Thu, 14 Nov 2019 18:10:00 +0000


What is the correlation shift?

In supervised machine learning we have two main directions: classification and regression. Regression needs continuous data, so from time to time we are forced to transform discrete data into continuous values.

More important, however, is to find the linear correlation between the independent variables and the dependent variable that represents the result.

How to find correlation?

In the natural environment everything is correlated with everything else. Rain causes the level of a lake to rise; hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.

But to observe them, a simple correlation coefficient can be insufficient.

The problem is the shift. Rain contributes to a rise of the water in a river, but this rise appears after a couple of hours. The sun makes the water level fall after a couple of days. Frankly speaking, most correlations in real environments have longer or shorter delays.
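
Before the full example below, here is a minimal sketch on synthetic data (not part of the original notebook) showing how a lag can hide a perfect linear relationship from an ordinary correlation:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = pd.Series(rng.rand(200))   # the driving variable, e.g. rainfall
y = 2 * x.shift(3) + 5         # the reaction appears three steps later

print(round(x.corr(y), 3))           # at lag 0 the correlation is close to zero
print(round(x.shift(3).corr(y), 3))  # re-aligned by the true 3-step lag it is exactly 1.0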

How to find correlation shift?

In [1]:
from scipy import signal, fftpack
import pandas as pd
import numpy

Let’s build this dataframe.

In [2]:
AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
       190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
       355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
       385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
       220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
       385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
       415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
       235, 310, 250, 325, 265, 340, 280, 355]

BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
       150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
       249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
       150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
       168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
       267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
       159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
       177, 222, 186, 231, 195, 240, 204, 249]

CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148,  70, 100,  76, 106,
        82, 112,  88, 118,  94, 124, 100, 130, 106, 136, 112, 142, 118,
       148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
       160, 190, 166, 196, 172,  70, 100,  76, 106,  82, 112,  88, 118,
        94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
       160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
       172, 202, 178,  70, 100,  76, 106,  82, 112,  88, 118,  94, 124,
       100, 130, 106, 136, 112, 142, 118, 148]

DDD = [ 437,  453,  764,  346,  239,  420,  600,  456,  636,  492,  672,
        528,  708,  564,  744,  600,  780,  636,  816,  672,  852,  708,
        888,  744,  924,  780,  960,  816,  996,  852, 1032,  888, 1068,
        924, 1104,  960, 1140,  996, 1176, 1032,  420,  600,  456,  636,
        492,  672,  528,  708,  564,  744,  600,  780,  636,  816,  672,
        852,  708,  888,  744,  924,  780,  960,  816,  996,  852, 1032,
        888, 1068,  924, 1104,  960, 1140,  996, 1176, 1032, 1212, 1068,
        420,  600,  456,  636,  492,  672,  528,  708,  564,  744,  600,
        780,  636,  816,  672,  852,  708,  888,  744,  924,  780,  960]

RESULT = [ 35,  50,  38,  53,  41,  56,  44,  59,  47,  62,  50,  65,  53,
        68,  56,  71,  59,  74,  62,  77,  65,  80,  68,  83,  71,  86,
        74,  89,  77,  92,  80,  95,  83,  98,  86,  35,  50,  38,  53,
        41,  56,  44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,
        74,  62,  77,  65,  80,  68,  83,  71,  86,  74,  89,  77,  92,
        80,  95,  83,  98,  86, 101,  89,  35,  50,  38,  53,  41,  56,
        44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,  74,  62,
        77,  65,  80,  68,  83,  71,  86,  74]


df = pd.DataFrame({'AAA': AAA, 'BBB': BBB,'CCC':CCC,'DDD':DDD, 'RESULT':RESULT})

df.head()
Out[2]:
AAA BBB CCC DDD RESULT
0 295 123 124 437 35
1 370 221 154 453 50
2 310 113 130 764 38
3 385 105 160 346 53
4 325 150 136 239 41

The phenomena described in the DataFrame are perfectly correlated, but we do not know that yet. First we use the ordinary method of searching for correlation.

In [3]:
corr = df.corr()
corr
Out[3]:
AAA BBB CCC DDD RESULT
AAA 1.000000 0.072278 0.715892 0.206945 -0.261955
BBB 0.072278 1.000000 0.244349 0.748050 0.383326
CCC 0.715892 0.244349 1.000000 0.389072 -0.169164
DDD 0.206945 0.748050 0.389072 1.000000 0.248511
RESULT -0.261955 0.383326 -0.169164 0.248511 1.000000
In [4]:
corr['RESULT']
Out[4]:
AAA      -0.261955
BBB       0.383326
CCC      -0.169164
DDD       0.248511
RESULT    1.000000
Name: RESULT, dtype: float64

Is that all? Is this the entire correlation available for linear regression? How do we find the correlation delay?

Function to find optimal correlation shift

I wrote a dedicated function that detects the shift value giving the maximal linear correlation between the dependent and independent variables.

In [5]:
def cross_corr(x, y, lag=0):
    return x.corr(y.shift(lag))

def shift_Factor(x,y,R):
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    
    # R is the number of shifts that should be checked by the function
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)
    
    # We find optimal correlation shift
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()
    
    return SF

We declare the variables passed to the function.

In [6]:
x = df.AAA       # independent variable
y = df.RESULT    # dependent variable
R = 20           # number of shifts that will be checked

Shift for variable AAA

We are looking for the optimal correlation shift for variable AAA.

In [13]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)
Optimal shift for AAA:  11

The function finds that a shift of 11 rows gives the largest correlation (in absolute value) between the independent variable AAA and the RESULT variable. What is the level of that correlation?

In [8]:
cross_corr(x, y, lag=SKO)
Out[8]:
0.9999999999999996

We create a new DataFrame with the optimal shift.

In [9]:
def df_shif(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return  pd.DataFrame(data=new)
In [10]:
df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True) 
df2.head(13)
Out[10]:
SHIFTED AAA BBB CCC DDD RESULT
0 295 NaN NaN NaN NaN
1 370 NaN NaN NaN NaN
2 310 NaN NaN NaN NaN
3 385 NaN NaN NaN NaN
4 325 NaN NaN NaN NaN
5 400 NaN NaN NaN NaN
6 340 NaN NaN NaN NaN
7 415 NaN NaN NaN NaN
8 355 NaN NaN NaN NaN
9 430 NaN NaN NaN NaN
10 370 NaN NaN NaN NaN
11 175 123.0 124.0 437.0 35.0
12 250 221.0 154.0 453.0 50.0

Now we repeat these steps for the remaining independent variables.

Shift for variable BBB

In [11]:
BBB = df.BBB       # independent variable
In [14]:
SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)
Optimal shift for BBB:  3
In [16]:
df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)

Shift for variable CCC

In [18]:
CCC = df.CCC
In [26]:
SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)
Optimal shift for CCC:  9
In [22]:
df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)

Shift for variable DDD

In [23]:
DDD = df.DDD
In [27]:
PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)
Optimal shift for DDD:  5
In [31]:
df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)

Correlation after making the shifts

I drop the rows of the DataFrame where NaN values appear and calculate the correlation.

In [33]:
df5 = df5.dropna(how='any')
df5.head(3)
Out[33]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
28 175.0 105.0 70.0 420.0 35.0
29 250.0 150.0 100.0 600.0 50.0
30 190.0 114.0 76.0 456.0 38.0
In [34]:
corr = df5.corr()
corr
Out[34]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
SHIFTED AAA 1.0 1.0 1.0 1.0 1.0
SHIFTED BBB 1.0 1.0 1.0 1.0 1.0
SHIFTED CCC 1.0 1.0 1.0 1.0 1.0
SHIFTED DDD 1.0 1.0 1.0 1.0 1.0
RESULT 1.0 1.0 1.0 1.0 1.0
In [35]:
corr['RESULT']
Out[35]:
SHIFTED AAA    1.0
SHIFTED BBB    1.0
SHIFTED CCC    1.0
SHIFTED DDD    1.0
RESULT         1.0
Name: RESULT, dtype: float64

As we can see, the independent variables are perfectly correlated with the result variable. This relationship was hidden because of the shifts.
I hope this convinces researchers to make checking for shifts a routine step of model building.
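
As a quick follow-up (a minimal sketch, not part of the original notebook), fitting an ordinary linear regression on the shifted frame confirms that the relationship is now captured almost perfectly:

from sklearn.linear_model import LinearRegression

X = df5[['SHIFTED AAA', 'SHIFTED BBB', 'SHIFTED CCC', 'SHIFTED DDD']].values
y_res = df5['RESULT'].values

model = LinearRegression().fit(X, y_res)
print('R2 on the shifted data:', model.score(X, y_res))   # expected to be close to 1.0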

(Part 1) Process of forecasting the output electricity for power plant working in combined cycle by the linear regression equation https://sigmaquality.pl/uncategorized/part-1-process-of-forecasting-the-output-electricity-for-power-plant-working-in-combined-cycle-by-the-linear-regression-equation/ Fri, 27 Sep 2019 19:23:00 +0000

Today we find out how to create a simple regression model to predict the level of output electricity of a power plant working in combined cycle.

Source of data: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The data set contains 9568 measurements made between 2006 and 2011, while the power plant was working at full capacity. It contains the following process characteristics:

  • Temperature (T) in the range 1.81°C to 37.11°C,
  • Ambient Pressure (AP) in the range 992.89-1033.30 millibar,
  • Relative Humidity (RH) in the range 25.56% to 100.16%,
  • Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg,
  • Net hourly electrical energy output (EP) in the range 420.26-495.76 MW

A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the vacuum is collected from and has an effect on the steam turbine, the other three ambient variables affect the GT performance.

Process of forecasting by the linear regression model

We open database and useful libraries.

In [1]:
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
import math

## https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

df = pd.read_excel('c:/1/Folds5x2_pp.xlsx')
df.sample(5)
Out[1]:
AT V AP RH PE
5844 20.76 59.21 1017.94 80.84 447.67
8410 7.54 38.56 1016.49 69.10 486.76
7916 14.73 58.20 1019.45 82.48 458.56
2253 14.56 40.69 1015.48 73.73 469.31
6735 14.57 37.86 1022.44 68.82 463.58

We analyze the size and parameters of the data set.

In [2]:
df.shape
Out[2]:
(9568, 5)
In [3]:
df.dtypes
Out[3]:
AT    float64
V     float64
AP    float64
RH    float64
PE    float64
dtype: object

To improve the transparency of the code, we change the names of the columns.

In [4]:
df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']
df.sample(5)
Out[4]:
Temperature Exhaust_Vacuum Ambient_Pressure Relative_Humidity Energy_output
9540 23.53 50.05 1005.63 78.40 443.71
9270 7.49 39.81 1018.20 85.96 483.71
163 21.87 61.45 1011.13 92.22 445.21
3479 21.63 50.16 1005.70 84.84 448.32
774 26.28 75.23 1011.44 68.35 435.28

Basic conditions for a valid linear regression model

Failure to comply with these rules leads to wrong, misleading results from the model; a short sketch of how to check them follows the list.

  1. There is no correlation between the descriptive (independent) variables.
  2. There is a linear relationship between the predictive and result variables.
  3. The errors from the model should have a normal distribution.
  4. Homogeneity of the error variance.
  5. Lack of correlation between errors: the error of one observation should be independent of the errors of other observations.
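
A minimal sketch of how some of these conditions can be checked programmatically (assumed variable names; the detailed checks are carried out in Part 2 of this publication):

import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity']

# 1. Correlation between the independent variables
print(df[predictors].corr())

# 5. Autocorrelation of the errors of a fitted model (Durbin-Watson close to 2 is good)
X = sm.add_constant(df[predictors])
model = sm.OLS(df['Energy_output'], X).fit()
print(durbin_watson(model.resid))

# 1. Collinearity measured with the Variance Inflation Factor
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))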

Analysis of the relationship between the dependent and independent variables

In [5]:
CORREL = df.corr().sort_values('Energy_output')
CORREL['Energy_output']
Out[5]:
Temperature         -0.948128
Exhaust_Vacuum      -0.869780
Relative_Humidity    0.389794
Ambient_Pressure     0.518429
Energy_output        1.000000
Name: Energy_output, dtype: float64

There is a high level of correlation between the result variable, energy output (EP), and the exogenous factors Temperature (T) and Exhaust Vacuum (V).

One factor regression model

In [6]:
lm = smf.ols(formula = 'Energy_output ~ Temperature', data = df).fit()
lm.summary()
Out[6]:
OLS Regression Results
Dep. Variable: Energy_output R-squared: 0.899
Model: OLS Adj. R-squared: 0.899
Method: Least Squares F-statistic: 8.510e+04
Date: Fri, 06 Sep 2019 Prob (F-statistic): 0.00
Time: 13:10:52 Log-Likelihood: -29756.
No. Observations: 9568 AIC: 5.952e+04
Df Residuals: 9566 BIC: 5.953e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 497.0341 0.156 3177.280 0.000 496.727 497.341
Temperature -2.1713 0.007 -291.715 0.000 -2.186 -2.157
Omnibus: 417.457 Durbin-Watson: 2.033
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1117.844
Skew: -0.209 Prob(JB): 1.83e-243
Kurtosis: 4.621 Cond. No. 59.4

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The one-factor regression model has very good predictive properties (R-squared of about 0.90).
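
To read the fitted equation directly (a minimal sketch; the coefficients are taken from the OLS summary above), the model is approximately Energy_output = 497.03 - 2.17 * Temperature, so for example:

# Predicted net energy output at 20 degrees C, using the coefficients from the summary above
intercept, slope = 497.0341, -2.1713
print(intercept + slope * 20)   # about 453.6 MW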

In [7]:
plt.figure()
plt.scatter(df.Temperature, df.Energy_output, c = 'grey')
plt.plot(df.Temperature, lm.params[0] + lm.params[1] * df.Temperature, c = 'r')
plt.xlabel('Temperature')
plt.ylabel('Energy_output')
plt.title("Linear Regression Plot")
Out[7]:
Text(0.5, 1.0, 'Linear Regression Plot')

Multi factor regression model

In [8]:
lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
print (lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          Energy_output   R-squared:                       0.929
Model:                            OLS   Adj. R-squared:                  0.929
Method:                 Least Squares   F-statistic:                 3.114e+04
Date:                Fri, 06 Sep 2019   Prob (F-statistic):               0.00
Time:                        13:10:53   Log-Likelihood:                -28088.
No. Observations:                9568   AIC:                         5.619e+04
Df Residuals:                    9563   BIC:                         5.622e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           454.6093      9.749     46.634      0.000     435.500     473.718
Temperature          -1.9775      0.015   -129.342      0.000      -2.007      -1.948
Exhaust_Vacuum       -0.2339      0.007    -32.122      0.000      -0.248      -0.220
Relative_Humidity    -0.1581      0.004    -37.918      0.000      -0.166      -0.150
Ambient_Pressure      0.0621      0.009      6.564      0.000       0.044       0.081
==============================================================================
Omnibus:                      892.002   Durbin-Watson:                   2.033
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4086.777
Skew:                          -0.352   Prob(JB):                         0.00
Kurtosis:                       6.123   Cond. No.                     2.13e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.13e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The multi-factor regression model has very good descriptive statistics (R-squared of about 0.93), but the large condition number reported in the warnings raises doubts about multicollinearity.

Analysis of the variables used in the model

In [9]:
df.describe()
Out[9]:
Temperature Exhaust_Vacuum Ambient_Pressure Relative_Humidity Energy_output
count 9568.000000 9568.000000 9568.000000 9568.000000 9568.000000
mean 19.651231 54.305804 1013.259078 73.308978 454.365009
std 7.452473 12.707893 5.938784 14.600269 17.066995
min 1.810000 25.360000 992.890000 25.560000 420.260000
25% 13.510000 41.740000 1009.100000 63.327500 439.750000
50% 20.345000 52.080000 1012.940000 74.975000 451.550000
75% 25.720000 66.540000 1017.260000 84.830000 468.430000
max 37.110000 81.560000 1033.300000 100.160000 495.760000

Analysis of the distribution of the result variable

In [10]:
plt.rcParams['figure.figsize'] = (5, 4)
sns.distplot(df['Energy_output'])    
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xcb106d8>

Analysis of correlation between independent variables

In [11]:
plt.rcParams['figure.figsize'] = (5, 4)
sns.heatmap (df.corr (), cmap="YlGnBu")
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xcb03c18>

We can easily see that there is a very high positive correlation between the independent variables Temperature (T) and Exhaust Vacuum (V).
Below is another form of the correlation matrix.

In [12]:
sns.heatmap (df.corr (), cmap="coolwarm", annot=True, cbar=False)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xd5ae438>

We check the completeness of the data.

In [13]:
df.isnull().sum()
Out[13]:
Temperature          0
Exhaust_Vacuum       0
Ambient_Pressure     0
Relative_Humidity    0
Energy_output        0
dtype: int64

Graphical presentation of the relationship between the independent and dependent variables

We divide the result variable into two categories: „High power” and „Low power”, to see how the descriptive variables behave in each group.

In [14]:
Ewa = ['High power', 'Low power']

df['Power'] = pd.qcut(df['Energy_output'],2, labels=Ewa)
df.sample(4)
Out[14]:
Temperature Exhaust_Vacuum Ambient_Pressure Relative_Humidity Energy_output Power
2743 30.20 73.67 1006.31 62.14 428.72 High power
1095 30.33 68.67 1006.00 54.99 435.53 High power
4327 9.15 41.82 1032.88 75.11 477.78 Low power
3457 32.74 68.31 1010.23 41.29 441.66 High power

We create a graphical relationship matrix.

In [15]:
sns.pairplot(data=df[['Temperature' ,'Exhaust_Vacuum','Ambient_Pressure', 'Relative_Humidity', 'Power']], hue='Power', dropna=True, height=2)
Out[15]:
<seaborn.axisgrid.PairGrid at 0xd633240>

The relationship between Temperature (T) and Exhaust Vacuum (V) once again shows a high level of correlation.

In [16]:
sns.jointplot(x='Temperature', y='Exhaust_Vacuum', data=df)
Out[16]:
<seaborn.axisgrid.JointGrid at 0xb9946d8>

(Part 2) Process of forecasting the output electricity for power plant working in combined cycle by the linear regression equation https://sigmaquality.pl/uncategorized/part-2-process-of-forecasting-the-output-electricity-for-power-plant-working-in-combined-cycle-by-the-linear-regression-equation/ Fri, 27 Sep 2019 19:23:00 +0000

A correctly built linear regression model is a mapping of reality. It can be used in diagnostic systems to prevent failures or even installation stoppage. Monitoring the production process with a regression model is an effective form of problem prevention.

In the previous part of this publication I showed how to compose a linear regression model. Now we will concentrate on analysing the variables and the residuals.
Let us recall the fundamental rules of the linear regression model.

Failure to comply with these rules leads to wrong, misleading results from the model.

  1. There is no correlation between the descriptive (independent) variables.
  2. There is a linear relationship between the predictive and result variables.
  3. The errors from the model should have a normal distribution.
  4. Homogeneity of the error variance.
  5. Lack of correlation between errors: the error of one observation should be independent of the errors of other observations.

Parameters of the multi-factor regression model

We load the needed libraries and the database.

In [3]:
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats 
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
import math
## https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

df = pd.read_excel('c:/1/Folds5x2_pp.xlsx')

df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']

We create the multi-factor regression model.

In [4]:
lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
lm.summary()
Out[4]:
OLS Regression Results
Dep. Variable: Energy_output R-squared: 0.929
Model: OLS Adj. R-squared: 0.929
Method: Least Squares F-statistic: 3.114e+04
Date: Fri, 06 Sep 2019 Prob (F-statistic): 0.00
Time: 10:01:48 Log-Likelihood: -28088.
No. Observations: 9568 AIC: 5.619e+04
Df Residuals: 9563 BIC: 5.622e+04
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 454.6093 9.749 46.634 0.000 435.500 473.718
Temperature -1.9775 0.015 -129.342 0.000 -2.007 -1.948
Exhaust_Vacuum -0.2339 0.007 -32.122 0.000 -0.248 -0.220
Relative_Humidity -0.1581 0.004 -37.918 0.000 -0.166 -0.150
Ambient_Pressure 0.0621 0.009 6.564 0.000 0.044 0.081
Omnibus: 892.002 Durbin-Watson: 2.033
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4086.777
Skew: -0.352 Prob(JB): 0.00
Kurtosis: 6.123 Cond. No. 2.13e+05

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.13e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

In [5]:
coef = lm.params.to_dict()

Now we can recall any coefficient of the descriptive variables.

In [6]:
coef['Temperature']
Out[6]:
-1.977513106635385
In [7]:
coef['Intercept']
Out[7]:
454.6092743153056

Displaying the residuals (errors) of the model

The residuals of the model (the prediction errors) are the differences between the empirical values and their theoretical predictions coming from the regression model.

In [8]:
lm.resid.sample(6)
Out[8]:
2922    2.484973
5860    4.214192
2534   -7.031075
5202   -3.479953
3076   -5.572512
7444    4.709399
dtype: float64

Displaying the result of model prediction

We generate predicted values from the model.

In [9]:
lm.predict()
Out[9]:
array([467.26978996, 444.0773659 , 483.56264263, ..., 432.40579787,
       443.03667582, 449.69603741])

We create additional columns in the DataFrame that contain the prediction and the residuals, that is, the differences between the empirical values and their estimates.

In [10]:
df['Predict'] = pd.Series(lm.predict())
df['Resid'] = pd.Series(lm.resid)
df.sample(8)[['Energy_output','Predict','Resid']]
Out[10]:
Energy_output Predict Resid
4620 463.68 468.239190 -4.559190
4330 489.03 481.338666 7.691334
320 455.28 450.579079 4.700921
9108 449.93 450.440091 -0.510091
991 446.80 450.080389 -3.280389
8617 447.22 450.579220 -3.359220
9247 442.73 445.176771 -2.446771
3171 425.14 430.794283 -5.654283

Analysis of the normal distribution of the model residuals

In [11]:
sns.kdeplot(np.array(df['Resid']), bw=10)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xc705ac8>

On this density plot the worrying feature is the left tail, which is too long.

In [12]:
sns.distplot(np.array(df['Resid']))
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xbe69550>

Anderson-Darling normality test

The Anderson-Darling test was developed by Theodore Anderson and Donald Darling in 1952. On the plots below, the distribution is normal when the points cover the diagonal line.

In [15]:
sm.qqplot(df['Resid'],color='r')
Out[15]:
In [16]:
import pylab
scipystats.probplot(df['Resid'], dist="norm", plot=pylab)
Out[16]:
((array([-3.79967944, -3.57391754, -3.44994448, ...,  3.44994448,
          3.57391754,  3.79967944]),
  array([-43.43539373, -43.03584359, -38.35687614, ...,  16.8007967 ,
          17.42287291,  17.77771926])),
 (4.509644478448458, 4.154557260475787e-12, 0.989212398876157))

We calculate the value of the Anderson-Darling statistic.

In [18]:
import scipy
scipy.stats.anderson(df['Resid'], dist='norm' ) 
Out[18]:
AndersonResult(statistic=9.20901667254293, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

If the obtained statistic (statistic = 9.209) is higher than the critical value (critical_values = [0.576, 0.656, 0.787, 0.918, 1.092]) for a given significance level (significance_level = [15, 10, 5, 2.5, 1] %), we can reject the null hypothesis that the distribution is normal at that level.

Here the statistic exceeds every critical value, so the Anderson-Darling test says that the residual distribution is not normal.
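
A minimal sketch (not part of the original notebook) of turning that comparison into code:

import scipy.stats

result = scipy.stats.anderson(df['Resid'], dist='norm')
for crit, sig in zip(result.critical_values, result.significance_level):
    if result.statistic > crit:
        print('Reject normality at the', sig, '% level (statistic', round(result.statistic, 2), '>', crit, ')')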

Kolmogorov-Smirnov test

In [19]:
from scipy import stats
stats.kstest(df['Resid'], 'norm')
Out[19]:
KstestResult(statistic=0.3311386887701732, pvalue=0.0)

The p-value is effectively zero, far below any reasonable significance level, which indicates that the distribution of the model residuals is not normal.

We compare the realisation with the prediction.

In [22]:
sns.kdeplot(np.array(df['Energy_output']), bw=10)
sns.distplot(np.array(df['Predict']))
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0xb812198>

Test for homogeneity of the error variance

One of the principles of regression is that the model residuals should have a stable, homogeneous variance. When the model is good, the variances should be homogeneous.

In [24]:
resid = lm.resid
plt.scatter(lm.predict(), resid)
Out[24]:
<matplotlib.collections.PathCollection at 0xba846a0>

The plot shows that part of the errors have a large dispersion; the variance is not entirely homogeneous.

Test for autocorrelation of the model residuals

One of the fundamental principles of the linear regression model is the independence of the observation errors. The residuals of the model should not be correlated with one another. A well-fitted model is assumed when the residuals are mutually independent and their distribution is random, without any pattern.

One method of checking the random character of the residual distribution is based on Pearson's r and checks the autocorrelation among the errors; the Durbin-Watson statistic is used for this purpose. The value for our model, already shown in the standard summary sheet, is about 2.

Source of knowledge: http://www.naukowiec.org/wiedza/statystyka/test-durbina-watsona-niezaleznosc-bledow-obserwacji_423.html

In [26]:
import statsmodels
statsmodels.stats.stattools.durbin_watson(df['Resid'], axis=0)
Out[26]:
2.0329358073274766

The DW test statistic is approximately 2 (we also got this information from the standard summary of the linear regression model). The DW statistic ranges from 0 to 4: when the value approaches 4 there is negative autocorrelation, and when it approaches 0 there are positive correlations among the residuals. A value of 2 indicates that there is no autocorrelation among the errors.

Test of collinearity among predictor variables

One of the fundamental principles of regression is the lack of collinearity among the independent variables. When two predictors are strongly correlated with each other, one of them loses its predictive power to the other. Strong correlation among predictors leads to deterioration of the model parameters.

We found out in the previous part of this publication that there is a large correlation between two variables: Temperature (T) and Exhaust Vacuum (V). Now the question arises which variable we should remove from the model. A good tool is the VIF (Variance Inflation Factor). Thanks to this factor we are able to point out which variable should be eliminated from the model.

In [27]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
y, X = dmatrices('Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df, return_type = "dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
[43761.15186572733, 5.9776016741918685, 3.9430030401259746, 1.7052900320740478, 1.4526394006715817]

The result, in the form of a vector, lists the variables in the same order as in the model formula; the first value corresponds to the intercept and is normally ignored. A common VIF recommendation says that if the factor assigned to a variable is greater than 5, this variable is highly correlated with the other variables and should be eliminated from the model.

The test shows that the Temperature (T) variable should be removed from the model. This confirms our conclusion from the first part of the investigation that there is detrimental collinearity between the predictors.
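
To make the vector easier to read, the VIF values can be paired with the column names of the design matrix (a minimal sketch; the first entry is the intercept and is normally ignored):

# Pair each VIF value with the corresponding column of the design matrix X
vif_table = pd.DataFrame({'variable': X.columns, 'VIF': vif})
print(vif_table)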

Final conclusion

To use a regression model effectively to describe and then monitor production processes, we need to check thoroughly whether the fundamental principles of regression are met. This time we found out that one of the factors ought to be removed because of predictor collinearity.

After eliminating this variable, all the work should be done once again, as sketched below.

It is also possible to use a one-factor model based on the variable most highly correlated with the result variable.
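
A minimal sketch of the re-estimated model without the Temperature variable (the new statistics are not reproduced here, and all diagnostic checks would have to be repeated):

lm_reduced = smf.ols(
    formula='Energy_output ~ Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure',
    data=df
).fit()
print(lm_reduced.summary())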
