Improving the linear regression model using synthetic variables Part 1

EN031220191208
 

Parking Birmingham occupancy analysis

Source of data: https://archive.ics.uci.edu/ml/datasets/Parking+Birmingham

In [1]:
import pandas as pd

df = pd.read_csv('c:/TF/ParkingBirmingham.csv')
df.head(3)
Out[1]:
  SystemCodeNumber Capacity Occupancy LastUpdated
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42
 

Which parking lots do we have in Birmingham?

In [2]:
df['SystemCodeNumber'].value_counts()
Out[2]:
BHMNCPHST01         1312
Broad Street        1312
BHMBCCMKT01         1312
BHMEURBRD01         1312
Shopping            1312
Others-CCCPS119a    1312
Others-CCCPS105a    1312
BHMNCPNST01         1312
Others-CCCPS98      1312
BHMMBMMBX01         1312
Others-CCCPS135a    1312
Others-CCCPS202     1312
BHMBCCTHL01         1312
Others-CCCPS8       1312
BHMBCCSNH01         1294
Others-CCCPS133     1294
BHMNCPLDH01         1292
BHMNCPPLS01         1291
BHMBCCPST01         1276
BHMEURBRD02         1276
NIA South           1204
NIA Car Parks       1204
BHMNCPRAN01         1186
BHMBRCBRG02         1186
BHMBRCBRG03         1186
Bull Ring           1186
BHMBRCBRG01         1186
BHMNCPNHS01         1038
NIA North            162
BHMBRTARC01           88
Name: SystemCodeNumber, dtype: int64
 

Checking the completeness of the data

In [3]:
df.isnull().sum()
Out[3]:
SystemCodeNumber    0
Capacity            0
Occupancy           0
LastUpdated         0
dtype: int64
 

Checking data type

In [4]:
df.dtypes
Out[4]:
SystemCodeNumber    object
Capacity             int64
Occupancy            int64
LastUpdated         object
dtype: object
 

We need a date-related variable, because time probably explains parking occupancy best.

In [5]:
df.LastUpdated = pd.to_datetime(df.LastUpdated)
df.dtypes
Out[5]:
SystemCodeNumber            object
Capacity                     int64
Occupancy                    int64
LastUpdated         datetime64[ns]
dtype: object
 

I create date-related data.

In [6]:
df['month'] = df.LastUpdated.dt.month
df['hour'] = df.LastUpdated.dt.hour
df['weekday_name'] = df.LastUpdated.dt.weekday_name
df['weekday'] = df.LastUpdated.dt.weekday
In [7]:
df.head(4)
Out[7]:
  SystemCodeNumber Capacity Occupancy LastUpdated month hour weekday_name weekday
0 BHMBCCMKT01 577 61 2016-10-04 07:59:42 10 7 Tuesday 1
1 BHMBCCMKT01 577 64 2016-10-04 08:25:42 10 8 Tuesday 1
2 BHMBCCMKT01 577 80 2016-10-04 08:59:42 10 8 Tuesday 1
3 BHMBCCMKT01 577 107 2016-10-04 09:32:46 10 9 Tuesday 1
 

I choose a parking lot for the analysis.

In [8]:
# BHMBCCMKT01
# BHMNCPNST01
# BHMBCCTHL01 (48)
# BHMMBMMBX01 (34)
# BHMBCCSNH01
df2 = df.loc[df['SystemCodeNumber']=='BHMBCCTHL01'] 
df2.shape
Out[8]:
(1312, 8)
 

Sklearn linear regression model

In [9]:
X = df2[['month', 'hour', 'weekday'] ].values
y = df2['Occupancy'].values
In [10]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[10]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [11]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [12]:
from sklearn import metrics

print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.47651714547905455
 

The R Square value for the sklearn linear regression model is a little better, but the model is still very weak.

Improving the linear regression model

Let’s start by checking the correlation with the dependent variable.

In [13]:
CORREL = df2.corr().sort_values('Occupancy')
CORREL['Occupancy'].to_frame().sort_values('Occupancy')
Out[13]:
  Occupancy
weekday -0.027324
month 0.314196
hour 0.662273
Occupancy 1.000000
Capacity NaN
 

One of the basic conditions for building a linear regression model is a linear relationship between the dependent and independent variables.
When we are in the city and want to park, the hour is very important: it is difficult to park at peak times, yet at the same hour on Saturday there are no cars. A given hour on Sunday is therefore not the same as the same hour on Wednesday.
So we create a synthetic variable from the combination of weekday and hour.

We create a synthetic variable: Hour on Weekday (HoW)

In [14]:
df2['combined_col'] = df2[['weekday_name', 'hour']].astype(str).apply('-'.join, axis=1)
df2['HoW'] = pd.Categorical(df2['combined_col']).codes
df2['HoW'].head(3)
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
(the same warning is raised again for the second assignment)
Out[14]:
3882    57
3883    58
3884    58
Name: HoW, dtype: int8
 

We’ve created a synthetic variable that is a column containing the coded combination of day of the week and time. We can easily read this code.
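A side note: the SettingWithCopyWarning above appears because df2 is a slice of df. A minimal sketch of the same transformation done on an explicit copy, which avoids the warning (assuming the df loaded in the cells above):

# Take an explicit copy of the slice so later assignments do not touch df
df2 = df.loc[df['SystemCodeNumber'] == 'BHMBCCTHL01'].copy()

# Hour on Weekday: one combined label per (weekday, hour) pair, then integer codes
df2['combined_col'] = df2['weekday_name'].astype(str) + '-' + df2['hour'].astype(str)
df2['HoW'] = pd.Categorical(df2['combined_col']).codes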

In [15]:
df2[['HoW', 'weekday_name', 'hour']].head(4)
Out[15]:
  HoW weekday_name hour
3882 57 Tuesday 7
3883 58 Tuesday 8
3884 58 Tuesday 8
3885 59 Tuesday 9
In [16]:
CORREL = df2.corr().sort_values('Occupancy')
CORREL['Occupancy'].to_frame().sort_values('Occupancy')
Out[16]:
  Occupancy
weekday -0.027324
HoW 0.059408
month 0.314196
hour 0.662273
Occupancy 1.000000
Capacity NaN
 

Now I am adding the new synthetic variable to the sklearn linear regression model.

In [17]:
X = df2[['month', 'hour', 'weekday','HoW'] ].values
y = df2['Occupancy'].values
In [18]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[18]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [19]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [20]:
from sklearn import metrics

print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.48074077157048334
 

As we can see, the introduction of a synthetic variable slightly improved the quality of the model.
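One likely reason the gain is so small is that HoW is an arbitrary integer code, so the linear model can fit only a single slope to it. A hedged sketch of one-hot encoding the weekday-hour combination instead (an extension, not part of the original analysis):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

# Replace the integer HoW code with one dummy column per weekday-hour combination
X = pd.concat([df2[['month']], pd.get_dummies(df2['combined_col'])], axis=1).values
y = df2['Occupancy'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)
print('R Square:', metrics.r2_score(y_test, regressor.predict(X_test)))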


Polynomial regression model

The polynomial regression model is technically a special case of multiple linear regression. The definition and explanation below come from Wikipedia: https://en.wikipedia.org/wiki/Polynomial_regression; I quote them here to introduce the topic of polynomial regression.

The polynomial regression model can be used in any linear equation, both for linear regression and for linear classification.

The goal of regression analysis is to model the expected value of a dependent variable y in terms of the value of an independent variable (or vector of independent variables) x. In simple linear regression, the model is used, where ε is an unobserved random error with mean zero conditioned on a scalar variable x. In this model, for each unit increase in the value of x, the conditional expectation of y increases by β1 units.

$y = \beta_0 + \beta_1 x + \varepsilon$

In many settings, such a linear relationship may not hold. For example, if we are modeling the yield of a chemical synthesis in terms of the temperature at which the synthesis takes place, we may find that the yield improves by increasing amounts for each unit increase in temperature. In this case, we might propose a quadratic model of the form

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$

In general, we can model the expected value of y as an nth degree polynomial, yielding the general polynomial regression model:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots + \beta_n x^n + \varepsilon$

Conveniently, these models are all linear from the point of view of estimation, since the regression function is linear in terms of the unknown parameters β0, β1, …. Therefore, for least squares analysis, the computational and inferential problems of polynomial regression can be completely addressed using the techniques of multiple regression. This is done by treating x, x2, … as being distinct independent variables in a multiple regression model.

Although polynomial regression is technically a special case of multiple linear regression, the interpretation of a fitted polynomial regression model requires a somewhat different perspective. It is often difficult to interpret the individual coefficients in a polynomial regression fit, since the underlying monomials can be highly correlated. For example, x and x2 have correlation around 0.97 when x is uniformly distributed on the interval (0, 1). Although the correlation can be reduced by using orthogonal polynomials, it is generally more informative to consider the fitted regression function as a whole.

Source: https://en.wikipedia.org/wiki/Polynomial_regression
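The quoted correlation between x and x² is easy to verify numerically; a minimal sketch (the exact number varies slightly with the random sample):

import numpy as np

np.random.seed(0)
x = np.random.uniform(0, 1, 100000)   # x uniformly distributed on (0, 1)
print(np.corrcoef(x, x**2)[0, 1])     # correlation of x and x^2, roughly 0.97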

Simply put, some phenomena are seasonal or curved over time. Such phenomena can still be described with a linear function, but the regression equation will be inefficient and the model imprecise.
The ordinary linear regression equation:

$y = \beta_0 + \beta_1 x + \varepsilon$

[plot: linear fit]

The use of a quadratic variable in a regression equation can greatly improve model properties:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$

[plot: quadratic fit]

Source of plots: https://www.guru99.com/linear-classifier-tensorflow.html

Polynomial regression model in practice

Recently, I described a case on my blog where I had to predict CO air pollution. The explanatory variables were temperature and humidity. Despite using a time shift, it was not possible to build an effective model (R Square was 0.15).

I saved the data from that study: AirQ_shift

In [10]:
import pandas as pd

df = pd.read_csv('c:/TF/AirQ_shift.csv')
df.head(3)
Out[10]:
Unnamed: 0 DATE CO(GT) Shift_RH Shift_T Shift_weather_time
0 36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00
1 37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00
2 38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00

Linear regression model:

$y = \beta_0 + \beta_1 x + \varepsilon$

Declare the X and y variables for the model.

In [13]:
X = df[['Shift_RH', 'Shift_T']].values
y = df['CO(GT)'].values

I split the data into a training set and a test set.

In [14]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

In [15]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[15]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [16]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [19]:
from sklearn import metrics

print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.15437562015505324

Polynomial regression model:

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 z + \beta_4 z^2 + \varepsilon$

To build the polynomial regression model, you need to square the independent variables: temperature and relative humidity.
These squared variables will be treated as separate variables in the model.

So I create two new variables.

In [27]:
df['Shift_RH^2']=df['Shift_RH']**2
df['Shift_T^2']=df['Shift_T']**2
In [28]:
df.head(3)
Out[28]:
Unnamed: 0 DATE CO(GT) Shift_RH Shift_T Shift_weather_time Shift_RH^2 Shift_T^2
0 36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00 3375.61 110.25
1 37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00 3552.16 104.04
2 38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00 3294.76 116.64

We create the polynomial model:

Declare the new X and y variables for the model.
In [31]:
X = df[['Shift_RH', 'Shift_T', 'Shift_RH^2', 'Shift_T^2']].values
y = df['CO(GT)'].values
In [32]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[32]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [33]:
y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)
In [34]:
print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.15892318270410333

The change is negligible; econometric tools do not always improve the model significantly. What counts above all is knowing how to apply the technique.
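For completeness, sklearn can generate the polynomial terms automatically with PolynomialFeatures; a minimal sketch assuming the same df as above (degree=2 also adds the Shift_RH*Shift_T interaction, so the score may differ slightly):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

X = df[['Shift_RH', 'Shift_T']].values
y = df['CO(GT)'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# degree=2 generates Shift_RH^2, Shift_T^2 and the Shift_RH*Shift_T interaction term
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(X_train, y_train)
print('R Square:', metrics.r2_score(y_test, model.predict(X_test)))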

Tutorial: Linear Regression – Sklearn regression model (#3/281120191333)

In the previous part, we tried to build a model explaining the level of carbon monoxide pollution based on temperature and humidity. Despite the use of a time shift, the model proved to be inefficient. We will return to this model later. Meanwhile, we will now build a linear regression model in Sklearn and Tensorflow.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled2.csv')
df.head(3)
Out[1]:
Unnamed: 0 Unnamed: 0.1 Unnamed: 0.1.1 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH DATE Month Weekday Weekday_name Hours
0 0 0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1692.0 1268.0 13.6 48.9 0.7578 2004-03-10 18:00:00 3 2 Wednesday 18
1 1 1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1559.0 972.0 13.3 47.7 0.7255 2004-03-10 19:00:00 3 2 Wednesday 19
2 2 2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1555.0 1074.0 11.9 54.0 0.7502 2004-03-10 20:00:00 3 2 Wednesday 20

3 rows × 22 columns

In [2]:
del df['Unnamed: 0']

Checking the correlation with the result variable

We will consider carbon monoxide CO pollution as the result variable

In [3]:
CORREL = df.corr().sort_values('CO(GT)')
PCK = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PCK
Out[3]:
CO(GT)
PT08.S3(NOx) -0.715683
Weekday -0.140231
RH 0.020122
AH 0.025227
T 0.025639
Unnamed: 0.1 0.042099
Unnamed: 0.1.1 0.042099
Month 0.112291
Hours 0.344071
PT08.S4(NO2) 0.631854
NO2(GT) 0.682774
NOx(GT) 0.773677
PT08.S5(O3) 0.858762
PT08.S1(CO) 0.886114
PT08.S2(NMHC) 0.918386
C6H6(GT) 0.932584
CO(GT) 1.000000
In [4]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
PCK.plot(kind='barh', color='#6d9eeb')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[4]:
Text(0, 0.5, 'Continuous independent variables')

Part one: Linear regression in Sklearn

Declare the X and y variables for the model. Do not include columns in text or date format in the set of explanatory variables.

In [5]:
df.columns
Out[5]:
Index(['Unnamed: 0.1', 'Unnamed: 0.1.1', 'Date', 'Time', 'CO(GT)',
       'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)', 'PT08.S3(NOx)',
       'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH', 'AH', 'DATE',
       'Month', 'Weekday', 'Weekday_name', 'Hours'],
      dtype='object')
In [6]:
X = df[['PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
        ,'Month','Weekday','Hours']].values
y = df['CO(GT)'].values

I split the data into a training set and a test set.

In [7]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

In [8]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[8]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [9]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of the model's predictions with the actual values.

In [10]:
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
Out[10]:
CO(GT) Actual CO(GT)_Predicted
0 1.3 1.18
1 0.6 0.76
2 0.4 0.45
3 0.5 0.44
4 0.8 1.08
In [11]:
dfKK.to_csv('c:/TF/AAr2.csv')
In [12]:
from sklearn import metrics

dfKK.head(50).plot()
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x215167771d0>
In [13]:
print('R Square:     ', metrics.r2_score(y_test, y_pred))
R Square:      0.9202224478702138

The R Square parameter shows that the model fits the empirical data almost perfectly.

How to calculate the R Square parameter?

In the Sklearn package, the R-Square parameter is calculated easily and pleasantly. This is not always the case; for example, in the TensorFlow library the calculation of this indicator is difficult. Therefore, we will now calculate the R-Square index by hand.

R Square (the coefficient of determination) is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$

We will now calculate this difficult formula step by step for our example of linear regression.

Point 1. We calculate the square error

The squared error is the difference between the actual value and the prediction, raised to the second power.

In [14]:
dfKK.head(5)
Out[14]:
CO(GT) Actual CO(GT)_Predicted
0 1.3 1.18
1 0.6 0.76
2 0.4 0.45
3 0.5 0.44
4 0.8 1.08
In [15]:
dfKK['SSE'] = (dfKK['CO(GT) Actual'] - dfKK['CO(GT)_Predicted'])**2
dfKK.head(3)
Out[15]:
CO(GT) Actual CO(GT)_Predicted SSE
0 1.3 1.18 0.0144
1 0.6 0.76 0.0256
2 0.4 0.45 0.0025

Point 2. We calculate the average empirical value of y

In [16]:
dfKK['ave_y'] = dfKK['CO(GT) Actual'].mean()
dfKK.head(3)
Out[16]:
CO(GT) Actual CO(GT)_Predicted SSE ave_y
0 1.3 1.18 0.0144 2.086004
1 0.6 0.76 0.0256 2.086004
2 0.4 0.45 0.0025 2.086004

Point 3. We calculate the squared difference between the empirical values of y and the average of the empirical values of y

In [17]:
dfKK['SST'] = (dfKK['CO(GT) Actual'] - dfKK['ave_y'])**2
dfKK.head(3)
Out[17]:
CO(GT) Actual CO(GT)_Predicted SSE ave_y SST
0 1.3 1.18 0.0144 2.086004 0.617803
1 0.6 0.76 0.0256 2.086004 2.208209
2 0.4 0.45 0.0025 2.086004 2.842610

Point 4. We calculate the difference between sum of SST and sum of SSE

In [18]:
Sum_SST = dfKK['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = dfKK['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3829.9533119658117
Sum_SSE : 305.5443

Point 5. We calculate the R Square parameter

In [20]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.9202224478702138
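The five points above can be wrapped in one small helper; a minimal sketch using NumPy (assuming the dfKK frame from above):

import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SSE / SST."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)           # residual sum of squares
    sst = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return 1.0 - sse / sst

print('R Square parameter: ', r_squared(dfKK['CO(GT) Actual'], dfKK['CO(GT)_Predicted']))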

Tutorial: Linear Regression – Tensorflow, calculation of R Square (#4/281120191525)

We continue to learn how to build multiple linear regression models. This time we will build a model using the Tensorflow library. As before, the data file: AirQ_filled2.csv comes from previous episodes of this cycle.

In [1]:
import tensorflow as tf
import pandas as pd

df = pd.read_csv('c:/TF/AirQ_filled2.csv', usecols=['CO(GT)','PT08.S1(CO)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH', 'AH'
        ,'Month','Weekday','Hours'])
df.head(3)

Out[1]:
CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Month Weekday Hours
0 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578 3 2 18
1 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255 3 2 19
2 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502 3 2 20

Step 1: Convert Data

We convert the numeric variables into the correct Tensorflow format. Tensorflow provides a conversion method for continuous variables: tf.feature_column.numeric_column().

We separate the columns into independent variables and the dependent variable.

In [2]:
df.columns
Out[2]:
Index(['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)', 'NOx(GT)',
       'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 'PT08.S5(O3)', 'T', 'RH',
       'AH', 'Month', 'Weekday', 'Hours'],
      dtype='object')
In [3]:
df.columns = ['CO_GT', 'PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
In [4]:
df.dtypes
Out[4]:
CO_GT           float64
PT08.S1_CO      float64
C6H6_GT         float64
PT08.S2_NMHC    float64
NOx_GT          float64
PT08.S3_NOx     float64
NO2_GT          float64
PT08.S4_NO2     float64
PT08.S5_O3      float64
T               float64
RH              float64
AH              float64
Month             int64
Weekday           int64
Hours             int64
dtype: object
In [5]:
FEATURES = ['PT08.S1_CO', 'C6H6_GT', 'PT08.S2_NMHC',
       'NOx_GT', 'PT08.S3_NOx', 'NO2_GT', 'PT08.S4_NO2', 'PT08.S5_O3',
       'T', 'RH', 'AH', 'Month', 'Weekday', 'Hours']
LABEL = 'CO_GT'
In [6]:
PKS = [tf.feature_column.numeric_column(k) for k in FEATURES]
PKS
Out[6]:
[_NumericColumn(key='PT08.S1_CO', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='C6H6_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S2_NMHC', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NOx_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S3_NOx', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='NO2_GT', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S4_NO2', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='PT08.S5_O3', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='T', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='RH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='AH', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Month', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Weekday', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='Hours', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]

Step 2: Defining the estimator

Tensorflow will automatically create a directory called "Air" in your working directory. You must use this path to access Tensorboard. The estimator is given the feature columns for the independent variables.

In [7]:
estimator = tf.estimator.LinearRegressor(    
        feature_columns=PKS,   
        model_dir="Air")
INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': 'Air', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000017E850F7CC0>, '_task_type': 'worker', '_task_id': 0, '_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}

To instruct Tensorflow how to feed the model, you can use pandas_input_fn. This object needs 5 parameters:
- x: feature data
- y: label data
- batch_size: batch size, default 128
- num_epochs: number of epochs, default 1
- shuffle: whether or not to shuffle the data, default None

In [8]:
def get_input_fn(data_set, num_epochs=None, n_batch = 128, shuffle=True):    
         return tf.estimator.inputs.pandas_input_fn(       
         x=pd.DataFrame({k: data_set[k].values for k in FEATURES}),       
         y = pd.Series(data_set[LABEL].values),       
         batch_size=n_batch,          
         num_epochs=num_epochs,       
         shuffle=shuffle)

Step 3: Model training

- To feed the model you can use the function created above: get_input_fn.
- Then you instruct the model to iterate 1000 times.
- Remember that you do not specify the number of epochs (num_epochs).
- It is better to set the number of epochs to none and define the number of iterations.

To test the model, we must divide the data set into a test set and a training set.

In [9]:
df_train=df.sample(frac=0.8,random_state=200)
df_test=df.drop(df_train.index)
print(df_train.shape, df_test.shape)
(7486, 15) (1871, 15)
In [10]:
estimator.train(input_fn=get_input_fn(df_train,                                       
                                           num_epochs=None,                                      
                                           n_batch = 128,                                      
                                           shuffle=False),                                      
                                           steps=1000)
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Restoring parameters from Air\model.ckpt-10000
INFO:tensorflow:Saving checkpoints for 10001 into Air\model.ckpt.
INFO:tensorflow:loss = 27.90989, step = 10001
INFO:tensorflow:global_step/sec: 231.067
INFO:tensorflow:loss = 19.266008, step = 10101 (0.443 sec)
INFO:tensorflow:global_step/sec: 250.047
INFO:tensorflow:loss = 21.174185, step = 10201 (0.389 sec)
INFO:tensorflow:global_step/sec: 244.378
INFO:tensorflow:loss = 26.823406, step = 10301 (0.409 sec)
INFO:tensorflow:global_step/sec: 263.037
INFO:tensorflow:loss = 16.690845, step = 10401 (0.380 sec)
INFO:tensorflow:global_step/sec: 250.698
INFO:tensorflow:loss = 24.08421, step = 10501 (0.399 sec)
INFO:tensorflow:global_step/sec: 254.447
INFO:tensorflow:loss = 16.630123, step = 10601 (0.406 sec)
INFO:tensorflow:global_step/sec: 248.812
INFO:tensorflow:loss = 25.998842, step = 10701 (0.389 sec)
INFO:tensorflow:global_step/sec: 269.371
INFO:tensorflow:loss = 31.432064, step = 10801 (0.387 sec)
INFO:tensorflow:global_step/sec: 255.634
INFO:tensorflow:loss = 22.70269, step = 10901 (0.391 sec)
INFO:tensorflow:Saving checkpoints for 11000 into Air\model.ckpt.
INFO:tensorflow:Loss for final step: 24.21025.
Out[10]:
<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x17e850f7828>

Step 4. Model evaluation

To evaluate the model on the test set, use the following code:

In [11]:
ev = estimator.evaluate(    
          input_fn=get_input_fn(df_test,                          
          num_epochs=1,                          
          n_batch = 356,                          
          shuffle=False))
INFO:tensorflow:Starting evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Restoring parameters from Air\model.ckpt-11000
INFO:tensorflow:Finished evaluation at 2019-11-28-13:40:17
INFO:tensorflow:Saving dict for global step 11000: average_loss = 0.18934268, global_step = 11000, loss = 59.04336

Print the loss using the code below:

In [12]:
loss_score = ev["loss"]
print("Loss: {0:f}".format(loss_score))	
Loss: 59.043362

Calculation of R Square parameter using Tensorflow

I make predictions on the test set

In [13]:
y = estimator.predict(    
         input_fn=get_input_fn(df_test,                          
         num_epochs=1,                          
         n_batch = 256,                          
         shuffle=False))
In [14]:
import itertools

predictions = list(p["predictions"] for p in itertools.islice(y, 1871))
#print("Predictions: {}".format(str(predictions)))
INFO:tensorflow:Restoring parameters from Air\model.ckpt-11000
In [15]:
predictions
Out[15]:
[array([2.2904341], dtype=float32),
 array([1.4195127], dtype=float32),
 array([0.9917113], dtype=float32),
 array([1.4134599], dtype=float32),
 array([1.2086823], dtype=float32),
 array([1.4521222], dtype=float32),
 ...]

The model returned a sequence of predictions y. I now convert this sequence into an array.

In [16]:
import numpy as np

conc = np.vstack(predictions)
conc
Out[16]:
array([[2.2904341],
       [1.4195127],
       [0.9917113],
       ...,
       [1.2040666],
       [0.4435346],
       [3.111309 ]], dtype=float32)
In [48]:
ZHP = pd.DataFrame(conc)
ZHP.rename(columns={0:'y_pred'}, inplace=True)

kot = ZHP['y_pred'].values
kot = kot.astype('float32')
kot.dtype
Out[48]:
dtype('float32')

Now I’m creating a list of real y values from the test set.

In [50]:
y = df_test['CO_GT'].values
y = y.astype('float32')
y.dtype
Out[50]:
dtype('float32')

Now I create a dataframe with y-real and y-predicted variables.

In [47]:
PZU = pd.DataFrame({'y': y, 'y_pred': kot })
PZU.dtypes
Out[47]:
y         float64
y_pred    float64
dtype: object
In [63]:
def R_squared(y, y_pred):
    
  residual = tf.reduce_sum(tf.square(tf.subtract(y,y_pred)))
  total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
  r2 = tf.subtract(1.0, tf.div(residual, total))
  return r2

To use this function, both variables must have the same data type.

In [51]:
y.dtype
Out[51]:
dtype('float32')
In [52]:
kot.dtype
Out[52]:
dtype('float32')
In [65]:
residual = tf.reduce_sum(tf.square(tf.subtract(y,kot)))
In [66]:
total = tf.reduce_sum(tf.square(tf.subtract(y, tf.reduce_mean(y))))
In [67]:
r2 = tf.subtract(1.0, tf.div(residual, total))
In [68]:
r2
Out[68]:
<tf.Tensor 'Sub_27:0' shape=() dtype=float32>
In [77]:
sess = tf.Session()
a = sess.run(r2)
print('R Square parameter: ',a)
R Square parameter:  0.90320766
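Note that the R_squared helper defined in In [63] gives the same number in a single call; a short sketch (TensorFlow 1.x session API assumed):

with tf.Session() as sess:
    # y and kot are the float32 arrays of actual and predicted values from above
    print('R Square parameter: ', sess.run(R_squared(y, kot)))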

Calculation of R Square parameter using Pandas

In [78]:
PZU.head(5)
Out[78]:
y y_pred
0 2.2 2.290434
1 1.2 1.419513
2 1.0 0.991711
3 1.5 1.413460
4 1.6 1.471673
In [80]:
PZU['SSE'] = (PZU['y'] - PZU['y_pred'])**2
PZU.head(3)
Out[80]:
y y_pred SSE
0 2.2 2.290434 0.008178
1 1.2 1.419513 0.048186
2 1.0 0.991711 0.000069

Point 2. We calculate the average empirical value of y

In [81]:
PZU['ave_y'] = PZU['y'].mean()
PZU.head(3)
Out[81]:
y y_pred SSE ave_y
0 2.2 2.290434 0.008178 2.061304
1 1.2 1.419513 0.048186 2.061304
2 1.0 0.991711 0.000069 2.061304

Point 3. We calculate the squared difference between the empirical values of y and the average of the empirical values of y

In [83]:
PZU['SST'] = (PZU['y'] - PZU['ave_y'])**2
PZU.head(3)
Out[83]:
y y_pred SSE ave_y SST
0 2.2 2.290434 0.008178 2.061304 0.019237
1 1.2 1.419513 0.048186 2.061304 0.741845
2 1.0 0.991711 0.000069 2.061304 1.126366

Point 4. We calculate the difference between sum of SST and sum of SSE

In [84]:
Sum_SST = PZU['SST'].sum()
print('Sum_SST :',Sum_SST)
Sum_SSE = PZU['SSE'].sum()
print('Sum_SSE :',Sum_SSE)
SSR = Sum_SST - Sum_SSE
Sum_SST : 3659.9984179583107
Sum_SSE : 354.26016629427124

Point 5. We calculate the R Square parameter

In [85]:
r2 = SSR/Sum_SST
print('R Square parameter: ',r2)
R Square parameter:  0.903207562998923
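As a cross-check, sklearn gives the same value directly (assuming the PZU frame from above):

from sklearn import metrics

print('R Square parameter: ', metrics.r2_score(PZU['y'], PZU['y_pred']))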

Tutorial: Linear Regression – preliminary data preparation (#1/271120191024)

AirQualityUCI. Source of data: https://archive.ics.uci.edu/ml/datasets/Air+Quality
 

Part 1. Preliminary data preparation

 
In [1]:
import pandas as pd
df = pd.read_csv('c:/TS/AirQualityUCI.csv', sep=';')
df.head(3)
Out[1]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH Unnamed: 15 Unnamed: 16
0 10/03/2004 18.00.00 2,6 1360.0 150.0 11,9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13,6 48,9 0,7578 NaN NaN
1 10/03/2004 19.00.00 2 1292.0 112.0 9,4 955.0 103.0 1174.0 92.0 1559.0 972.0 13,3 47,7 0,7255 NaN NaN
2 10/03/2004 20.00.00 2,2 1402.0 88.0 9,0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11,9 54,0 0,7502 NaN NaN
 

Data Set Information:


The dataset contains 9358 instances of hourly averaged responses from an array of 5 metal oxide chemical sensors embedded in an Air Quality Chemical Multisensor Device. The device was located on the field in a significantly polluted area, at road level,within an Italian city. Data were recorded from March 2004 to February 2005 (one year)representing the longest freely available recordings of on field deployed air quality chemical sensor devices responses. Ground Truth hourly averaged concentrations for CO, Non Metanic Hydrocarbons, Benzene, Total Nitrogen Oxides (NOx) and Nitrogen Dioxide (NO2) and were provided by a co-located reference certified analyzer. Evidences of cross-sensitivities as well as both concept and sensor drifts are present as described in De Vito et al., Sens. And Act. B, Vol. 129,2,2008 (citation required) eventually affecting sensors concentration estimation capabilities. Missing values are tagged with -200 value.
This dataset can be used exclusively for research purposes. Commercial purposes are fully excluded.


Attribute Information:

Date (DD/MM/YYYY)
Time (HH.MM.SS)
True hourly averaged concentration CO in mg/m^3 (reference analyzer)
PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
True hourly averaged overall Non Metanic HydroCarbons concentration in microg/m^3 (reference analyzer)
True hourly averaged Benzene concentration in microg/m^3 (reference analyzer)
PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
True hourly averaged NOx concentration in ppb (reference analyzer)
PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
True hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
Temperature in °C
Relative Humidity (%)
AH Absolute Humidity


Tutorial: Supplementing data for further analysis

 

Step 1. Data completeness check

In [2]:
df.isnull().sum()
Out[2]:
Date              114
Time              114
CO(GT)            114
PT08.S1(CO)       114
NMHC(GT)          114
C6H6(GT)          114
PT08.S2(NMHC)     114
NOx(GT)           114
PT08.S3(NOx)      114
NO2(GT)           114
PT08.S4(NO2)      114
PT08.S5(O3)       114
T                 114
RH                114
AH                114
Unnamed: 15      9471
Unnamed: 16      9471
dtype: int64
 

There are a lot of missing values. In addition, we learned that the value -200 means no data. We’ll deal with this in a moment. We will now check the statistics of variables in the database.

In [3]:
df.agg(['min', 'max', 'mean', 'median'])
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\numpy\lib\nanfunctions.py:1112: RuntimeWarning: Mean of empty slice
  return np.nanmean(a, axis, out=out, keepdims=keepdims)
Out[3]:
  PT08.S1(CO) NMHC(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) Unnamed: 15 Unnamed: 16
min -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 -200.000000 NaN NaN
max 2040.000000 1189.000000 2214.000000 1479.000000 2683.000000 340.000000 2775.000000 2523.000000 NaN NaN
mean 1048.990061 -159.090093 894.595276 168.616971 794.990168 58.148873 1391.479641 975.072032 NaN NaN
median 1053.000000 -200.000000 895.000000 141.000000 794.000000 96.000000 1446.000000 942.000000 NaN NaN
In [4]:
df.shape
Out[4]:
(9471, 17)
 

We delete two empty columns.

In [5]:
del df['Unnamed: 15']
del df['Unnamed: 16']
 

Step 1: Preliminary analysis of data gaps

One more look at how many NaN cells there are.

In [6]:
df.isnull().sum()
Out[6]:
Date             114
Time             114
CO(GT)           114
PT08.S1(CO)      114
NMHC(GT)         114
C6H6(GT)         114
PT08.S2(NMHC)    114
NOx(GT)          114
PT08.S3(NOx)     114
NO2(GT)          114
PT08.S4(NO2)     114
PT08.S5(O3)      114
T                114
RH               114
AH               114
dtype: int64
 

Now I will try to see these empty cells.

In [7]:
df[df['NMHC(GT)'].isnull()]
Out[7]:
  Date Time CO(GT) PT08.S1(CO) NMHC(GT) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9357 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9358 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9359 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9360 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9361 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9466 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9467 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9468 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9469 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
9470 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

114 rows × 15 columns

 

These are completely empty records. The device was probably cut off from the power supply, so no sensor was working.

In [8]:
df = df.dropna(how='all')
df.isnull().sum()
Out[8]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

We are looking for values of -200, because they mean there is no data. The -200 values are stored in different formats, so I have to carry out the replacement in several ways.

In [9]:
import numpy as np

df = df.replace(-200,np.NaN)
df = df.replace('-200',np.NaN)
df = df.replace('-200.0',np.NaN)
df = df.replace('-200,0',np.NaN)
 

The value of -200 has been changed to NaN and we will see how many empty records there are now.

In [10]:
df.isnull().sum()
Out[10]:
Date                0
Time                0
CO(GT)           1683
PT08.S1(CO)       366
NMHC(GT)         8443
C6H6(GT)          366
PT08.S2(NMHC)     366
NOx(GT)          1639
PT08.S3(NOx)      366
NO2(GT)          1642
PT08.S4(NO2)      366
PT08.S5(O3)       366
T                 366
RH                366
AH                366
dtype: int64
 

Chart of missing data structure.

In [11]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
plt.show
Out[11]:
<function matplotlib.pyplot.show(*args, **kw)>
 

The NMHC(GT) variable is the most incomplete, so we eliminate it from the analysis

In [12]:
del df['NMHC(GT)']
 

We display records with missing data – the isna() function

In [13]:
df1 = df[df.isna().any(axis=1)]
df1
Out[13]:
  Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
9 11/03/2004 03.00.00 0,6 1010.0 1,7 561.0 NaN 1705.0 NaN 1235.0 501.0 10,3 60,2 0,7517
10 11/03/2004 04.00.00 NaN 1011.0 1,3 527.0 21.0 1818.0 34.0 1197.0 445.0 10,1 60,5 0,7465
33 12/03/2004 03.00.00 0,8 889.0 1,9 574.0 NaN 1680.0 NaN 1187.0 512.0 7,0 62,3 0,6261
34 12/03/2004 04.00.00 NaN 831.0 1,1 506.0 21.0 1893.0 32.0 1134.0 384.0 6,1 65,9 0,6248
39 12/03/2004 09.00.00 NaN 1545.0 22,1 1353.0 NaN 767.0 NaN 2058.0 1588.0 9,2 56,2 0,6561
9058 23/03/2005 04.00.00 NaN 993.0 2,3 604.0 85.0 848.0 65.0 1160.0 762.0 14,5 66,4 1,0919
9130 26/03/2005 04.00.00 NaN 1122.0 6,0 811.0 181.0 641.0 92.0 1336.0 1122.0 16,2 71,2 1,3013
9202 29/03/2005 04.00.00 NaN 883.0 1,3 530.0 63.0 997.0 46.0 1102.0 617.0 13,7 68,2 1,0611
9274 01/04/2005 04.00.00 NaN 818.0 0,8 473.0 47.0 1257.0 41.0 898.0 323.0 13,7 48,8 0,7606
9346 04/04/2005 04.00.00 NaN 864.0 0,8 478.0 52.0 1116.0 43.0 958.0 489.0 11,8 56,0 0,7743

2416 rows × 14 columns

 

Step 2: Check the level of direct correlation to complete the data

CO(GT) has no data every few measurements, so we have to check which variables it correlates with. First I check the data types so that the correlation can be computed.

In [14]:
df.dtypes
Out[14]:
Date              object
Time              object
CO(GT)            object
PT08.S1(CO)      float64
C6H6(GT)          object
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                 object
RH                object
AH                object
dtype: object
In [15]:
# df['CO(GT)'].astype(float)
 

ValueError: could not convert string to float: '2,6'

It turns out that it is not so easy to convert text to number format – the problem is the decimal commas. We replace the commas with dots.

In [16]:
df['CO(GT)'] = df['CO(GT)'].str.replace(',','.')
df['C6H6(GT)'] = df['C6H6(GT)'].str.replace(',','.')
df['T'] = df['T'].str.replace(',','.')
df['RH'] = df['RH'].str.replace(',','.')
df['AH'] = df['AH'].str.replace(',','.')
 

We change the format from object to float

In [17]:
df[['CO(GT)','C6H6(GT)', 'T','RH','AH']] = df[['CO(GT)','C6H6(GT)', 'T','RH','AH']].astype(float)
In [18]:
df.dtypes
Out[18]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
 

We can now check the level of direct correlation.

In [19]:
df.corr()
Out[19]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [20]:
sns.set(style="ticks")

corr = df.corr()

mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize=(22, 10))
cmap = sns.diverging_palette(180, 50, as_cmap=True)

sns.heatmap(corr, mask=mask, cmap=cmap, vmax=1.3, center=0.1,annot=True,
            square=True, linewidths=.9, cbar_kws={"shrink": 0.8})
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8b43390>
 

Step 3. Filling the gaps in variables based on other variables correlated with it

Filling gaps in the CO (GT) variable.

I check which variable it is most strongly correlated with and fill the gaps based on that variable; where that is not possible, I fill with the previous or next value.
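To make this strategy concrete, here is a toy sketch (hypothetical values) of the groupby forward/backward fill used in the cells below:

import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'C6H6(GT)': [1.7, 1.7, 1.3, 1.3],          # strongly correlated, mostly complete variable
    'CO(GT)':   [0.6, np.nan, np.nan, 0.5],    # variable with gaps
})
# Within each group of identical C6H6(GT) readings, copy the known CO(GT)
# value forward and backward onto the missing rows of that group.
# group_keys=False keeps the original index so the result can be assigned back.
toy['CO(GT)'] = toy.groupby('C6H6(GT)', group_keys=False)['CO(GT)'].apply(lambda x: x.ffill().bfill())
print(toy)   # the NaN rows become 0.6 and 0.5 respectively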

In [21]:
df.corr()
Out[21]:
  CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
CO(GT) 1.000000 0.879288 0.931078 0.915514 0.795028 -0.703446 0.683343 0.630703 0.854182 0.022109 0.048890 0.048556
PT08.S1(CO) 0.879288 1.000000 0.883795 0.892964 0.713654 -0.771938 0.641529 0.682881 0.899324 0.048627 0.114606 0.135324
C6H6(GT) 0.931078 0.883795 1.000000 0.981950 0.718839 -0.735744 0.614474 0.765731 0.865689 0.198956 -0.061681 0.167972
PT08.S2(NMHC) 0.915514 0.892964 0.981950 1.000000 0.704435 -0.796703 0.646245 0.777254 0.880578 0.241373 -0.090380 0.186933
NOx(GT) 0.795028 0.713654 0.718839 0.704435 1.000000 -0.655707 0.763111 0.233731 0.787046 -0.269683 0.221032 -0.149323
PT08.S3(NOx) -0.703446 -0.771938 -0.735744 -0.796703 -0.655707 1.000000 -0.652083 -0.538468 -0.796569 -0.145112 -0.056740 -0.232017
NO2(GT) 0.683343 0.641529 0.614474 0.646245 0.763111 -0.652083 1.000000 0.157360 0.708128 -0.186533 -0.091759 -0.335022
PT08.S4(NO2) 0.630703 0.682881 0.765731 0.777254 0.233731 -0.538468 0.157360 1.000000 0.591144 0.561270 -0.032188 0.629641
PT08.S5(O3) 0.854182 0.899324 0.865689 0.880578 0.787046 -0.796569 0.708128 0.591144 1.000000 -0.027172 0.124956 0.070751
T 0.022109 0.048627 0.198956 0.241373 -0.269683 -0.145112 -0.186533 0.561270 -0.027172 1.000000 -0.578621 0.656397
RH 0.048890 0.114606 -0.061681 -0.090380 0.221032 -0.056740 -0.091759 -0.032188 0.124956 -0.578621 1.000000 0.167971
AH 0.048556 0.135324 0.167972 0.186933 -0.149323 -0.232017 -0.335022 0.629641 0.070751 0.656397 0.167971 1.000000
In [22]:
df.dtypes
Out[22]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
In [23]:
print('missing value in CO(GT): ',df['CO(GT)'].isnull().sum())
missing value in CO(GT):  1683
 

CO (GT) correlation with other variables.

In [24]:
CORREL = df.corr()
CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
Out[24]:
  CO(GT)
PT08.S3(NOx) -0.703446
T 0.022109
AH 0.048556
RH 0.048890
PT08.S4(NO2) 0.630703
NO2(GT) 0.683343
NOx(GT) 0.795028
PT08.S5(O3) 0.854182
PT08.S1(CO) 0.879288
PT08.S2(NMHC) 0.915514
C6H6(GT) 0.931078
CO(GT) 1.000000
 

The strongest correlation with CO(GT) is with C6H6(GT), which is fairly complete. Based on this variable, I fill in the missing values in CO(GT).

In [25]:
df['CO(GT)'] = df.groupby('C6H6(GT)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [26]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  383
In [27]:
df['CO(GT)'] = df.groupby('PT08.S1(CO)')['CO(GT)'].apply(lambda x: x.ffill().bfill())
In [28]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  370
 

Now I do a simple forward fill – the last valid value is carried forward.

In [29]:
df['CO(GT)'].fillna(method='ffill', inplace=True)   
In [30]:
print('missing value: ',df['CO(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'C6H6 (GT)’

In [31]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  366
In [32]:
df['C6H6(GT)'] = df.groupby('CO(GT)')['C6H6(GT)'].apply(lambda x: x.ffill().bfill())
In [33]:
print('missing value: ',df['C6H6(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NOx(GT)’

In [34]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  1639
In [35]:
CORREL['NOx(GT)'].to_frame().sort_values('NOx(GT)')
Out[35]:
  NOx(GT)
PT08.S3(NOx) -0.655707
T -0.269683
AH -0.149323
RH 0.221032
PT08.S4(NO2) 0.233731
PT08.S2(NMHC) 0.704435
PT08.S1(CO) 0.713654
C6H6(GT) 0.718839
NO2(GT) 0.763111
PT08.S5(O3) 0.787046
CO(GT) 0.795028
NOx(GT) 1.000000
In [36]:
df['NOx(GT)'] = df.groupby('CO(GT)')['NOx(GT)'].apply(lambda x: x.ffill().bfill())
In [37]:
print('missing value: ',df['NOx(GT)'].isnull().sum())
missing value:  0
 

Filling gaps in the variable 'NO2(GT)'

In [38]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  1642
In [39]:
CORREL['NO2(GT)'].to_frame().sort_values('NO2(GT)')
Out[39]:
  NO2(GT)
PT08.S3(NOx) -0.652083
AH -0.335022
T -0.186533
RH -0.091759
PT08.S4(NO2) 0.157360
C6H6(GT) 0.614474
PT08.S1(CO) 0.641529
PT08.S2(NMHC) 0.646245
CO(GT) 0.683343
PT08.S5(O3) 0.708128
NOx(GT) 0.763111
NO2(GT) 1.000000
In [40]:
df['NO2(GT)'] = df.groupby('PT08.S5(O3)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [41]:
df['NO2(GT)'] = df.groupby('CO(GT)')['NO2(GT)'].apply(lambda x: x.ffill().bfill())
In [42]:
print('missing value: ',df['NO2(GT)'].isnull().sum())
missing value:  0
In [43]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='YlGnBu')
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad8fea080>
 

I fill in the records where the entire measuring device was not working.

In the heatmap these can be seen as solid lines.

In [44]:
df.shape
Out[44]:
(9357, 14)
In [45]:
df.fillna(method='ffill', inplace=True)
In [46]:
df.shape
Out[46]:
(9357, 14)
In [47]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='Reds')
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x18ad95756d8>
In [48]:
df.isnull().sum()
Out[48]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
 

The data have been completed! In the next parts of this tutorial, we will carry out the process of building a linear regression model in TensorFlow.

Let’s save the completed file to disk

df.to_csv('c:/TF/AirQ_filled.csv')
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)


Tutorial: Linear Regression – Time variables and shifts. Use of offset in variable correlation (#2/271120191334)

Part 2. Simple multifactorial linear regression

In the previous part of this tutorial, we cleaned the data file from the measuring station. A new, completed measurement data file has been created, which we will now open.

We will now continue to prepare the data for further analysis. One of the most important explanatory variables in linear regression is time. Most man-made and natural phenomena follow hourly, daily and monthly cycles.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

Step 1. Creating the time variable

We check what the date format is

In [2]:
df[['Date','Time']].dtypes
Out[2]:
Date    object
Time    object
dtype: object

There is no date format in the dataframe, so we join the columns containing the date and the time.

In [3]:
df['DATE'] = df['Date']+' '+df['Time']
df['DATE'].head()
Out[3]:
0    10/03/2004 18.00.00
1    10/03/2004 19.00.00
2    10/03/2004 20.00.00
3    10/03/2004 21.00.00
4    10/03/2004 22.00.00
Name: DATE, dtype: object

We create a new column containing the date and time. Then we convert the object format to the date format.

In [4]:
df['DATE'] = pd.to_datetime(df.DATE, format='%d/%m/%Y %H.%M.%S')   # Date is DD/MM/YYYY, Time is HH.MM.SS
df.dtypes
Out[4]:
Unnamed: 0                int64
Date                     object
Time                     object
CO(GT)                  float64
PT08.S1(CO)             float64
C6H6(GT)                float64
PT08.S2(NMHC)           float64
NOx(GT)                 float64
PT08.S3(NOx)            float64
NO2(GT)                 float64
PT08.S4(NO2)            float64
PT08.S5(O3)             float64
T                       float64
RH                      float64
AH                      float64
DATE             datetime64[ns]
dtype: object

Step 2. We add more columns based on the time variable

In industry, the day of the week is very important, so in such models it is worth adding a column with the number of the day.

In [5]:
df['Month'] = df['DATE'].dt.month
df['Weekday'] = df['DATE'].dt.weekday
df['Weekday_name'] = df['DATE'].dt.weekday_name
df['Hours'] = df['DATE'].dt.hour
In [6]:
df[['DATE','Month','Weekday','Weekday_name','Hours']].sample(3)
Out[6]:
DATE Month Weekday Weekday_name Hours
6109 2004-11-20 07:00:00 11 5 Saturday 7
3537 2004-08-05 03:00:00 8 3 Thursday 3
8053 2005-02-09 07:00:00 2 2 Wednesday 7

Graphical analysis of pollution according to time variables

In [7]:
df.pivot_table(index='Weekday_name', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20bbdb38>
In [8]:
df.pivot_table(index='Month', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f20f02320>
In [9]:
df.pivot_table(index='Hours', values='CO(GT)', aggfunc='mean').plot(kind='bar')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f210e77f0>

Step 3. Correlation analysis

We set the result variable as:

CO(GT) – actual hourly average CO concentration in mg / m^3 (reference analyzer)

In [10]:
del df['Unnamed: 0']
In [11]:
CORREL = df.corr()
PKP = CORREL['CO(GT)'].to_frame().sort_values('CO(GT)')
PKP
Out[11]:
CO(GT)
PT08.S3(NOx) -0.715683
Weekday -0.140231
RH 0.020122
AH 0.025227
T 0.025639
Month 0.112291
Hours 0.344071
PT08.S4(NO2) 0.631854
NO2(GT) 0.682774
NOx(GT) 0.773677
PT08.S5(O3) 0.858762
PT08.S1(CO) 0.886114
PT08.S2(NMHC) 0.918386
C6H6(GT) 0.932584
CO(GT) 1.000000
In [12]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
PKP.plot(kind='barh', color='red')
plt.title('Correlation with the resulting variable: CO ', fontsize=20)
plt.xlabel('Correlation level')
plt.ylabel('Continuous independent variables')
Out[12]:
Text(0, 0.5, 'Continuous independent variables')
<Figure size 720x576 with 0 Axes>

The time-based variables are not strongly correlated with the dependent variable CO(GT).
It is tempting to use the better-correlated independent variables in the model. The problem is that these variables are effectively part of the result: when pollution occurs, all of these substances are present in the air at once.

Our task is to examine how weather and time affect the level of pollution. We’ll cover this task in the next part of the tutorial.

Step 4. We now check the shift for independent variables with a low direct correlation

How does weather affect CO levels?

Variable RH – Relative humidity (%)

We check a variable that has a very low direct correlation with the resulting CO(GT) variable.

In [13]:
def cross_corr(x, y, lag=0):
    return x.corr(y.shift(lag))

def shift_Factor(x,y,R):
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    
    # R is the number of shifts that should be checked by the function
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)
    
    # We find optimal correlation shift
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()
    
    return SF
In [14]:
x = df.RH       # independent variable
y = df['CO(GT)']    # dependent variable
R = 20           # number of shifts that will be checked
In [15]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for RH: ',SKO)
Optimal shift for RH:  12
In [16]:
cross_corr(x, y, lag=SKO)
Out[16]:
0.39204313671898056

Variable AH – Absolute humidity

We check a variable that has a very low direct correlation with the resulting CO(GT) variable.

In [17]:
x = df.AH       # independent variable
SKP = shift_Factor(x,y,R)
print('Optimal shift for AH: ',SKP)
Optimal shift for AH:  12
In [18]:
cross_corr(x, y, lag=SKP)
Out[18]:
0.043756364102677595

Absolute humidity (AH) does not correlate with the CO(GT) variable even after the shift, so we eliminate it from the model.

Variable T – Temperature in °C

We check a variable that has a very low direct correlation with the resulting CO(GT) variable.

In [19]:
x = df['T']      # independent variable
PKP = shift_Factor(x,y,R)
print('Optimal shift for T: ',PKP)
Optimal shift for T:  12
In [20]:
cross_corr(x, y, lag=PKP)
Out[20]:
-0.22446569561762522

We are now creating a new DataFrame with a 12-hour shift

It turns out that the strongest correlation between temperature or humidity and the CO concentration appears at a 12-hour offset.
Below is the data shift creation function.

In [21]:
def df_shif(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return  pd.DataFrame(data=new)
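
As a design note, the same behaviour can also be written with a vectorised pandas shift over the non-target columns. This is only a minimal sketch with a hypothetical name (df_shift_alt); the notebook itself keeps using df_shif defined above.

def df_shift_alt(frame, target, lag):
    # Same idea as df_shif: shift every column except `target` down by `lag` rows.
    out = frame.copy()
    other = [c for c in out.columns if c != target]
    out[other] = out[other].shift(periods=lag)
    return out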

Our goal is to create a multiple regression model:

- The independent variables are: Temperature (T) and Relative humidity RH (%)
- The dependent variable is the level of CO(GT)
In [22]:
df2 = df[['DATE', 'CO(GT)','RH', 'T']]

We add a copy of the date and time that records when the temperature and humidity were measured.

In [23]:
df2['weather_time'] = df2['DATE']
C:\ProgramData\Anaconda3\envs\OLD_TF\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.
In [24]:
df2.head(3)
Out[24]:
DATE CO(GT) RH T weather_time
0 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-10 18:00:00
1 2004-03-10 19:00:00 2.0 47.7 13.3 2004-03-10 19:00:00
2 2004-03-10 20:00:00 2.2 54.0 11.9 2004-03-10 20:00:00
In [25]:
df3 = df_shif(df2, 'weather_time', lag=12)
df3.rename(columns={'weather_time':'Shift_weather_time'}, inplace=True) 
df3.head(13)
Out[25]:
DATE CO(GT) RH T Shift_weather_time
0 NaT NaN NaN NaN 2004-03-10 18:00:00
1 NaT NaN NaN NaN 2004-03-10 19:00:00
2 NaT NaN NaN NaN 2004-03-10 20:00:00
3 NaT NaN NaN NaN 2004-03-10 21:00:00
4 NaT NaN NaN NaN 2004-03-10 22:00:00
5 NaT NaN NaN NaN 2004-03-10 23:00:00
6 NaT NaN NaN NaN 2004-03-11 00:00:00
7 NaT NaN NaN NaN 2004-03-11 01:00:00
8 NaT NaN NaN NaN 2004-03-11 02:00:00
9 NaT NaN NaN NaN 2004-03-11 03:00:00
10 NaT NaN NaN NaN 2004-03-11 04:00:00
11 NaT NaN NaN NaN 2004-03-11 05:00:00
12 2004-03-10 18:00:00 2.6 48.9 13.6 2004-03-11 06:00:00
In [26]:
df4 = df_shif(df3, 'RH', lag=12)
df4.rename(columns={'RH':'Shift_RH'}, inplace=True) 
In [27]:
df5 = df_shif(df4, 'T', lag=12)
df5.rename(columns={'T':'Shift_T'}, inplace=True) 

Deletes rows with incomplete data.

In [28]:
df5 = df5.dropna(how ='any')
In [29]:
df5.head()
Out[29]:
DATE CO(GT) Shift_RH Shift_T Shift_weather_time
36 2004-03-10 18:00:00 2.6 58.1 10.5 2004-03-11 06:00:00
37 2004-03-10 19:00:00 2.0 59.6 10.2 2004-03-11 07:00:00
38 2004-03-10 20:00:00 2.2 57.4 10.8 2004-03-11 08:00:00
39 2004-03-10 21:00:00 2.2 60.6 10.5 2004-03-11 09:00:00
40 2004-03-10 22:00:00 1.6 58.4 10.8 2004-03-11 10:00:00

The table can be read as follows: the temperature and humidity recorded at 6:00 are paired with the carbon monoxide concentration recorded at 18:00.

Graphical analysis of the relationships – humidity and temperature versus carbon monoxide

The relationships look rather weak.

In [30]:
import matplotlib.pyplot as plt

df5.plot(x='Shift_T', y='CO(GT)', style='o')  
plt.title('Shift_T vs CO(GT)')  
plt.xlabel('Shift_T')  
plt.ylabel('CO(GT)')  
plt.show()
In [31]:
df5.plot(x='Shift_RH', y='CO(GT)', style='o')  
plt.title('Shift_RH vs CO(GT)')  
plt.xlabel('Shift_RH')  
plt.ylabel('CO(GT)')  
plt.show()

Step 5. Building a multiple linear regression model in Sklearn

We declare the X and y variables for the model.

In [32]:
X = df5[['Shift_RH', 'Shift_T']].values
y = df5['CO(GT)'].values

I split the data set into a training part and a test part.

In [33]:
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

I am building a regression model.

In [34]:
regressor = LinearRegression()  
regressor.fit(X_train, y_train)
Out[34]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [35]:
import numpy as np

y_pred = regressor.predict(X_test)
y_pred = np.round(y_pred, decimals=2)

Comparison of the model predictions with the real values.

In [36]:
dfKK = pd.DataFrame({'CO(GT) Actual': y_test, 'CO(GT)_Predicted': y_pred})
dfKK.head(5)
Out[36]:
CO(GT) Actual CO(GT)_Predicted
0 0.5 1.63
1 1.9 1.91
2 3.4 2.40
3 1.2 1.45
4 2.4 2.40
In [37]:
from sklearn import metrics

dfKK.head(50).plot()
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x12f2b9a5898>
In [38]:
from sklearn import metrics

print('Mean Absolute Error:    ', metrics.mean_absolute_error(y_test, y_pred))  
print('Mean Squared Error:     ', metrics.mean_squared_error(y_test, y_pred))  
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Mean Absolute Error:     1.0011099195710456
Mean Squared Error:      1.779567238605898
Root Mean Squared Error: 1.3340042123643756
In [39]:
print('R2 score:               ', metrics.r2_score(y_test, y_pred))
R2 score:                0.15437562015505324
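
Before drawing conclusions it can also help to inspect the fitted coefficients and the score on the training split. This is only a minimal sketch using the regressor fitted above; its output is not reproduced here.

print('Coefficients (Shift_RH, Shift_T):', regressor.coef_)
print('Intercept:                       ', regressor.intercept_)
print('R2 on the training split:        ', regressor.score(X_train, y_train))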

With an R2 of only about 0.15, carbon monoxide contamination cannot be usefully predicted from humidity and temperature alone.
In the next part, we will continue the analysis and preparation of data for linear regression.

Example of the use of shift for linear regression in Python. How to find optimal correlation shift? https://sigmaquality.pl/uncategorized/shift-for-linear-regression/ Thu, 14 Nov 2019 18:10:00 +0000


What is the correlation shift?

In supervised machine learning we have two main directions: classification and regression. Regression needs continuous data, so from time to time we are forced to transform discrete data into continuous values.

More important, however, is to find the linear correlation between the independent variables and the dependent variable that represents the result.

How to find correlation?

In the natural environment everything is correlated with everything else. Rain causes the level of a lake to rise; hot sun causes the level of the lake to fall. These are obvious examples of linear correlation.

But to observe them, a simple correlation coefficient can be insufficient.

The problem is the shift. Rain contributes to a rise of the water in a river, but this rise appears after a couple of hours. The sun makes the water level fall after a couple of days. Frankly speaking, most correlations in real environments have longer or shorter delays.
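
Before the full example below, here is a minimal sketch on synthetic data (not part of the original notebook) showing how a lag can hide a perfect linear relationship from an ordinary correlation:

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
x = pd.Series(rng.rand(200))   # the driving variable, e.g. rainfall
y = 2 * x.shift(3) + 5         # the reaction appears three steps later

print(round(x.corr(y), 3))           # at lag 0 the correlation is close to zero
print(round(x.shift(3).corr(y), 3))  # re-aligned by the true 3-step lag it is exactly 1.0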

How to find correlation shift?

In [1]:
from scipy import signal, fftpack
import pandas as pd
import numpy

Let’s build this dataframe.

In [2]:
AAA = [295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 175, 250,
       190, 265, 205, 280, 220, 295, 235, 310, 250, 325, 265, 340, 280,
       355, 295, 370, 310, 385, 325, 400, 340, 415, 355, 430, 370, 445,
       385, 460, 400, 475, 415, 490, 430, 175, 250, 190, 265, 205, 280,
       220, 295, 235, 310, 250, 325, 265, 340, 280, 355, 295, 370, 310,
       385, 325, 400, 340, 415, 355, 430, 370, 445, 385, 460, 400, 475,
       415, 490, 430, 505, 445, 175, 250, 190, 265, 205, 280, 220, 295,
       235, 310, 250, 325, 265, 340, 280, 355]

BBB = [123, 221, 113, 105, 150, 114, 159, 123, 168, 132, 177, 141, 186,
       150, 195, 159, 204, 168, 213, 177, 222, 186, 231, 195, 240, 204,
       249, 213, 258, 222, 267, 231, 276, 240, 285, 249, 294, 258, 105,
       150, 114, 159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204,
       168, 213, 177, 222, 186, 231, 195, 240, 204, 249, 213, 258, 222,
       267, 231, 276, 240, 285, 249, 294, 258, 303, 267, 105, 150, 114,
       159, 123, 168, 132, 177, 141, 186, 150, 195, 159, 204, 168, 213,
       177, 222, 186, 231, 195, 240, 204, 249]

CCC = [124, 154, 130, 160, 136, 166, 142, 172, 148,  70, 100,  76, 106,
        82, 112,  88, 118,  94, 124, 100, 130, 106, 136, 112, 142, 118,
       148, 124, 154, 130, 160, 136, 166, 142, 172, 148, 178, 154, 184,
       160, 190, 166, 196, 172,  70, 100,  76, 106,  82, 112,  88, 118,
        94, 124, 100, 130, 106, 136, 112, 142, 118, 148, 124, 154, 130,
       160, 136, 166, 142, 172, 148, 178, 154, 184, 160, 190, 166, 196,
       172, 202, 178,  70, 100,  76, 106,  82, 112,  88, 118,  94, 124,
       100, 130, 106, 136, 112, 142, 118, 148]

DDD = [ 437,  453,  764,  346,  239,  420,  600,  456,  636,  492,  672,
        528,  708,  564,  744,  600,  780,  636,  816,  672,  852,  708,
        888,  744,  924,  780,  960,  816,  996,  852, 1032,  888, 1068,
        924, 1104,  960, 1140,  996, 1176, 1032,  420,  600,  456,  636,
        492,  672,  528,  708,  564,  744,  600,  780,  636,  816,  672,
        852,  708,  888,  744,  924,  780,  960,  816,  996,  852, 1032,
        888, 1068,  924, 1104,  960, 1140,  996, 1176, 1032, 1212, 1068,
        420,  600,  456,  636,  492,  672,  528,  708,  564,  744,  600,
        780,  636,  816,  672,  852,  708,  888,  744,  924,  780,  960]

RESULT = [ 35,  50,  38,  53,  41,  56,  44,  59,  47,  62,  50,  65,  53,
        68,  56,  71,  59,  74,  62,  77,  65,  80,  68,  83,  71,  86,
        74,  89,  77,  92,  80,  95,  83,  98,  86,  35,  50,  38,  53,
        41,  56,  44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,
        74,  62,  77,  65,  80,  68,  83,  71,  86,  74,  89,  77,  92,
        80,  95,  83,  98,  86, 101,  89,  35,  50,  38,  53,  41,  56,
        44,  59,  47,  62,  50,  65,  53,  68,  56,  71,  59,  74,  62,
        77,  65,  80,  68,  83,  71,  86,  74]


df = pd.DataFrame({'AAA': AAA, 'BBB': BBB,'CCC':CCC,'DDD':DDD, 'RESULT':RESULT})

df.head()
Out[2]:
AAA BBB CCC DDD RESULT
0 295 123 124 437 35
1 370 221 154 453 50
2 310 113 130 764 38
3 385 105 160 346 53
4 325 150 136 239 41

The phenomena described in the DataFrame are perfectly correlated, but we do not know that yet. First we use the ordinary method of searching for correlation.

In [3]:
corr = df.corr()
corr
Out[3]:
AAA BBB CCC DDD RESULT
AAA 1.000000 0.072278 0.715892 0.206945 -0.261955
BBB 0.072278 1.000000 0.244349 0.748050 0.383326
CCC 0.715892 0.244349 1.000000 0.389072 -0.169164
DDD 0.206945 0.748050 0.389072 1.000000 0.248511
RESULT -0.261955 0.383326 -0.169164 0.248511 1.000000
In [4]:
corr['RESULT']
Out[4]:
AAA      -0.261955
BBB       0.383326
CCC      -0.169164
DDD       0.248511
RESULT    1.000000
Name: RESULT, dtype: float64

Is that all? Is this the entire correlation available for linear regression? How do we find the correlation delay?

Function to find optimal correlation shift

I wrote a dedicated function that detects the shift value giving the maximal linear correlation between the dependent and independent variables.

In [5]:
def cross_corr(x, y, lag=0):
    return x.corr(y.shift(lag))

def shift_Factor(x,y,R):
    x_corr = [cross_corr(x, y, lag=i) for i in range(R)]
    
    # R is the number of shifts that should be checked by the function
    Kot = pd.DataFrame(list(x_corr)).reset_index()
    Kot.rename(columns={0:'Corr', 'index':'Shift_num'}, inplace=True)
    
    # We find optimal correlation shift
    Kot['abs'] = Kot['Corr'].abs()
    SF = Kot.loc[Kot['abs']==Kot['abs'].max(), 'Shift_num']
    p1 = SF.to_frame()
    SF = p1.Shift_num.max()
    
    return SF

We declare the variables passed to the function.

In [6]:
x = df.AAA       # independent variable
y = df.RESULT    # dependent variable
R = 20           # number of shifts that will be checked

Shift for variable AAA

We are looking for the optimal correlation shift for variable AAA.

In [13]:
SKO = shift_Factor(x,y,R)
print('Optimal shift for AAA: ',SKO)
Optimal shift for AAA:  11

The function finds that a shift of 11 rows gives the largest correlation (in absolute value) between the independent variable AAA and the RESULT variable. What is the level of that correlation?

In [8]:
cross_corr(x, y, lag=SKO)
Out[8]:
0.9999999999999996

We create a new DataFrame with the optimal shift.

In [9]:
def df_shif(df, target=None, lag=0):
    if not lag and not target:
        return df       
    new = {}
    for h in df.columns:
        if h == target:
            new[h] = df[target]
        else:
            new[h] = df[h].shift(periods=lag)
    return  pd.DataFrame(data=new)
In [10]:
df2 = df_shif(df, 'AAA', lag=SKO)
df2.rename(columns={'AAA':'SHIFTED AAA'}, inplace=True) 
df2.head(13)
Out[10]:
SHIFTED AAA BBB CCC DDD RESULT
0 295 NaN NaN NaN NaN
1 370 NaN NaN NaN NaN
2 310 NaN NaN NaN NaN
3 385 NaN NaN NaN NaN
4 325 NaN NaN NaN NaN
5 400 NaN NaN NaN NaN
6 340 NaN NaN NaN NaN
7 415 NaN NaN NaN NaN
8 355 NaN NaN NaN NaN
9 430 NaN NaN NaN NaN
10 370 NaN NaN NaN NaN
11 175 123.0 124.0 437.0 35.0
12 250 221.0 154.0 453.0 50.0

Now we repeat these steps for the remaining independent variables.

Shift for variable BBB

In [11]:
BBB = df.BBB       # independent variable
In [14]:
SKS = shift_Factor(BBB,y,R)
print('Optimal shift for BBB: ',SKS)
Optimal shift for BBB:  3
In [16]:
df3 = df_shif(df2, 'BBB', lag=SKS)
df3.rename(columns={'BBB':'SHIFTED BBB'}, inplace=True)

Shift for variable CCC

In [18]:
CCC = df.CCC
In [26]:
SKK = shift_Factor(CCC,y,R)
print('Optimal shift for CCC: ',SKK)
Optimal shift for CCC:  9
In [22]:
df4 = df_shif(df3, 'CCC', lag=SKK)
df4.rename(columns={'CCC':'SHIFTED CCC'}, inplace=True)

Shift for variable DDD

In [23]:
DDD = df.DDD
In [27]:
PKP = shift_Factor(DDD,y,R)
print('Optimal shift for DDD: ',PKP)
Optimal shift for DDD:  5
In [31]:
df5 = df_shif(df4, 'DDD', lag=PKP)
df5.rename(columns={'DDD':'SHIFTED DDD'}, inplace=True)

Correlation after making the shifts

I drop the rows of the DataFrame where NaN values appear and calculate the correlation.

In [33]:
df5 = df5.dropna(how='any')
df5.head(3)
Out[33]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
28 175.0 105.0 70.0 420.0 35.0
29 250.0 150.0 100.0 600.0 50.0
30 190.0 114.0 76.0 456.0 38.0
In [34]:
corr = df5.corr()
corr
Out[34]:
SHIFTED AAA SHIFTED BBB SHIFTED CCC SHIFTED DDD RESULT
SHIFTED AAA 1.0 1.0 1.0 1.0 1.0
SHIFTED BBB 1.0 1.0 1.0 1.0 1.0
SHIFTED CCC 1.0 1.0 1.0 1.0 1.0
SHIFTED DDD 1.0 1.0 1.0 1.0 1.0
RESULT 1.0 1.0 1.0 1.0 1.0
In [35]:
corr['RESULT']
Out[35]:
SHIFTED AAA    1.0
SHIFTED BBB    1.0
SHIFTED CCC    1.0
SHIFTED DDD    1.0
RESULT         1.0
Name: RESULT, dtype: float64

As we can see, the independent variables are perfectly correlated with the result variable. This relationship was hidden because of the shifts.
I hope this convinces researchers to make checking for shifts a routine step of model building.
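
As a quick follow-up (a minimal sketch, not part of the original notebook), fitting an ordinary linear regression on the shifted frame confirms that the relationship is now captured almost perfectly:

from sklearn.linear_model import LinearRegression

X = df5[['SHIFTED AAA', 'SHIFTED BBB', 'SHIFTED CCC', 'SHIFTED DDD']].values
y_res = df5['RESULT'].values

model = LinearRegression().fit(X, y_res)
print('R2 on the shifted data:', model.score(X, y_res))   # expected to be close to 1.0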

(Part 1) Process of forecasting the output electricity for power plant working in combined cycle by the linear regression equation https://sigmaquality.pl/uncategorized/part-1-process-of-forecasting-the-output-electricity-for-power-plant-working-in-combined-cycle-by-the-linear-regression-equation/ Fri, 27 Sep 2019 19:23:00 +0000

Today we find out how to create a simple regression model to predict the level of output electricity of a power plant working in combined cycle.

Source of data: https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

The data set contains 9568 measurements made between 2006 and 2011, while the power plant was working at full capacity. It contains the following process characteristics:

  • Temperature (T) in the range 1.81°C to 37.11°C,
  • Ambient Pressure (AP) in the range 992.89-1033.30 millibar,
  • Relative Humidity (RH) in the range 25.56% to 100.16%,
  • Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg,
  • Net hourly electrical energy output (EP) in the range 420.26-495.76 MW

A combined cycle power plant (CCPP) is composed of gas turbines (GT), steam turbines (ST) and heat recovery steam generators. In a CCPP, the electricity is generated by gas and steam turbines, which are combined in one cycle, and is transferred from one turbine to another. While the vacuum is collected from and has an effect on the steam turbine, the other three ambient variables affect the GT performance.

Process of forecasting by the linear regression model

We open database and useful libraries.

In [1]:
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
import math

## https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

df = pd.read_excel('c:/1/Folds5x2_pp.xlsx')
df.sample(5)
Out[1]:
AT V AP RH PE
5844 20.76 59.21 1017.94 80.84 447.67
8410 7.54 38.56 1016.49 69.10 486.76
7916 14.73 58.20 1019.45 82.48 458.56
2253 14.56 40.69 1015.48 73.73 469.31
6735 14.57 37.86 1022.44 68.82 463.58

We analyze the size and parameters of the data set.

In [2]:
df.shape
Out[2]:
(9568, 5)
In [3]:
df.dtypes
Out[3]:
AT    float64
V     float64
AP    float64
RH    float64
PE    float64
dtype: object

To improve the transparency of the code, we change the names of the columns.

In [4]:
df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']
df.sample(5)
Out[4]:
Temperature Exhaust_Vacuum Ambient_Pressure Relative_Humidity Energy_output
9540 23.53 50.05 1005.63 78.40 443.71
9270 7.49 39.81 1018.20 85.96 483.71
163 21.87 61.45 1011.13 92.22 445.21
3479 21.63 50.16 1005.70 84.84 448.32
774 26.28 75.23 1011.44 68.35 435.28

Basic conditions for a valid linear regression model

Failure to comply with these rules leads to wrong, misleading results from the model; a short sketch of how to check them follows the list.

  1. There is no correlation between the descriptive (independent) variables.
  2. There is a linear relationship between the predictive and result variables.
  3. The errors from the model should have a normal distribution.
  4. Homogeneity of the error variance.
  5. Lack of correlation between errors: the error of one observation should be independent of the errors of other observations.
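
A minimal sketch of how some of these conditions can be checked programmatically (assumed variable names; the detailed checks are carried out in Part 2 of this publication):

import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity']

# 1. Correlation between the independent variables
print(df[predictors].corr())

# 5. Autocorrelation of the errors of a fitted model (Durbin-Watson close to 2 is good)
X = sm.add_constant(df[predictors])
model = sm.OLS(df['Energy_output'], X).fit()
print(durbin_watson(model.resid))

# 1. Collinearity measured with the Variance Inflation Factor
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))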

Analysis of the relationship between the dependent and independent variables

In [5]:
CORREL = df.corr().sort_values('Energy_output')
CORREL['Energy_output']
Out[5]:
Temperature         -0.948128
Exhaust_Vacuum      -0.869780
Relative_Humidity    0.389794
Ambient_Pressure     0.518429
Energy_output        1.000000
Name: Energy_output, dtype: float64

There is a high level of correlation between the result variable, energy output (EP), and the exogenous factors Temperature (T) and Exhaust Vacuum (V).

One factor regression model

In [6]:
lm = smf.ols(formula = 'Energy_output ~ Temperature', data = df).fit()
lm.summary()
Out[6]:
OLS Regression Results
Dep. Variable: Energy_output R-squared: 0.899
Model: OLS Adj. R-squared: 0.899
Method: Least Squares F-statistic: 8.510e+04
Date: Fri, 06 Sep 2019 Prob (F-statistic): 0.00
Time: 13:10:52 Log-Likelihood: -29756.
No. Observations: 9568 AIC: 5.952e+04
Df Residuals: 9566 BIC: 5.953e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 497.0341 0.156 3177.280 0.000 496.727 497.341
Temperature -2.1713 0.007 -291.715 0.000 -2.186 -2.157
Omnibus: 417.457 Durbin-Watson: 2.033
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1117.844
Skew: -0.209 Prob(JB): 1.83e-243
Kurtosis: 4.621 Cond. No. 59.4

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The one-factor regression model has very good predictive properties (R-squared of about 0.90).
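
To read the fitted equation directly (a minimal sketch; the coefficients are taken from the OLS summary above), the model is approximately Energy_output = 497.03 - 2.17 * Temperature, so for example:

# Predicted net energy output at 20 degrees C, using the coefficients from the summary above
intercept, slope = 497.0341, -2.1713
print(intercept + slope * 20)   # about 453.6 MW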

In [7]:
plt.figure()
plt.scatter(df.Temperature, df.Energy_output, c = 'grey')
plt.plot(df.Temperature, lm.params[0] + lm.params[1] * df.Temperature, c = 'r')
plt.xlabel('Temperature')
plt.ylabel('Energy_output')
plt.title("Linear Regression Plot")
Out[7]:
Text(0.5, 1.0, 'Linear Regression Plot')

Multi factor regression model

In [8]:
lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
print (lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:          Energy_output   R-squared:                       0.929
Model:                            OLS   Adj. R-squared:                  0.929
Method:                 Least Squares   F-statistic:                 3.114e+04
Date:                Fri, 06 Sep 2019   Prob (F-statistic):               0.00
Time:                        13:10:53   Log-Likelihood:                -28088.
No. Observations:                9568   AIC:                         5.619e+04
Df Residuals:                    9563   BIC:                         5.622e+04
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept           454.6093      9.749     46.634      0.000     435.500     473.718
Temperature          -1.9775      0.015   -129.342      0.000      -2.007      -1.948
Exhaust_Vacuum       -0.2339      0.007    -32.122      0.000      -0.248      -0.220
Relative_Humidity    -0.1581      0.004    -37.918      0.000      -0.166      -0.150
Ambient_Pressure      0.0621      0.009      6.564      0.000       0.044       0.081
==============================================================================
Omnibus:                      892.002   Durbin-Watson:                   2.033
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             4086.777
Skew:                          -0.352   Prob(JB):                         0.00
Kurtosis:                       6.123   Cond. No.                     2.13e+05
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.13e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

The multi-factor regression model has very good descriptive statistics (R-squared of about 0.93), but the large condition number reported in the warnings raises doubts about multicollinearity.

Analysis of the variables used in the model

In [9]:
df.describe()
Out[9]:
Temperature Exhaust_Vacuum Ambient_Pressure Relative_Humidity Energy_output
count 9568.000000 9568.000000 9568.000000 9568.000000 9568.000000
mean 19.651231 54.305804 1013.259078 73.308978 454.365009
std 7.452473 12.707893 5.938784 14.600269 17.066995
min 1.810000 25.360000 992.890000 25.560000 420.260000
25% 13.510000 41.740000 1009.100000 63.327500 439.750000
50% 20.345000 52.080000 1012.940000 74.975000 451.550000
75% 25.720000 66.540000 1017.260000 84.830000 468.430000
max 37.110000 81.560000 1033.300000 100.160000 495.760000

Analysis of the distribution of the result variable

In [10]:
plt.rcParams['figure.figsize'] = (5, 4)
sns.distplot(df['Energy_output'])    
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0xcb106d8>

Analysis of correlation between independent variables

In [11]:
plt.rcParams['figure.figsize'] = (5, 4)
sns.heatmap (df.corr (), cmap="YlGnBu")
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xcb03c18>

We can easily see that there is a very high positive correlation between the independent variables Temperature (T) and Exhaust Vacuum (V).
Below is another form of the correlation matrix.

In [12]:
sns.heatmap (df.corr (), cmap="coolwarm", annot=True, cbar=False)
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xd5ae438>

We check the completeness of the data.

In [13]:
df.isnull().sum()
Out[13]:
Temperature          0
Exhaust_Vacuum       0
Ambient_Pressure     0
Relative_Humidity    0
Energy_output        0
dtype: int64

Graphical presentation of the relationship between the independent and dependent variables

We divide the result variable into two categories: „High power” and „Low power”, to see how the descriptive variables behave in each group.

In [14]:
Ewa = ['High power', 'Low power']

df['Power'] = pd.qcut(df['Energy_output'],2, labels=Ewa)
df.sample(4)
Out[14]:
Temperature Exhaust_Vacuum Ambient_Pressure Relative_Humidity Energy_output Power
2743 30.20 73.67 1006.31 62.14 428.72 High power
1095 30.33 68.67 1006.00 54.99 435.53 High power
4327 9.15 41.82 1032.88 75.11 477.78 Low power
3457 32.74 68.31 1010.23 41.29 441.66 High power

We create a graphical relationship matrix.

In [15]:
sns.pairplot(data=df[['Temperature' ,'Exhaust_Vacuum','Ambient_Pressure', 'Relative_Humidity', 'Power']], hue='Power', dropna=True, height=2)
Out[15]:
<seaborn.axisgrid.PairGrid at 0xd633240>

The relationship between Temperature (T) and Exhaust Vacuum (V) once again shows a high level of correlation.

In [16]:
sns.jointplot(x='Temperature', y='Exhaust_Vacuum', data=df)
Out[16]:
<seaborn.axisgrid.JointGrid at 0xb9946d8>

(Part 2) Process of forecasting the output electricity for power plant working in combined cycle by the linear regression equation https://sigmaquality.pl/uncategorized/part-2-process-of-forecasting-the-output-electricity-for-power-plant-working-in-combined-cycle-by-the-linear-regression-equation/ Fri, 27 Sep 2019 19:23:00 +0000

A correctly built linear regression model is a mapping of reality. It can be used in diagnostic systems to prevent failures or even installation stoppage. Monitoring the production process with a regression model is an effective form of problem prevention.

In the previous part of this publication I showed how to compose a linear regression model. Now we will concentrate on analysing the variables and the residuals.
Let us recall the fundamental rules of the linear regression model.

Failure to comply with these rules leads to wrong, misleading results from the model.

  1. There is no correlation between the descriptive (independent) variables.
  2. There is a linear relationship between the predictive and result variables.
  3. The errors from the model should have a normal distribution.
  4. Homogeneity of the error variance.
  5. Lack of correlation between errors: the error of one observation should be independent of the errors of other observations.

Parameters of the multi-factor regression model

We load the needed libraries and the database.

In [3]:
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats 
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
import math
## https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

df = pd.read_excel('c:/1/Folds5x2_pp.xlsx')

df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']

We create the multi-factor regression model.

In [4]:
lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
lm.summary()
Out[4]:
OLS Regression Results
Dep. Variable: Energy_output R-squared: 0.929
Model: OLS Adj. R-squared: 0.929
Method: Least Squares F-statistic: 3.114e+04
Date: Fri, 06 Sep 2019 Prob (F-statistic): 0.00
Time: 10:01:48 Log-Likelihood: -28088.
No. Observations: 9568 AIC: 5.619e+04
Df Residuals: 9563 BIC: 5.622e+04
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 454.6093 9.749 46.634 0.000 435.500 473.718
Temperature -1.9775 0.015 -129.342 0.000 -2.007 -1.948
Exhaust_Vacuum -0.2339 0.007 -32.122 0.000 -0.248 -0.220
Relative_Humidity -0.1581 0.004 -37.918 0.000 -0.166 -0.150
Ambient_Pressure 0.0621 0.009 6.564 0.000 0.044 0.081
Omnibus: 892.002 Durbin-Watson: 2.033
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4086.777
Skew: -0.352 Prob(JB): 0.00
Kurtosis: 6.123 Cond. No. 2.13e+05

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.13e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

In [5]:
coef = lm.params.to_dict()

Now we can recall any coefficient of the descriptive variables.

In [6]:
coef['Temperature']
Out[6]:
-1.977513106635385
In [7]:
coef['Intercept']
Out[7]:
454.6092743153056

Displaying the residuals (errors) of the model

The residuals of the model (the prediction errors) are the differences between the empirical values and their theoretical predictions coming from the regression model.

In [8]:
lm.resid.sample(6)
Out[8]:
2922    2.484973
5860    4.214192
2534   -7.031075
5202   -3.479953
3076   -5.572512
7444    4.709399
dtype: float64

Displaying the result of model prediction

We generate predicted values from the model.

In [9]:
lm.predict()
Out[9]:
array([467.26978996, 444.0773659 , 483.56264263, ..., 432.40579787,
       443.03667582, 449.69603741])

We create additional columns in the DataFrame that contain the prediction and the residuals, that is, the differences between the empirical values and their estimates.

In [10]:
df['Predict'] = pd.Series(lm.predict())
df['Resid'] = pd.Series(lm.resid)
df.sample(8)[['Energy_output','Predict','Resid']]
Out[10]:
Energy_output Predict Resid
4620 463.68 468.239190 -4.559190
4330 489.03 481.338666 7.691334
320 455.28 450.579079 4.700921
9108 449.93 450.440091 -0.510091
991 446.80 450.080389 -3.280389
8617 447.22 450.579220 -3.359220
9247 442.73 445.176771 -2.446771
3171 425.14 430.794283 -5.654283

Analysis of the normal distribution of the model residuals

In [11]:
sns.kdeplot(np.array(df['Resid']), bw=10)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0xc705ac8>

On this density plot the worrying feature is the left tail, which is too long.

In [12]:
sns.distplot(np.array(df['Resid']))
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0xbe69550>

Anderson-Darling normality test

The Anderson-Darling test was developed by Theodore Anderson and Donald Darling in 1952. On the plots below, the distribution is normal when the points cover the diagonal line.

In [15]:
sm.qqplot(df['Resid'],color='r')
Out[15]:
In [16]:
import pylab
scipystats.probplot(df['Resid'], dist="norm", plot=pylab)
Out[16]:
((array([-3.79967944, -3.57391754, -3.44994448, ...,  3.44994448,
          3.57391754,  3.79967944]),
  array([-43.43539373, -43.03584359, -38.35687614, ...,  16.8007967 ,
          17.42287291,  17.77771926])),
 (4.509644478448458, 4.154557260475787e-12, 0.989212398876157))

We calculate the value of the Anderson-Darling statistic.

In [18]:
import scipy
scipy.stats.anderson(df['Resid'], dist='norm' ) 
Out[18]:
AndersonResult(statistic=9.20901667254293, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

If the obtained statistic (statistic = 9.209) is higher than the critical value (critical_values = [0.576, 0.656, 0.787, 0.918, 1.092]) for a given significance level (significance_level = [15, 10, 5, 2.5, 1] %), we can reject the null hypothesis that the distribution is normal at that level.

Here the statistic exceeds every critical value, so the Anderson-Darling test says that the residual distribution is not normal.
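
A minimal sketch (not part of the original notebook) of turning that comparison into code:

import scipy.stats

result = scipy.stats.anderson(df['Resid'], dist='norm')
for crit, sig in zip(result.critical_values, result.significance_level):
    if result.statistic > crit:
        print('Reject normality at the', sig, '% level (statistic', round(result.statistic, 2), '>', crit, ')')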

Kolmogorov-Smirnov test

In [19]:
from scipy import stats
stats.kstest(df['Resid'], 'norm')
Out[19]:
KstestResult(statistic=0.3311386887701732, pvalue=0.0)

The p-value is effectively zero, far below any reasonable significance level, which indicates that the distribution of the model residuals is not normal.

We compare the realisation with the prediction.

In [22]:
sns.kdeplot(np.array(df['Energy_output']), bw=10)
sns.distplot(np.array(df['Predict']))
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0xb812198>

Test for homogeneity of the error variance

One of the principles of regression is that the model residuals should have a stable, homogeneous variance. When the model is good, the variances should be homogeneous.

In [24]:
resid = lm.resid
plt.scatter(lm.predict(), resid)
Out[24]:
<matplotlib.collections.PathCollection at 0xba846a0>

The plot shows that part of the errors have a large dispersion; the variance is not entirely homogeneous.

Test for autocorrelation of the model residuals

One of the fundamental principles of the linear regression model is the independence of the observation errors. The residuals of the model should not be correlated with one another. A well-fitted model is assumed when the residuals are mutually independent and their distribution is random, without any pattern.

One method of checking the random character of the residual distribution is based on Pearson's r and checks the autocorrelation among the errors; the Durbin-Watson statistic is used for this purpose. The value for our model, already shown in the standard summary sheet, is about 2.

Source of knowledge: http://www.naukowiec.org/wiedza/statystyka/test-durbina-watsona-niezaleznosc-bledow-obserwacji_423.html

In [26]:
import statsmodels
statsmodels.stats.stattools.durbin_watson(df['Resid'], axis=0)
Out[26]:
2.0329358073274766

The DW test statistic is approximately 2 (we also got this information from the standard summary of the linear regression model). The DW statistic ranges from 0 to 4: when the value approaches 4 there is negative autocorrelation, and when it approaches 0 there are positive correlations among the residuals. A value of 2 indicates that there is no autocorrelation among the errors.

Test of collinearity among predictor variables

One of the fundamental principles of regression is the lack of collinearity among the independent variables. When two predictors are strongly correlated with each other, one of them loses its predictive power to the other. Strong correlation among predictors leads to deterioration of the model parameters.

We found out in the previous part of this publication that there is a large correlation between two variables: Temperature (T) and Exhaust Vacuum (V). Now the question arises which variable we should remove from the model. A good tool is the VIF (Variance Inflation Factor). Thanks to this factor we are able to point out which variable should be eliminated from the model.

In [27]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor

lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
y, X = dmatrices('Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df, return_type = "dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
[43761.15186572733, 5.9776016741918685, 3.9430030401259746, 1.7052900320740478, 1.4526394006715817]

The result, in the form of a vector, lists the variables in the same order as in the model formula; the first value corresponds to the intercept and is normally ignored. A common VIF recommendation says that if the factor assigned to a variable is greater than 5, this variable is highly correlated with the other variables and should be eliminated from the model.

The test shows that the Temperature (T) variable should be removed from the model. This confirms our conclusion from the first part of the investigation that there is detrimental collinearity between the predictors.
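
To make the vector easier to read, the VIF values can be paired with the column names of the design matrix (a minimal sketch; the first entry is the intercept and is normally ignored):

# Pair each VIF value with the corresponding column of the design matrix X
vif_table = pd.DataFrame({'variable': X.columns, 'VIF': vif})
print(vif_table)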

Final conclusion

To use a regression model effectively to describe and then monitor production processes, we need to check thoroughly whether the fundamental principles of regression are met. This time we found out that one of the factors ought to be removed because of predictor collinearity.

After eliminating this variable, all the work should be done once again, as sketched below.

It is also possible to use a one-factor model based on the variable most highly correlated with the result variable.
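
A minimal sketch of the re-estimated model without the Temperature variable (the new statistics are not reproduced here, and all diagnostic checks would have to be repeated):

lm_reduced = smf.ols(
    formula='Energy_output ~ Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure',
    data=df
).fit()
print(lm_reduced.summary())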
