In the previous part of this publication I showed how to build a linear regression model. Now we will concentrate on the analysis of the variables.
Let's recall the fundamental assumptions of the linear regression model.
Failure to comply with these assumptions leads to wrong, misleading results from the model.
- There is no correlation between the descriptive (independent) variables.
- There is a linear relationship between the predictor variables and the outcome variable.
- The errors from the model should have a normal distribution.
- The error variance is homogeneous (homoscedasticity).
- There is no correlation between errors: the errors from one observation should be independent of the errors from other observations.
Parameters of the multi-factor regression model
We load the needed libraries and the dataset.
import pandas as pd
import numpy as np
import itertools
from itertools import chain, combinations
import statsmodels.formula.api as smf
import scipy.stats as scipystats
import statsmodels.api as sm
import statsmodels.stats.stattools as stools
import statsmodels.stats as stats
from statsmodels.graphics.regressionplots import *
import matplotlib.pyplot as plt
import seaborn as sns
import copy
import math
## https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
df = pd.read_excel('c:/1/Folds5x2_pp.xlsx')
df.columns = ['Temperature', 'Exhaust_Vacuum', 'Ambient_Pressure', 'Relative_Humidity', 'Energy_output']
We create the multi-factor regression model.
lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
lm.summary()
coef = lm.params.to_dict()
Now we can retrieve any coefficient of the descriptive variables.
coef['Temperature']
coef['Intercept']
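As a quick sanity check (a sketch added here, not part of the original text), the coefficients reproduce the model's prediction for any row of the data:
# Sketch: rebuild the prediction for the first row from the coefficients by hand
manual = (coef['Intercept']
          + coef['Temperature'] * df.loc[0, 'Temperature']
          + coef['Exhaust_Vacuum'] * df.loc[0, 'Exhaust_Vacuum']
          + coef['Relative_Humidity'] * df.loc[0, 'Relative_Humidity']
          + coef['Ambient_Pressure'] * df.loc[0, 'Ambient_Pressure'])
print(manual, lm.predict()[0])  # the two values should match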
Displaying the residuals (errors) of the model
The residuals of the model (the prediction errors) are the differences between the empirical values and their theoretical predictions coming from the regression model.
lm.resid.sample(6)
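The residuals can also be reconstructed by hand, as the observed output minus the fitted values (a small verification sketch, not in the original):
# Sketch: residuals should equal the observed values minus the predictions
print(np.allclose(lm.resid, df['Energy_output'] - lm.predict()))  # expected: True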
Displaying the results of the model prediction
We generate predicted values from the model.
lm.predict()
We create additional columns in the DataFrame that contain the predictions and the residuals, that is, the differences between the empirical values and their estimates.
df['Predict'] = pd.Series(lm.predict())
df['Resid'] = pd.Series(lm.resid)
df.sample(8)[['Energy_output','Predict','Resid']]
Analysis of the normal distribution of the model residuals
sns.kdeplot(np.array(df['Resid']), bw=10)
On this density plot the left tail is worrying: it is too long for a normal distribution.
sns.distplot(np.array(df['Resid']))
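To quantify what the plots suggest (an added sketch, not in the original text), we can compute the skewness and excess kurtosis of the residuals; for a normal distribution both are close to 0:
# Sketch: numerical measures of the shape of the residual distribution
print(scipystats.skew(df['Resid']))      # a clearly negative value would confirm the long left tail
print(scipystats.kurtosis(df['Resid']))  # excess kurtosis; 0 for a normal distribution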
Anderson-Darling normality test
The Anderson-Darling test was developed by Theodore Anderson and Donald Darling in 1952. First we inspect a Q-Q plot: the distribution is normal when the points cover the diagonal line.
sm.qqplot(df['Resid'],color='r')
import pylab
scipystats.probplot(df['Resid'], dist="norm", plot=pylab)
We calculate the value of the Anderson-Darling statistic.
import scipy
scipy.stats.anderson(df['Resid'], dist='norm' )
The test returns the statistic together with critical values for the significance levels [15., 10., 5., 2.5, 1.] (in percent). Since the obtained statistic (statistic=9.20901667254293) is higher than all the critical values (critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092])), we are able to reject, at each of these significance levels (significance_level=array([15., 10., 5., 2.5, 1.])), the null hypothesis that the distribution is normal.
The Anderson-Darling test says that the residual distribution is not normal.
Kolmogorov-Smirnov test
from scipy import stats
stats.kstest(df['Resid'], 'norm')
The p-value is far below any conventional significance level, which indicates that the distribution of the model residuals is not a normal distribution.
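One caveat: stats.kstest(df['Resid'], 'norm') compares the residuals against the standard normal N(0, 1), so the rejection partly reflects their scale rather than their shape. A fairer sketch (my addition) standardizes the residuals first:
# Sketch: standardize the residuals before comparing them with the standard normal
resid_std = (df['Resid'] - df['Resid'].mean()) / df['Resid'].std()
stats.kstest(resid_std, 'norm')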
We compare the actual values with the predictions.
sns.kdeplot(np.array(df['Energy_output']), bw=10)
sns.distplot(np.array(df['Predict']))
Test for homogeneity of the error variance
One of the principles of regression is that the model residuals have a stable, homogeneous variance. When the model is good, the variances should be homogeneous.
resid = lm.resid
plt.scatter(lm.predict(), resid)
The plot shows that some of the errors have a large dispersion; the variance is not entirely homogeneous.
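Besides the visual check, a formal test can quantify this. As a sketch (the Breusch-Pagan test, my addition, not part of the original analysis):
# Sketch: Breusch-Pagan test for heteroscedasticity of the residuals
from statsmodels.stats.diagnostic import het_breuschpagan
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(lm.resid, lm.model.exog)
print(lm_pvalue)  # a small p-value indicates non-homogeneous (heteroscedastic) errors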
Test for autocorrelation of the model residuals
One of the fundamental principles of the linear regression model is the independence of the observation errors. The residuals of the model should not be correlated with each other. A well-fitted model assumes that the residuals are mutually independent and their distribution is random, without any pattern.
One method of checking the random character of the residual distribution is a test based on Pearson's r, which checks the autocorrelation among the errors. For this check the Durbin-Watson statistic is used. The value for our model, reported in the standard summary sheet, is 2.
Source: http://www.naukowiec.org/wiedza/statystyka/test-durbina-watsona-niezaleznosc-bledow-obserwacji_423.html
import statsmodels
statsmodels.stats.stattools.durbin_watson(df['Resid'], axis=0)
The statistic of the DW test is equal to 2 (we also got this information from the standard summary sheet of the linear regression model). The DW statistic ranges from 0 to 4. When the value approaches 4, negative autocorrelation exists; when the statistic approaches 0, there are positive correlations among the residuals. A value of 2 indicates that no autocorrelation exists among the errors.
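For reference, the Durbin-Watson statistic is simply the sum of squared differences of consecutive residuals divided by the sum of squared residuals; a minimal sketch computing it by hand:
# Sketch: Durbin-Watson statistic computed directly from its definition
resid_values = df['Resid'].values
dw = np.sum(np.diff(resid_values) ** 2) / np.sum(resid_values ** 2)
print(dw)  # should agree with the statsmodels result above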
Test of collinearity among predictor variables
One of the fundamental principles of regression is the lack of collinearity among the independent variables. When two predictors are strongly correlated with each other, one of the variables loses its predictive power to the other. Strong correlation among predictors leads to a deterioration of the model parameters.
We found out in the previous part of the publication that a high correlation exists between two variables: Temperature (T) and Exhaust Vacuum (V). Now the question arises which variable we should remove from the model. A good tool is the VIF (Variance Inflation Factor). Thanks to this factor we are able to point out which variable should be eliminated from the model.
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
lm = smf.ols(formula = 'Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
y, X = dmatrices('Energy_output ~ Temperature + Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df, return_type = "dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
The result, in the form of a vector, represents the variables in the same order as in the model (the first entry corresponds to the intercept). The usual recommendation for the VIF is that if the factor assigned to a variable is more than 5, this variable is highly correlated with the other variables and should be eliminated from the model.
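To make the order explicit, the VIF values can be paired with the column names (a small convenience sketch, not in the original; the Intercept entry is usually ignored):
# Sketch: label each VIF value with its variable name
print(pd.Series(vif, index=X.columns))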
The test showed that the Temperature (T) variable should be removed from the model. This confirms our conclusion from the first part of the investigation that a detrimental collinearity exists between the predictors.
Final conclusions
To use a regression model effectively to describe and then monitor production processes, we need to check thoroughly whether the fundamental assumptions of regression are met. This time we found out that we ought to remove one of the factors, because predictor collinearity exists.
After eliminating this variable, all the work should be done once again.
It is also possible to use a one-factor model based on the variable that is most highly correlated with the outcome variable.
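As a sketch of that next step (not carried out in the original text), the model can be refitted without the Temperature variable:
# Sketch: refit the model after eliminating the collinear Temperature predictor
lm2 = smf.ols(formula = 'Energy_output ~ Exhaust_Vacuum + Relative_Humidity + Ambient_Pressure', data = df).fit()
lm2.summary()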