LINEAR REGRESSION DIAGNOSTICS part 1.

In [1]:
from statsmodels.compat import lzip
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
In [2]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[2]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

Checking the completeness of the data and the column types

In [3]:
df.dtypes
Out[3]:
Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object
In [4]:
df.isnull().sum()
Out[4]:
Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

I state the research hypotheses:

  • H0: there is no statistically significant correlation between PT08.S4(NO2) contamination and atmospheric temperature and pressure
  • H1: there is a statistically significant correlation between PT08.S4(NO2) contamination and atmospheric temperature and pressure

Checking correlation

Because the data are continuous, I will check the correlation using the Pearson coefficient.

In [5]:
import scipy

print("nAssociation between PT08.S4(NO2) and temperature: ")
PKP = scipy.stats.pearsonr(df['PT08.S4(NO2)'], df['T'])
PKP = np.round(PKP, decimals=4)
print(PKP)
Association between PT08.S4(NO2) and temperature: 
[0.5584 0.    ]
In [6]:
print("nAssociation between PT08.S4(NO2) and atmospheric pressure: ")
PKP = scipy.stats.pearsonr(df['PT08.S4(NO2)'], df['AH'])
PKP = np.round(PKP, decimals=4)
print(PKP)
Association between PT08.S4(NO2) and atmospheric pressure: 
[0.6303 0.    ]

In both cases there is a statistically significant correlation (p < 0.01) with the pollutant PT08.S4(NO2). Knowing the weather forecast (future temperature and atmospheric pressure), we can predict the level of PT08.S4(NO2) contamination. I calculate r² for each predictor by squaring the corresponding Pearson coefficient: once for temperature and once for atmospheric pressure.

In [7]:
print ("Predictability of PT08.S4 (NO2) variability based on temperature:", 0.5584 * 0.5584)
Predictability of PT08.S4 (NO2) variability based on temperature: 0.31181056
In [8]:
print ("Predictability of PT08.S4 (NO2) variability based on atmospheric pressure:", 0.6303 * 0.6303)
Predictability of PT08.S4 (NO2) variability based on atmospheric pressure: 0.39727809

Creating the linear regression model

In [9]:
X = df[['AH','T']]
y = df['PT08.S4(NO2)']

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

print(model_fit.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           PT08.S4(NO2)   R-squared:                       0.434
Model:                            OLS   Adj. R-squared:                  0.434
Method:                 Least Squares   F-statistic:                     3592.
Date:                Mon, 30 Dec 2019   Prob (F-statistic):               0.00
Time:                        12:47:40   Log-Likelihood:                -65354.
No. Observations:                9357   AIC:                         1.307e+05
Df Residuals:                    9354   BIC:                         1.307e+05
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        864.5223      7.468    115.767      0.000     849.884     879.161
AH           397.6058      8.829     45.032      0.000     380.298     414.913
T             10.0438      0.405     24.789      0.000       9.250      10.838
==============================================================================
Omnibus:                     1306.783   Durbin-Watson:                   0.293
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2042.908
Skew:                           0.982   Prob(JB):                         0.00
Kurtosis:                       4.175   Cond. No.                         77.3
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [10]:
dfM = pd.concat([X, y], axis=1)

Residuals vs Fitted plot

It answers the question of whether the linear model captures the variability of the process well.

In [11]:
# model values
model_y = model_fit.fittedvalues
In [12]:
plt.style.use('seaborn')

plt.rc('font', size=14)
plt.rc('figure', titlesize=18)
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=18)
In [13]:
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
import seaborn as sns

pl_A = plt.figure()
ax_A = sns.residplot(model_y, dfM.columns[-1], data=dfM,
                     lowess=True,
                     scatter_kws={'color': 'grey', 'alpha': 0.4},
                     line_kws={'color': 'yellow', 'lw': 1, 'alpha': 1.0})

ax_A.set_title('Residuals vs Fitted')
ax_A.set_xlabel('Fitted values')
ax_A.set_ylabel('Residuals')
Out[13]:
Text(0, 0.5, 'Residuals')

One of the assumptions of the OLS model is that the data can be described by a straight line. If this assumption holds, the lowess line fitted to the residuals versus the fitted values (yellow line) should be relatively flat. The ideal Residuals vs Fitted scatterplot shows no visible patterns, and the yellow line is straight and horizontal (random noise).
Here the arc-shaped line shows that the model did not capture some nonlinear structure in the data. Perhaps the variability can be captured better with an exponential model or another nonlinear transformation obtained by adding synthetic variables.
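
As an illustration only, here is a minimal sketch of one such transformation, adding a squared temperature term as a synthetic variable; the choice of T² is an assumption for demonstration, not a claim that it is the right transformation for this data.

# Hypothetical nonlinear extension: add T squared as a synthetic variable
X_nl = df[['AH', 'T']].copy()
X_nl['T_sq'] = X_nl['T'] ** 2                      # synthetic variable
model_nl = sm.OLS(y, sm.add_constant(X_nl)).fit()  # refit with the extra term
print(model_nl.rsquared)                           # compare with the original R-squared (0.434)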

Normal Q-Q Plot

If the points lie on the straight 45-degree reference line, the standardized residuals have a normal distribution.

In [14]:
model_residuals = model_fit.resid
model_norm_residuals = model_fit.get_influence().resid_studentized_internal
In [15]:
QQ = ProbPlot(model_norm_residuals)
pl_B = QQ.qqplot(line='45', alpha=0.5, color='grey', lw=1)
pl_B.axes[0].set_title('Normal Q-Q')
pl_B.axes[0].set_xlabel('Theoretical Quantiles')
pl_B.axes[0].set_ylabel('Standardized Residuals');

abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    pl_B.axes[0].annotate(i,
                               xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                                   model_norm_residuals[i]))

The location of the points indicates that the standardized residuals of the model do not have a perfectly normal distribution. When the points lie far from the reference line, the standardized residuals are not normally distributed (the model has “heavy tails”).
If the standardized residuals do not have a normal distribution, the model can generate extreme predictions.
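
As a quick numerical check of this visual impression (a sketch, not part of the original analysis), a normality test can be run on the standardized residuals:

from scipy import stats

# D'Agostino-Pearson normality test; a small p-value indicates departure from normality
stat, p_value = stats.normaltest(model_norm_residuals)
print("statistic:", np.round(stat, 4), " p-value:", np.round(p_value, 4))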

Let’s check which observations distort the normal distribution

In [16]:
dfM.iloc[1610:1690, :].plot(kind='bar',figsize=(16,2))
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x22e708e1da0>

Clearly, observation 1646, flagged by the Normal Q-Q plot, differs significantly from the rest of the data. To improve the quality of the regression model, such outliers should be discarded from the data, as sketched below.
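
A minimal sketch of that step, assuming index 1646 flagged above is the row to drop (illustrative only, not the original author's code):

# Drop the flagged observation and refit to see whether the fit improves
dfM_clean = dfM.drop(index=1646)
model_clean = sm.OLS(dfM_clean['PT08.S4(NO2)'],
                     sm.add_constant(dfM_clean[['AH', 'T']])).fit()
print(model_clean.rsquared)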

Scale-Location plot

This chart checks for non-constant variance, i.e. heteroscedasticity. Under the homoscedasticity assumption, the probability distribution of y (the outcome variable) has the same standard deviation regardless of the value of x (the predictor).

In [17]:
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
model_abs_resid = np.abs(model_residuals)
In [18]:
pl_F = plt.figure()
plt.scatter(model_y, model_norm_residuals_abs_sqrt, color='grey', alpha=0.5);
sns.regplot(model_y, model_norm_residuals_abs_sqrt,
              scatter=False,
              ci=False,
              lowess=True,
              line_kws={'color': 'yellow', 'lw': 1, 'alpha': 1.0});
pl_F.axes[0].set_title('Scale-Location')
pl_F.axes[0].set_xlabel('Fitted values')
pl_F.axes[0].set_ylabel(r'$\sqrt{|Standardized\ Residuals|}$');

The more horizontal the yellow line, the more likely it is that the data are homoscedastic.
A heteroscedastic chart can be V-shaped, meaning that the values on the left and right of the chart are higher than in the middle section. This may result from nonlinearity that the model did not capture and should be investigated further. The two most common methods of dealing with heteroscedasticity, sketched below, are:

  • weighted least squares, or
  • a heteroscedasticity-corrected covariance matrix (HCCM).
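
A sketch of both options in statsmodels; the 1/fitted-values weighting below is only an illustrative choice, not a recommendation:

# (1) Heteroscedasticity-consistent (robust) covariance matrix, HC3
model_hc3 = sm.OLS(y, sm.add_constant(X)).fit(cov_type='HC3')
print(model_hc3.bse)        # robust standard errors

# (2) Weighted least squares with illustrative weights
weights = 1.0 / model_fit.fittedvalues
model_wls = sm.WLS(y, sm.add_constant(X), weights=weights).fit()
print(model_wls.params)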

This chart also flags observation number 1646 as a point that disrupts the linear regression model.

Residuals vs Leverage plot

When the standardized residuals do not have a normal distribution, extreme values of the outcome y may occur. High-leverage points correspond to extreme values of the independent variables x. Extreme x values may not look harmful by themselves, but they can have a detrimental effect on the model because the coefficients β are very sensitive to leverage points. The purpose of the Residuals vs Leverage chart is to identify these problematic observations.

In [19]:
model_leverage = model_fit.get_influence().hat_matrix_diag
model_cooks = model_fit.get_influence().cooks_distance[0]
In [20]:
pl_C = plt.figure(figsize=(16,4));
plt.scatter(model_leverage, model_norm_residuals, color='grey', alpha=0.3);
sns.regplot(model_leverage, model_norm_residuals,
              scatter=False,
              ci=False,
              lowess=True,
              line_kws={'color': 'blue', 'lw': 1, 'alpha': 1.0});
pl_C.axes[0].set_xlim(0, max(model_leverage)+0.01)
pl_C.axes[0].set_ylim(-3, 5)
pl_C.axes[0].set_title('Residuals vs Leverage')
pl_C.axes[0].set_xlabel('Leverage')
pl_C.axes[0].set_ylabel('Standardized Residuals');


leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]
for i in leverage_top_3:
    pl_C.axes[0].annotate(i,
                                xy=(model_leverage[i],
                                    model_norm_residuals[i]));

Using Cook's distance, we only need to find leverage points whose distance is greater than 0.5. The 0.5 contour is shown on the graph as a dashed curve at the top and bottom, and it appears only when such outliers occur. In this chart there are no leverage points outside the 0.5 contour, so there are no influential outliers in the top right or bottom right of the chart.
The usual procedure with the Residuals vs Leverage chart is to remove the flagged outliers from the data and rebuild the model, as in the sketch below. This procedure improves the model's properties.
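
A sketch of that procedure, using the common 4/n rule of thumb as the cutoff (the threshold choice here is an assumption; the 0.5 contour mentioned above is a stricter criterion):

n = len(dfM)
influential = np.where(model_cooks > 4 / n)[0]          # indices above the threshold
print("observations above 4/n:", len(influential))

dfM_reduced = dfM.drop(index=dfM.index[influential])    # remove them and rebuild the model
model_reduced = sm.OLS(dfM_reduced['PT08.S4(NO2)'],
                       sm.add_constant(dfM_reduced[['AH', 'T']])).fit()
print(model_reduced.rsquared)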

The chart flagged observation number 5690. Let’s see what that data looks like.

In [21]:
dfM.iloc[5660:5710, :].plot(kind='bar',figsize=(16,2), legend=False)
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x22e71f28cf8>

Influence plot

The influence plot shows the model's residuals as a function of the leverage of each observation, measured via the hat matrix. Externally studentized residuals are residuals scaled by their standard deviation.
Two measures of influence are available: Cook's distance and DFFITS.

In [22]:
fig, ax = plt.subplots(figsize=(18,3))
fig = sm.graphics.influence_plot(model_fit, ax=ax, criterion="cooks")

Interpretation

The size of the bubbles (hard to see in our chart) represents Cook's distance: the larger the bubble, the greater the Cook's distance.

DFFITS

DFFITS is a diagnostic, proposed in 1980, intended to show how much influence a point has in a statistical regression [1]. It is defined as the studentized DFFIT, where DFFIT is the change in the predicted value for a point obtained when that point is left out of the regression.
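
In formula form (the standard definition, added here for reference):

$$\mathrm{DFFITS}_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{s_{(i)}\sqrt{h_{ii}}}$$

where $\hat{y}_{i(i)}$ is the prediction for observation i from a model fitted without observation i, $s_{(i)}$ is the residual standard deviation estimated without observation i, and $h_{ii}$ is the leverage of observation i.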

Cook's distance

Cook's distance is a commonly used estimate of the influence of a data point when performing least squares regression analysis [1]. In practical ordinary least squares analysis, Cook's distance can be used in several ways:

  • to identify influential data points that are particularly worth checking for validity; or
  • to indicate regions of the design space where it would be good to obtain more data points.

This means that Cook's distance measures the influence of each observation on the model, or “what would happen if a given observation were not in the model”. This matters because it is one way to detect outliers that particularly affect the regression line. If we do not look for and handle potential outliers in our data, the fitted coefficients may not be the most representative or appropriate, which may lead to incorrect inferences.
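
Both measures can be read directly from the influence object already used above (a sketch; the 4/n cutoff is a common rule of thumb, not part of the original analysis):

influence = model_fit.get_influence()
cooks_d = influence.cooks_distance[0]     # Cook's distance per observation
dffits_vals = influence.dffits[0]         # DFFITS per observation

n = int(model_fit.nobs)
print("max Cook's distance:", np.round(cooks_d.max(), 4))
print("points with Cook's distance > 4/n:", int((cooks_d > 4 / n).sum()))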

The hat values

Hat values (ŷ) are the fitted values, i.e. the predictions made by the model for each observation. They are something completely different from Cook's distance.

H leverage

H leverage measures how much each observation's input values X affect the fitted model. In contrast, Cook's distance also incorporates the influence of the output variable y.
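
For reference, the standard definitions connecting these quantities (not stated in the original text): the fitted values are obtained from the hat matrix, the leverage of observation i is its diagonal element, and Cook's distance combines leverage with the studentized residual:

$$\hat{y} = Hy, \qquad H = X (X^{\top}X)^{-1} X^{\top}, \qquad D_i = \frac{r_i^{2}}{p}\cdot\frac{h_{ii}}{1-h_{ii}},$$

where $r_i$ is the internally studentized residual and $p$ is the number of estimated parameters.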