```
from statsmodels.compat import lzip
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
```

```
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
```

#### Checking the completeness of the data and their format

```
df.dtypes
```

```
df.isnull().sum()
```

### Stating the research hypotheses

- H0: there is no statistically significant correlation between the increase in PT08.S4(NO2) contamination and atmospheric temperature and pressure
- H1: there is a statistically significant correlation between the increase in PT08.S4(NO2) contamination and atmospheric temperature and pressure

### Checking correlation

Because the data are continuous, I will check the correlation using the Pearson coefficient.

```
import scipy.stats
print("\nAssociation between PT08.S4(NO2) and temperature:")
PKP = scipy.stats.pearsonr(df['PT08.S4(NO2)'], df['T'])
PKP = np.round(PKP, decimals=4)
print(PKP)
```

```
print("\nAssociation between PT08.S4(NO2) and atmospheric pressure:")
PKP = scipy.stats.pearsonr(df['PT08.S4(NO2)'], df['AH'])
PKP = np.round(PKP, decimals=4)
print(PKP)
```

In both cases there is a high, statistically significant correlation (p < 0.01) with the PT08.S4(NO2) pollutant. Knowing the weather forecast (future temperature and atmospheric pressure), we can predict the level of PT08.S4(NO2) contamination. I calculate r² for forecasting based on the correlation with temperature and on the correlation with atmospheric pressure.

```
print ("Predictability of PT08.S4 (NO2) variability based on temperature:", 0.5584 * 0.5584)
```

```
print ("Predictability of PT08.S4 (NO2) variability based on atmospheric pressure:", 0.6303 * 0.6303)
```

### Creating the linear regression model

```
X = df[['AH','T']]
y = df['PT08.S4(NO2)']
model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()
print(model_fit.summary())
```

```
dfM = pd.concat([X, y], axis=1)
```

### Residuals vs Fitted plot

It answers the question of whether the linear model reflects the variability of the process well.

```
# model values
model_y = model_fit.fittedvalues
```

```
plt.style.use('seaborn-v0_8')  # use 'seaborn' on matplotlib < 3.6
plt.rc('font', size=14)
plt.rc('figure', titlesize=18)
plt.rc('axes', labelsize=15)
plt.rc('axes', titlesize=18)
```

```
from statsmodels.graphics.gofplots import ProbPlot
import seaborn as sns

plt.figure()
ax = sns.residplot(x=model_y, y=dfM.columns[-1], data=dfM,
                   lowess=True,
                   scatter_kws={'color': 'grey', 'alpha': 0.4},
                   line_kws={'color': 'yellow', 'lw': 1, 'alpha': 1.0})
ax.set_title('Residuals vs Fitted')
ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
```

One of the assumptions of the OLS model is that the course of the data can be described by a straight line. If this assumption holds, the lowess line through the residuals (yellow) should be relatively flat. The ideal Residuals vs Fitted scatterplot shows no visible patterns, with a straight, horizontal yellow line (random noise).

The arc-shaped line shows that the model failed to capture some nonlinear structure in the data. Perhaps the variability could be captured better with an exponential model, or with another nonlinear transformation implemented through synthetic variables.

### Normal Q-Q plot

If the points lie on the straight reference line, the standardized residuals have a normal distribution.

```
model_residuals = model_fit.resid
model_norm_residuals = model_fit.get_influence().resid_studentized_internal
```

```
QQ = ProbPlot(model_norm_residuals)
pl_B = QQ.qqplot(line='45', alpha=0.5, color='grey', lw=1)
pl_B.axes[0].set_title('Normal Q-Q')
pl_B.axes[0].set_xlabel('Theoretical Quantiles')
pl_B.axes[0].set_ylabel('Standardized Residuals');
abs_norm_resid = np.flip(np.argsort(np.abs(model_norm_residuals)), 0)
abs_norm_resid_top_3 = abs_norm_resid[:3]
for r, i in enumerate(abs_norm_resid_top_3):
    pl_B.axes[0].annotate(i,
                          xy=(np.flip(QQ.theoretical_quantiles, 0)[r],
                              model_norm_residuals[i]))
```

The location of the points indicates that the standardized residuals of the model do not have a perfectly normal distribution. When points lie far from the reference line, the standardized residuals are not normally distributed (the model has "heavy tails").

If the standardized residuals do not have a normal distribution, the model can generate extreme results.
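This visual impression can be backed by a formal normality check such as the Jarque-Bera test from statsmodels (large statistic and small p-value mean non-normal residuals). The arrays below are synthetic stand-ins for the model residuals:

```python
import numpy as np
from statsmodels.stats.stattools import jarque_bera

rng = np.random.default_rng(1)
normal_resid = rng.normal(0, 1, 2000)    # well-behaved stand-in residuals
heavy_resid = rng.standard_t(2, 2000)    # heavy-tailed stand-in residuals

jb_n, p_n, _, _ = jarque_bera(normal_resid)
jb_h, p_h, _, _ = jarque_bera(heavy_resid)
print(round(jb_n, 1), round(jb_h, 1))    # the heavy-tailed sample scores far higher
```

In the notebook, the same test would be run on `model_fit.resid`.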

#### Let's check which observations break the normal distribution

```
dfM.iloc[1610:1690, :].plot(kind='bar',figsize=(16,2))
```

Observation 1646, flagged by the Normal Q-Q plot, clearly differs from the rest of the data. To improve the quality of the regression model, such outliers should be discarded from the data.

### Scale-Location plot

This chart checks for non-constant variance, i.e. heteroscedasticity. Under the homoscedasticity assumption, the probability distribution of y (the response variable) has the same standard deviation regardless of the value of x (the predictor).

```
model_norm_residuals_abs_sqrt = np.sqrt(np.abs(model_norm_residuals))
model_abs_resid = np.abs(model_residuals)
```

```
pl_F = plt.figure()
plt.scatter(model_y, model_norm_residuals_abs_sqrt, color='grey', alpha=0.5)
sns.regplot(x=model_y, y=model_norm_residuals_abs_sqrt,
            scatter=False,
            ci=None,
            lowess=True,
            line_kws={'color': 'yellow', 'lw': 1, 'alpha': 1.0})
pl_F.axes[0].set_title('Scale-Location')
pl_F.axes[0].set_xlabel('Fitted values')
pl_F.axes[0].set_ylabel(r'$\sqrt{|\mathrm{Standardized\ Residuals}|}$')
```

The more horizontal the yellow line, the more likely the data is homoscedastic.

A heteroscedastic chart can be V-shaped, meaning that values on the left and right of the chart are higher than in the middle section. This may be caused by unmodeled nonlinearity and should be investigated further. The two most common methods of correcting for heteroscedasticity are:

- weighted least squares, or
- a heteroscedasticity-corrected covariance matrix (HCCM).

The next chart flagged observation number 1646 as a point disrupting the linear regression model.

### Residuals vs Leverage plot

When standardized residuals are not normally distributed, extreme values of the response y may occur. High-leverage points, in turn, correspond to extreme values of the independent variable x. An extreme x may not seem as harmful, but it can damage the model, because the coefficients β are very sensitive to leverage points. The purpose of the Residuals vs Leverage chart is to identify these problematic observations.

```
model_leverage = model_fit.get_influence().hat_matrix_diag
model_cooks = model_fit.get_influence().cooks_distance[0]
```

```
pl_C = plt.figure(figsize=(16,4));
plt.scatter(model_leverage, model_norm_residuals, color='grey', alpha=0.3);
sns.regplot(x=model_leverage, y=model_norm_residuals,
            scatter=False,
            ci=None,
            lowess=True,
            line_kws={'color': 'blue', 'lw': 1, 'alpha': 1.0})
pl_C.axes[0].set_xlim(0, max(model_leverage)+0.01)
pl_C.axes[0].set_ylim(-3, 5)
pl_C.axes[0].set_title('Residuals vs Leverage')
pl_C.axes[0].set_xlabel('Leverage')
pl_C.axes[0].set_ylabel('Standardized Residuals');
leverage_top_3 = np.flip(np.argsort(model_cooks), 0)[:3]
for i in leverage_top_3:
    pl_C.axes[0].annotate(i,
                          xy=(model_leverage[i],
                              model_norm_residuals[i]))
```

Using Cook's distance, we only need to look for leverage points with a distance greater than 0.5. This 0.5 boundary appears on the chart as a dashed curve at the top and bottom, drawn when such outliers occur. No leverage points in this chart fall outside the 0.5 curve, so there are no outliers in the top right or bottom right of the chart.

The usual procedure with the Residuals vs Leverage chart is to remove the outlying observations from the independent variables and rebuild the model. This procedure improves the model's properties.

The chart showed observation number 5690. Let’s see what the data is.

```
dfM.iloc[5660:5710, :].plot(kind='bar',figsize=(16,2), legend=False)
```

### Influence plot

The influence plot shows the residuals of the model as a function of each observation's leverage, measured with the hat matrix. Externally studentized residuals are scaled according to their standard deviation.

Two influence measures are available: Cook's distance and DFFITS.

```
fig, ax = plt.subplots(figsize=(18,3))
fig = sm.graphics.influence_plot(model_fit, ax=ax, criterion="cooks")
```

Interpretation

The size of the bubbles (barely visible in our chart) encodes Cook's distance: the larger the bubble, the greater the Cook's distance.

#### DFFITS

DFFITS is a diagnostic, proposed in 1980, meant to show how influential a point is in a regression [1]. It is defined as the studentized DFFIT, where DFFIT is the change in the predicted value for a point obtained when that point is left out of the regression.

#### Cook's distance

Cook's distance is a commonly used estimate of a data point's influence when performing least squares regression analysis [1]. In practical ordinary least squares analysis, Cook's distance can be used in several ways:

- to identify influential data points that are particularly worth checking for validity; or
- to indicate regions of the design space where it would be good to obtain more data points.

In other words, Cook's distance measures the influence of each observation on the model, i.e. "what would happen if this observation were not in the model." This matters because it is one way to detect outliers that particularly affect the regression line. If we do not look for and treat potential outliers in our data, the fitted coefficients may not be representative or appropriate, which can lead to incorrect inferences.

#### The hat values

Hat values are the fitted values, i.e. the predictions made by the model for each observation. They are completely different from Cook's distance.

#### H leverage

H leverage measures how each input value of X affects the fitted model. In contrast, Cook's distance also incorporates the influence of the output variable y.