Variance analysis models can be divided into:

one-factor models – the impact of each factor is considered separately, this class of issues is dealt with by one-way analysis of variance
multifactorial models – the impact of various factors is considered together, this class of issues is dealt with by multifactorial analysis of variance.

Source: https://pl.wikipedia.org/wiki/Analiza_wariancji

We check if the insulin level in the blood of patients differs depending on age.

We analyze only one factor because we are interested in the quality of treatment (this is one factor) so we choose one-factor model of ANOVA.

Data source: https://www.kaggle.com/saurabh00007/diabetescsv

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('c:/1/diabetes.csv', usecols=['Insulin','Age'])
df.head(3)

We delete the missing data. The level of insulin in the blood cannot be zero, these records are removed as unsuitable for our tests.

import numpy as np

df['Insulin']=df['Insulin'].replace(0,np.nan)
df = df.dropna(how ='any')
df.isnull().sum()

Insulin    0
Age        0
dtype: int64

We divide patients into four groups depending on their age.

df['Age_G']=pd.qcut(df['Age'], 4,labels=["young","adults","middle-aged","old"])

df.head()

Test ANOVA

import statsmodels.api as sm
from statsmodels.formula.api import ols

model2 = ols('Insulin ~ C(Age_G)', data=df).fit()

anova_table = sm.stats.anova_lm(model2, typ=2)
print(anova_table)

                sum_sq     df         F    PR(>F)
C(Age_G)  2.065342e+05    3.0  5.030065  0.001973
Residual  5.337793e+06  390.0       NaN       NaN

   Multiple Comparison of Means - Tukey HSD,FWER=0.05   
========================================================
   group1      group2   meandiff  lower    upper  reject
--------------------------------------------------------
   adults   middle-aged 38.6753  -4.6784   82.029 False 
   adults       old     54.9994  11.5281  98.4706  True 
   adults      young     4.8819  -37.3082 47.0721 False 
middle-aged     old     16.3241  -27.5906 60.2388 False 
middle-aged    young    -33.7933 -76.4403  8.8536 False 
    old        young    -50.1174 -92.8839  -7.351  True 
--------------------------------------------------------

<matplotlib.collections.LineCollection at 0x1b8e6ec5208>

Value:    0.8514514983093309
p-value:  0.46645004212147656

Value:    0.8003953695297241
p-value:  1.1085818629731166e-21

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e56b8400>

Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

C:ProgramDataAnaconda3libsite-packagesipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.

dtype('int64')

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Interpretation: The P value obtained from ANOVA is significant (P <0.05), so we conclude that there is a statistically significant difference between blood insulin levels between the four age groups listed.

ANOVA have shown that there is a statistically significant difference in insulin levels between the age groups of patients. Unfortunately, ANOVA does not indicate which age groups are significantly different from each other.

To find out age group pairs with significantly different blood insulin levels, multiple pairs comparison analysis (post-hoc comparison) should be performed using Tukey’s HSD test.

Post hoc tests

Post-hoc tests should only be performed if p ANOVA is significant (i.e. greater than 0.05).
Post hoc (post hoc comparisons) – statistical tests performed after obtaining a significant F value after analyzing variance. They show which means differ statistically significantly. If the analysis of variance shows significant differences between the groups under consideration, further tests should be carried out to answer the question of which of the compared populations are responsible for rejecting the null hypothesis. We want to know which of the n averages are different and which are equal. For this purpose we use special post-hoc tests (after the fact), also called multiple comparison tests. The name of these tests results from the fact that we use them only after finding (by analysis of variance) the lack of equality between the means. These tests are also sometimes called homogeneous grouping tests, because after their application we can get medium groups. means belonging to the same group do not differ significantly, while means belonging to different groups will differ significantly (http://home.agh.edu.pl/~bartus/index.php?action=dydaktyka&subaction=statystyka&item=testy_post_hoc)

Tukey HSD test

from statsmodels.stats.multicomp import pairwise_tukeyhsd

m_comp = pairwise_tukeyhsd(endog=df['Insulin'], groups=df['Age_G'], alpha=0.05)
print(m_comp)

   Multiple Comparison of Means - Tukey HSD,FWER=0.05   
========================================================
   group1      group2   meandiff  lower    upper  reject
--------------------------------------------------------
   adults   middle-aged 38.6753  -4.6784   82.029 False 
   adults       old     54.9994  11.5281  98.4706  True 
   adults      young     4.8819  -37.3082 47.0721 False 
middle-aged     old     16.3241  -27.5906 60.2388 False 
middle-aged    young    -33.7933 -76.4403  8.8536 False 
    old        young    -50.1174 -92.8839  -7.351  True 
--------------------------------------------------------

<matplotlib.collections.LineCollection at 0x1b8e6ec5208>

Value:    0.8514514983093309
p-value:  0.46645004212147656

Value:    0.8003953695297241
p-value:  1.1085818629731166e-21

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e56b8400>

Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

C:ProgramDataAnaconda3libsite-packagesipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.

dtype('int64')

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

The Tukey HSD test does not give us information on average location and variance. Fortunately, we have other ways to suspect this. The position of the red vertical line is rather indicative. This confirms the results of the Tukey HSD test.

m_comp.plot_simultaneous()    # Plot group confidence intervals
plt.vlines(x=160,ymin=-0.5,ymax=3.5, color="red")

<matplotlib.collections.LineCollection at 0x1b8e6ec5208>

Value:    0.8514514983093309
p-value:  0.46645004212147656

Value:    0.8003953695297241
p-value:  1.1085818629731166e-21

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e56b8400>

Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

C:ProgramDataAnaconda3libsite-packagesipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.

dtype('int64')

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

The Tukey HSD test showed a statistically significant difference in blood insulin content between the “old” and “young” flus as well as “adults” and “old”
The above results from Tukey HSD suggest that, apart from the old-young and adults-old comparison, all other pair comparisons retain the null hypothesis, i.e. in the other groups there are no significant statistical differences in blood insulin content.

To better understand mutual similarities of variance (or their absence) we can perform a t test. Inside the matrix there are values of p-value coefficients.

T-test

Pair T-test for many comparisons of independent groups. It can be used after parametric ANOVA for pairwise comparisons.

import scikit_posthocs as sp

sp.posthoc_ttest(df, val_col='Insulin', group_col='Age_G', p_adjust='holm')
ttest = sp.posthoc_ttest(df, val_col='Insulin', group_col='Age_G', p_adjust='holm')
ttest

As you can easily see, there is a significant difference between the young -old and adults -old groups (high p-value> 0.05).

where ones indicate which pairs do not have the same average. P-values are replaced with asterisks: – p < 0.05, – p < 0.01, – p < 0.001..

sp.sign_table(ttest)

Checking compliance with ANOVA¶

Conditions:

residual values have a normal distribution (Wilks Shapiro test)
variances in groups are homogeneous (Levene or Bartlett test)
observations are carried out independently of each other

Levene’s test Checking the homogeneity of variance

Null hypothesis: a group from the population have equal variances.

df.head()

DFF = df.reset_index()

KOT = pd.pivot_table(DFF,index='index',columns='Age_G', values='Insulin')

young=KOT['young'].dropna(how='any')
old=KOT['old'].dropna(how='any')
middleaged=KOT['middle-aged'].dropna(how='any')
adults=KOT['adults'].dropna(how='any')

import scipy.stats as stats
w,p = stats.levene(young,old,middleaged,adults)
print("Value:   ",w)
print("p-value: ",p)

Value:    0.8514514983093309
p-value:  0.46645004212147656

Value:    0.8003953695297241
p-value:  1.1085818629731166e-21

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e56b8400>

Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

C:ProgramDataAnaconda3libsite-packagesipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.

dtype('int64')

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 0-5   12-18  188.375  173.7043 203.0456  True 
 0-5   18-24  213.6785 197.715  229.6419  True 
 0-5    6-12  184.4017 169.1811 199.6223  True 
12-18  18-24  25.3035   9.8635  40.7435   True 
12-18   6-12  -3.9733  -18.6439 10.6974  False 
18-24   6-12  -29.2768 -45.2402 -13.3133  True 
-----------------------------------------------

Because the P value is not significant (p> 0.05), we do not reject the null hypothesis – i.e. age groups have uniform variances.

Shapiro-Wilk test Checking normality of residue distribution

Null hypothesis: residual values are normally distributed.

import scipy.stats as stats
w, p = stats.shapiro(model2.resid)
print("Value:   ",w)
print("p-value: ",p)

Value:    0.8003953695297241
p-value:  1.1085818629731166e-21

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e56b8400>

Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

C:ProgramDataAnaconda3libsite-packagesipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.

dtype('int64')

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 0-5   12-18  188.375  173.7043 203.0456  True 
 0-5   18-24  213.6785 197.715  229.6419  True 
 0-5    6-12  184.4017 169.1811 199.6223  True 
12-18  18-24  25.3035   9.8635  40.7435   True 
12-18   6-12  -3.9733  -18.6439 10.6974  False 
18-24   6-12  -29.2768 -45.2402 -13.3133  True 
-----------------------------------------------

Value:    30.276341071035155
p-value:  5.0608857455897e-19

Because the P value is significant (p< 0.05), we reject the null hypothesis – that is, the residual values have a normal distribution.

At the end we can make a chart comparing age groups:

import matplotlib.pyplot as plt
KOT.boxplot(column=['young', 'old', 'middle-aged', 'adults'], grid=False)

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e56b8400>

Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

C:ProgramDataAnaconda3libsite-packagesipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.

dtype('int64')

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 0-5   12-18  188.375  173.7043 203.0456  True 
 0-5   18-24  213.6785 197.715  229.6419  True 
 0-5    6-12  184.4017 169.1811 199.6223  True 
12-18  18-24  25.3035   9.8635  40.7435   True 
12-18   6-12  -3.9733  -18.6439 10.6974  False 
18-24   6-12  -29.2768 -45.2402 -13.3133  True 
-----------------------------------------------

Value:    30.276341071035155
p-value:  5.0608857455897e-19

Value:    0.967054009437561
p-value:  0.0

Analysis of air pollution by PT08.S1 (CO) using ANOVA

H0: Air pollution PT08.S1 (CO) does NOT differ significantly over four daily periods
H1: Air pollution PT08.S1 (CO) significantly differs over four daily periods

We analyze only one factor because we are interested in the quality of treatment (this is one factor) so we choose one-factor model of ANOVA.

data source:

import pandas as pd
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)

df2.dtypes

Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

We process the Time column into a number format.

df2['Time'] = df2.Time.str.slice(0,2)

df2['Time']= df2['Time'].convert_objects(convert_numeric=True) 
df2['Time'].dtype

C:ProgramDataAnaconda3libsite-packagesipykernel_launcher.py:1: FutureWarning: convert_objects is deprecated.  To re-infer data dtypes for object columns, use Series.infer_objects()
For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.

dtype('int64')

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 0-5   12-18  188.375  173.7043 203.0456  True 
 0-5   18-24  213.6785 197.715  229.6419  True 
 0-5    6-12  184.4017 169.1811 199.6223  True 
12-18  18-24  25.3035   9.8635  40.7435   True 
12-18   6-12  -3.9733  -18.6439 10.6974  False 
18-24   6-12  -29.2768 -45.2402 -13.3133  True 
-----------------------------------------------

Value:    30.276341071035155
p-value:  5.0608857455897e-19

Value:    0.967054009437561
p-value:  0.0

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

We check the completeness of the data.

df2.isnull().sum()

Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

I separate four daily periods.

df2['Periods'] = pd.qcut(df2.Time,4,labels=["0-5","6-12","12-18","18-24"])

df2.dtypes

Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

I check daily times.

pd.pivot_table(df2, index='Periods', values='Time', aggfunc=['min', 'max'])

Now we create a pivot table, where the columns will have four daily periods.

PKS = pd.pivot_table(df2, index = 'Date', columns = 'Periods', values='PT08.S1(CO)')
PKS.head(4)

We accept research hypotheses for pollutants with the substance PT08.S1 (CO)

H0: Air pollution PT08.S1 (CO) does NOT differ significantly over four daily periods
H1: Air pollution PT08.S1 (CO) significantly differs over four daily periods

Test ANOVA

I change the name PT08.S1(CO) to PT08S1CO.

df2.rename(columns={'PT08.S1(CO)':'PT08S1CO'},inplace=True)

import statsmodels.api as sm
from statsmodels.formula.api import ols

model3 = ols('PT08S1CO ~ C(Periods)', data=df2).fit()

anova_table = sm.stats.anova_lm(model3, typ=2)
print(anova_table)

                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 0-5   12-18  188.375  173.7043 203.0456  True 
 0-5   18-24  213.6785 197.715  229.6419  True 
 0-5    6-12  184.4017 169.1811 199.6223  True 
12-18  18-24  25.3035   9.8635  40.7435   True 
12-18   6-12  -3.9733  -18.6439 10.6974  False 
18-24   6-12  -29.2768 -45.2402 -13.3133  True 
-----------------------------------------------

Value:    30.276341071035155
p-value:  5.0608857455897e-19

Value:    0.967054009437561
p-value:  0.0

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

Statistics=0.967, p=0.000

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

Interpretation: The P value obtained from ANOVA is significant (P <0.05), so we conclude that there are statistically significant differences in air pollution PT08.S1 (CO) over four daily periods.

ANOVA has shown that there is a statistically significant difference in the level of contamination and the null hypothesis has been rejected in favor of the alternative hypothesis. Unfortunately, ANOVA does not indicate which periods (possibly) differ from each other.
To check and detect statistically significant differences, multiple pairs comparison analysis (post-hoc comparison) should be performed using Tukey’s HSD test.

Tukey HSD test

from statsmodels.stats.multicomp import pairwise_tukeyhsd

m_comp = pairwise_tukeyhsd(endog=df2['PT08S1CO'], groups=df2['Periods'], alpha=0.05)
print(m_comp)

Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 0-5   12-18  188.375  173.7043 203.0456  True 
 0-5   18-24  213.6785 197.715  229.6419  True 
 0-5    6-12  184.4017 169.1811 199.6223  True 
12-18  18-24  25.3035   9.8635  40.7435   True 
12-18   6-12  -3.9733  -18.6439 10.6974  False 
18-24   6-12  -29.2768 -45.2402 -13.3133  True 
-----------------------------------------------

Value:    30.276341071035155
p-value:  5.0608857455897e-19

Value:    0.967054009437561
p-value:  0.0

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

Statistics=0.967, p=0.000

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

Turkey’s HDS test showed that only a comparison of groups 6-12 from 12-18 did not give statistically significant differences in PT08.S1 (CO) contamination.

Checking compliance with ANOVA

Conditions:
1.residual values have a normal distribution (Wilks Shapiro test)

variances in groups are homogeneous (Levene or Bartlett test)
observations are carried out independently of each other

Levene’s test Checking variance homogeneity

Null hypothesis: a group from the population have equal variances.

Now we create a pivot table, where there will be 4 daily periods in columns.

PKS = pd.pivot_table(df2, index = 'Date', columns = 'Periods', values='PT08S1CO')
PKS.head(4)

P01=PKS['0-5'].dropna(how='any')
P02=PKS['6-12'].dropna(how='any')
P03=PKS['12-18'].dropna(how='any')
P04=PKS['18-24'].dropna(how='any')

import scipy.stats as stats
w,p = stats.levene(P01,P02,P03,P04)
print("Value:   ",w)
print("p-value: ",p)

Value:    30.276341071035155
p-value:  5.0608857455897e-19

Value:    0.967054009437561
p-value:  0.0

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

Statistics=0.967, p=0.000

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

Because the P value is significant (p< 0.05), we reject the null hypothesis – i.e. age groups have not uniform variances.

Shapiro-Wilk test Checking the normality of residue distribution

Null hypothesis: residual values are normally distributed.

import scipy.stats as stats
w, p = stats.shapiro(model3.resid)
print("Value:   ",w)
print("p-value: ",np.round(p, decimals=2))

Value:    0.967054009437561
p-value:  0.0

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

Statistics=0.967, p=0.000

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

Because the P value of ZERO is significant because it is smaller than the confidence factor 0.05 (p> 0.05), we reject the null hypothesis – the residual values do not have a normal distribution.

We’ll take a closer look at the rest of the model. Let’s see how it looks in the chart.

x=model3.resid
title = "Residuals"
x_label = "level"
y_label = "probability"

def Dist1(x, ax, title, x_label, y_label):
    x.plot.kde(ax=ax, legend=False)
    ax.set_title(title, color='darkred', alpha=1)
    ax.set_ylabel(y_label, color='grey', alpha=0.6)
    ax.set_xlabel(x_label, color='grey', alpha=0.6)

fig, ax = plt.subplots(figsize=(6, 3))
Dist1(x, ax, title, x_label, y_label)

AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

Statistics=0.967, p=0.000

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

import scipy
scipy.stats.anderson(model3.resid, dist='norm')

AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

This time check the normal distribution of residues using the QQ chart.

Q-Q plot

from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

qqplot(model3.resid, line='s')
pyplot.show()

Statistics=0.967, p=0.000

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

The Shapiro-Wilk test again

made from scipy library.

from scipy.stats import shapiro

stat, p = shapiro(model3.resid)
print('Statistics=

Statistics=0.967, p=0.000

C:ProgramDataAnaconda3libsite-packagesscipystatsmorestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

K ^ 2 D’Agostino tests

The K ^ 2 D’Agostino test (Ralpha D’Agostino) calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the distribution of the data deviates from the normal distribution. This is a simple and commonly used statistical test of normality.

– Skew (skew) is a quantification of how far the distribution is shifted left or right, a measure of asymmetry in the distribution.
– Kurtosis quantifies the distribution of the tail.

from scipy.stats import normaltest

stat, p = normaltest(model3.resid)
print('Statistics=

Statistics=727.597, p=0.000

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

The Shapiro-Wilk test and D’Agostino’s K^2 test indicated a disturbance in the normality of the probability distribution of residual values.

I can’t comment on that.

It turned out that the distribution of residual values from the OLS ANOVA model does not have a normal distribution, which means that we cannot use parametric ANOVA. So we have to use nonparametric methods to reduce variance. I will deal with this in the next post.

Finally a chart comparing the groups.

import matplotlib.pyplot as plt
PKS.boxplot(column=['0-5', '6-12', '12-18', '18-24'], grid=False)

<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>

	young	adults	middle-aged	old
young	-1.000000	0.747592	0.138593	0.023841
adults	0.747592	-1.000000	0.058815	0.006205
middle-aged	0.138593	0.058815	-1.000000	0.746077
old	0.023841	0.006205	0.746077	-1.000000

	Unnamed: 0	Date	Time	CO(GT)	PT08.S1(CO)	C6H6(GT)	PT08.S2(NMHC)	NOx(GT)	PT08.S3(NOx)	NO2(GT)	PT08.S4(NO2)	PT08.S5(O3)	T	RH	AH
0	0	10/03/2004	18.00.00	2.6	1360.0	11.9	1046.0	166.0	1056.0	113.0	1692.0	1268.0	13.6	48.9	0.7578
1	1	10/03/2004	19.00.00	2.0	1292.0	9.4	955.0	103.0	1174.0	92.0	1559.0	972.0	13.3	47.7	0.7255
2	2	10/03/2004	20.00.00	2.2	1402.0	9.0	939.0	131.0	1140.0	114.0	1555.0	1074.0	11.9	54.0	0.7502

Periods	0-5	6-12	12-18	18-24
Date
01/01/2005	1119.166667	968.833333	1186.142857	1182.2
01/02/2005	1018.833333	1372.000000	1413.857143	1152.4
01/03/2005	737.666667	827.833333	826.285714	901.4
01/04/2004	1013.666667	1375.000000	1209.428571	1341.8

Periods	0-5	6-12	12-18	18-24
Date
01/01/2005	1119.166667	968.833333	1186.142857	1182.2
01/02/2005	1018.833333	1372.000000	1413.857143	1152.4
01/03/2005	737.666667	827.833333	826.285714	901.4
01/04/2004	1013.666667	1375.000000	1209.428571	1341.8

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński

ANOVA the one-factor models compatibility of variance

We check if the insulin level in the blood of patients differs depending on age.

Test ANOVA

Post hoc tests

Tukey HSD test

T-test

Checking compliance with ANOVA¶

Levene’s test Checking the homogeneity of variance

Null hypothesis: a group from the population have equal variances.

Shapiro-Wilk test Checking normality of residue distribution

Null hypothesis: residual values are normally distributed.

Analysis of air pollution by PT08.S1 (CO) using ANOVA

We accept research hypotheses for pollutants with the substance PT08.S1 (CO)

Test ANOVA

Tukey HSD test

Checking compliance with ANOVA

Levene’s test Checking variance homogeneity

Null hypothesis: a group from the population have equal variances.

Shapiro-Wilk test Checking the normality of residue distribution

Null hypothesis: residual values are normally distributed.

Q-Q plot

The Shapiro-Wilk test again

K ^ 2 D’Agostino tests

It turned out that the distribution of residual values from the OLS ANOVA model does not have a normal distribution, which means that we cannot use parametric ANOVA. So we have to use nonparametric methods to reduce variance. I will deal with this in the next post.

	Insulin	Age	Age_G
3	94.0	21	young
4	168.0	33	middle-aged
6	88.0	26	adults
8	543.0	53	old
13	846.0	59	old