Kruskal-Wallis tests
https://sigmaquality.pl/statistics/testy-kruskal-wallis-160320201508/ (Mon, 16 Mar 2020)


https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html

Unless you have a large sample size and can clearly demonstrate that your data are normal, you should routinely use the Kruskal-Wallis test; some statisticians consider it dangerous to use one-way ANOVA, which assumes normality, when you don't know for sure that your data are normal.

Example 1

There are four cost groups; check whether the groups differ statistically.

H0: there are no differences between cost groups
H1: there are differences between cost groups

In [1]:
import pandas as pd

GrupA = [57, 65, 50, 45, 70, 62, 48]
GrupB = [72, 81, 64, 55, 90, 38, 75]
GrupC = [35, 42, 58, 59, 46, 60, 61]
GrupD = [73, 85, 92, 68, 82, 94, 66]

df = pd.DataFrame({ 'GrupA': GrupA, 'GrupB':GrupB, 'GrupC':GrupC, 'GrupD':GrupD })
df
Out[1]:
GrupA GrupB GrupC GrupD
0 57 72 35 73
1 65 81 42 85
2 50 64 58 92
3 45 55 59 68
4 70 90 46 82
5 62 38 60 94
6 48 75 61 66
In [2]:
import scipy.stats as ss

H, p = ss.kruskal(df['GrupA'], df['GrupB'], df['GrupC'], df['GrupD'], nan_policy='omit')
print('p-value:      ',p)
print('H statistics: ',H)
p-value:       0.003317738567191764
H statistics:  13.716396903589015

We reject the null hypothesis H0 because p = 0.003 is less than 0.05 (p < 0.05).
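The Kruskal-Wallis test does not say which of the four groups differ. A minimal sketch of a pairwise post hoc comparison (Conover's test from the scikit-posthocs package, which is also used later on this blog; the Holm adjustment is an assumption here):

import scikit_posthocs as sp

# Pairwise Conover post hoc test on the four cost groups,
# with Holm correction for multiple comparisons
sp.posthoc_conover([GrupA, GrupB, GrupC, GrupD], p_adjust='holm')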

Example 2

H0: influenza incidence does not differ statistically between the groups

H1: influenza incidence differs statistically between the groups

In [6]:
import numpy as np

A = [900, 1200,850, 1320,1400, 1150, 975,np.nan ]
B = [625, 640, 775, 1000,690,  550,  840,750]
C = [415, 400, 420, 560, 780,  620,  800,390]
D = [410, 310, 320, 280, 500,  385,  440,np.nan ]
E = [340, 425, 275, 210, 575,  360, np.nan  ,np.nan]

df = pd.DataFrame({ 'A': A, 'B':B, 'C':C, 'D':D, 'E':E })
df
Out[6]:
A B C D E
0 900.0 625 415 410.0 340.0
1 1200.0 640 400 310.0 425.0
2 850.0 775 420 320.0 275.0
3 1320.0 1000 560 280.0 210.0
4 1400.0 690 780 500.0 575.0
5 1150.0 550 620 385.0 360.0
6 975.0 840 800 440.0 NaN
7 NaN 750 390 NaN NaN

The nan_policy parameter defines how to handle input that contains NaN. The following options are available (default is 'propagate'):

‘propagate’: returns nan

‘raise’:     throws an error

‘omit’:      performs the calculations ignoring nan values
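A small sketch (with hypothetical numbers) of how these options behave:

import numpy as np
import scipy.stats as ss

a = [1.0, 2.0, np.nan]
b = [3.0, 4.0, 5.0]

print(ss.kruskal(a, b))                     # 'propagate' (default): statistic and p-value are nan
print(ss.kruskal(a, b, nan_policy='omit'))  # the nan value is ignored
# ss.kruskal(a, b, nan_policy='raise')      # would raise a ValueError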
In [7]:
import scipy.stats as ss

H, p = ss.kruskal(df['B'], df['E'], nan_policy='omit')
print('p-value:      ',p)
print('H statistics: ',H)
p-value:       0.002984914427615507
H statistics:  8.816666666666663

Groups B and E differ statistically because p < 0.05.

Example 3

Cafazzo et al. (2010) observed a group of free-ranging domestic dogs on the outskirts of Rome. Based on 1,815 observations, they were able to place the dogs in a dominance hierarchy, from the most dominant (Merlino) to the most submissive (Pisola). Because dominance rank is a truly ranked variable, the Kruskal-Wallis test is the appropriate choice. The mean rank for males (11.1) is lower than the mean rank for females (17.7), and the difference is significant (H = 4.61, 1 d.f., P = 0.032).

In [10]:
dog= ['Merlino','Gastone','Pippo','Leon','Golia','Lancillotto','Mamy','Nanà','Isotta','Diana','Simba','Pongo','Semola','Kimba','Morgana','Stella','Jaś','Cucciola','Mammolo','Dotto','Gongolo','Małgosia','Brontolo','Eolo','Mag','Emy','Pisola']
Rang= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]
Seks = ['M','M','M','M','M','M','K','K','K','K','M','M','M','M','K','K','M','M','M','M','M','K','K','K','K','K','K']   # 'M' = male, 'K' = female

H0: both groups are the same

H1: groups are different from each other

In [12]:
df2 = pd.DataFrame({ 'Name': dog, 'Sex':Seks, 'Rang':Rang })
df2
Out[12]:
Name Sex Rang
0 Merlino M 1
1 Gastone M 2
2 Pippo M 3
3 Leon M 4
4 Golia M 5
5 Lancillotto M 6
6 Mamy K 7
7 Nanà K 8
8 Isotta K 9
9 Diana K 10
10 Simba M 11
11 Pongo M 12
12 Semola M 13
13 Kimba M 14
14 Morgana K 15
15 Stella K 16
16 Jaś M 17
17 Cucciola M 18
18 Mammolo M 19
19 Dotto M 20
20 Gongolo M 21
21 Małgosia K 22
22 Brontolo K 23
23 Eolo K 24
24 Mag K 25
25 Emy K 26
26 Pisola K 27
In [19]:
K = df2[df2['Sex']=='K']['Rang'].to_list()
M = df2[df2['Sex']=='M']['Rang'].to_list()
In [21]:
H, p = ss.kruskal(K, M, nan_policy='omit')
print('p-value:      ',p)
print('H statistics: ',H)
p-value:       0.03179486110380625
H statistics:  4.609523809523793

We accept H1: the groups differ from each other, because p < 0.05.

ONE-WAY ANOVA: analysis of the influence of distance from Paris on crime in France in 1833
https://sigmaquality.pl/statistics/one-way-anova-analizy-wplywuodleglosci-od-paryza-na-przestepczosc-we-francji-w-1833-roku-pl190120201944/ (Mon, 20 Jan 2020)

PL190120201944

Guerry, Essay on the Moral Statistics of France (1833)
Source: http://datavis.ca/gallery/guerry/guerrydat.html

1. dept (Num): Department ID. Standard numbers for the departments, except Corsica (200).
2. Region (Char, 1): Region of France ('N' = North, 'S' = South, 'E' = East, 'W' = West, 'C' = Central). Corsica is coded as missing (NA).
3. Department (Char, 25): Department name, as used in 1830. See depts.txt for alternative spellings of department names.
4. Crime_pers (Num): Population per crime against persons. Source: A2 (Compte général, 1825-1830)
5. Crime_prop (Num): Population per crime against property. Source: A2 (Compte général, 1825-1830)
6. Literacy (Num): Percent who can read and write, measured as the percent of military conscripts who can read and write. Source: A2
7. Donations (Num): Donations to the poor. Source: A2 (Bulletin des lois)
8. Infants (Num): Population per illegitimate birth. Source: A2 (Bureau des Longitudes, 1817-1821)
9. Suicides (Num): Population per suicide. Source: A2 (Compte général, 1827-1830)
10. MainCity (Char, 5): Size of the principal city ('1:Sm', '2:Med', '3:Lg'), used as a surrogate for population density. Large refers to the top 10, small to the bottom 10; all the rest are classified as medium. Source: A1
11. Wealth (Rank): Per capita tax on personal property; a ranked index based on taxes on personal and movable property per inhabitant. Source: A1
12. Commerce (Rank): Commerce and industry, measured by the rank of the number of patents / population. Source: A1
13. Clergy (Rank): Distribution of clergy, measured by the rank of the number of Catholic priests in active service / population. Source: A1 (Almanach officiel du clergy, 1829)
14. Crime_parents (Rank): Crimes against parents, measured by the rank of the ratio of crimes against parents to all crimes. Average for the years 1825-1830. Source: A1 (Compte général)
15. Infanticide (Rank): Infanticides per capita; a ranked ratio of the number of infanticides to population. Average for the years 1825-1830. Source: A1 (Compte général)
16. Donation_clergy (Rank): Donations to the clergy; a ranked index of the number of bequests and donations inter vivos to population. Average for the years 1815-1824. Source: A1 (Bull. des lois, ordonn. d'autorisation)
17. Lottery (Rank): Per capita wager on the Royal Lottery; a ranked ratio of the proceeds bet on the royal lottery to population. Average for the years 1822-1826. Source: A1 (Comptes rendus par le ministre des finances)
18. Desertion (Rank): Military desertion; the ratio of the number of young soldiers accused of desertion to the strength of the military contingent, minus the deficit produced by the insufficiency of available billets. Average of the years 1825-1827. Source: A1 (Compte du ministère de la guerre, 1829, état V)
19. Instruction (Rank): Ranks recorded from Guerry's map of instruction. Note: this is inversely related to literacy (as defined here).
20. Prostitutes (Num): Prostitutes in Paris; the number of prostitutes registered in Paris between 1816 and 1834, classified by the department of their birth. Source: Parent-Duchatelet (1836), De la prostitution en Paris
21. Distance (Num): Distance to Paris (km); the distance of each department centroid to the centroid of the Seine (Paris). Source: calculated from department centroids
22. Area (Num): Area (1000 km^2). Source: Angeville (1836)
23. Pop1831 (Num): Population in 1831, in 1000s. Taken from Angeville (1836), Essai sur la statistique de la population française.

In [2]:
import pandas as pd

df = pd.read_csv('c:/1/Guerry.csv', index_col='Department')
del df['Unnamed: 0']
df.head(4)
Out[2]:
dept Region Crime_pers Crime_prop Literacy Donations Infants Suicides MainCity Wealth Crime_parents Infanticide Donation_clergy Lottery Desertion Instruction Prostitutes Distance Area Pop1831
Department
Ain 1 E 28870 15890 37 5098 33120 35039 2:Med 73 71 60 69 41 55 46 13 218.372 5762 346.03
Aisne 2 N 26226 5521 51 8901 14572 12831 2:Med 22 4 82 36 38 82 24 327 65.945 7369 513.00
Allier 3 C 26747 7925 13 10973 17044 114121 2:Med 61 46 42 76 66 16 85 34 161.927 7340 298.26
Basses-Alpes 4 E 12935 7289 46 2733 23018 14238 1:Sm 76 70 12 37 80 32 29 2 351.399 6925 155.90

4 rows × 22 columns

The aim of our study is to check whether the city of Paris has a statistically significant influence on the level of crime against persons (Crime_pers). I will divide all departments into four zones by distance from the city of Paris (Distance) and compare the levels of crime against persons.

– h0: the level (mean) of crime against persons is the same across the zones
– h1: the level (mean) of crime against persons differs statistically between the zones

In [9]:
df['Class_of_Distance']=pd.qcut(df['Distance'], 4,labels=["do 120 km","120-200 km","200-280 km","280-539 km"])
In [12]:
pd.pivot_table(df, index= ['Class_of_Distance'], values= "Distance", aggfunc= ['min','max', 'count'])
Out[12]:
min max count
Distance Distance Distance
Class_of_Distance
do 120 km 0.000 119.718 22
120-200 km 126.378 199.167 21
200-280 km 202.065 283.810 21
280-539 km 291.624 539.213 22

I named these classes only after the computer had determined them. In other words, qcut divided the departments into four equal-sized classes, and I added suitable class names. With the classes in place, I create the data frame needed for the one-way ANOVA.
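A small illustration of how pd.qcut forms equal-count bins (a sketch with hypothetical numbers):

import pandas as pd

# qcut splits by quantiles, so each of the four bins receives
# (roughly) the same number of observations
s = pd.Series([1, 3, 5, 7, 9, 11, 13, 15])
print(pd.qcut(s, 4, labels=['Q1', 'Q2', 'Q3', 'Q4']).value_counts())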

In [14]:
df[['Crime_pers','Class_of_Distance']].head(3)
Out[14]:
Crime_pers Class_of_Distance
Department
Ain 28870 200-280 km
Aisne 26226 do 120 km
Allier 26747 120-200 km

ANOVA test

In [16]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model_H = ols('Crime_pers ~ C(Class_of_Distance)', data=df).fit()

anova_table = sm.stats.anova_lm(model_H, typ=2)
print(anova_table)
                            sum_sq    df          F    PR(>F)
C(Class_of_Distance)  1.408371e+09   3.0  11.392976  0.000003
Residual              3.378877e+09  82.0        NaN       NaN

Interpretation: the P value obtained from the ANOVA analysis is significant (P < 0.05), so we conclude that there are statistically significant differences in the level of crime against persons between the distance zones.

ANOVA showed that there is a statistically significant difference in the level of crime. Unfortunately, ANOVA does not indicate which zones differ significantly from each other.
To find the pairs of zones that differ significantly in the level of crime, a multiple pairwise comparison analysis (post hoc comparison) should be performed using Tukey's HSD test.

Tukey's HSD test

In [18]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

m_comp = pairwise_tukeyhsd(endog=df['Crime_pers'], groups=df['Class_of_Distance'], alpha=0.05)
print(m_comp)
       Multiple Comparison of Means - Tukey HSD,FWER=0.05      
===============================================================
  group1     group2     meandiff     lower      upper    reject
---------------------------------------------------------------
120-200 km 200-280 km   -6203.0   -11398.3884 -1007.6116  True 
120-200 km 280-539 km -10450.2446 -15586.2551 -5314.234   True 
120-200 km do 120 km   -1923.1537  -7059.1642 3212.8569  False 
200-280 km 280-539 km  -4247.2446  -9383.2551  888.766   False 
200-280 km do 120 km   4279.8463   -856.1642  9415.8569  False 
280-539 km do 120 km   8527.0909   3451.1527  13603.0291  True 
---------------------------------------------------------------
In [37]:
import matplotlib.pyplot as plt

m_comp.plot_simultaneous()    # Plot group confidence intervals
plt.vlines(x=20000, ymin=-0.5, ymax=3.5, color="red", alpha=0.8, linestyle='--')
Out[37]:
<matplotlib.collections.LineCollection at 0x14963a6ecc0>

The Tukey HSD test showed a statistically significant difference in crime between three pairs of zones:

– '120-200 km' and '200-280 km'
– '120-200 km' and '280-539 km'
– '280-539 km' and 'do 120 km' (up to 120 km)

The above Tukey HSD results suggest that, apart from the pairs listed, all other pairwise comparisons retain the null hypothesis, i.e. in the remaining zones there are no statistically significant differences in the level of crime against persons.

Checking that the ANOVA assumptions are met

Conditions:

  1. residual values have a normal distribution (Shapiro-Wilk test)
  2. variances in groups are homogeneous (Levene or Bartlett test)
  3. observations are carried out independently of each other

Levene's test: checking the homogeneity of variance

Null hypothesis: the population groups have equal variances.

In [20]:
PKS = pd.pivot_table(df, index = 'Department', columns = 'Class_of_Distance', values='Crime_pers')
PKS.head(4)
Out[20]:
Class_of_Distance do 120 km 120-200 km 200-280 km 280-539 km
Department
Ain NaN NaN 28870.0 NaN
Aisne 26226.0 NaN NaN NaN
Allier NaN 26747.0 NaN NaN
Ardeche NaN NaN 9474.0 NaN
Levene's test does not tolerate empty NaN values.

In [21]:
P01=PKS['do 120 km'].dropna(how='any')
P02=PKS['120-200 km'].dropna(how='any')
P03=PKS['200-280 km'].dropna(how='any')
P04=PKS['280-539 km'].dropna(how='any')

The charts show statistics of the crime level in the four given zones.

In [29]:
PKS.plot.kde()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x149634c7da0>
In [31]:
import matplotlib.pyplot as plt
PKS.boxplot(column=['do 120 km', '120-200 km', '200-280 km', '280-539 km'], grid=False)
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x149637838d0>
In [30]:
import scipy.stats as stats
w,p = stats.levene(P01,P02,P03,P04)
print("Value:   ",w)
print("p-value: ",p)
Value:    1.2094373013491309
p-value:  0.31156953536619264

Because the P value is not significant (p > 0.05), we do not reject the null hypothesis – that is, the distance zones have homogeneous variances.
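The assumptions listed above also allow the Bartlett test; a minimal sketch on the same four groups (scipy.stats.bartlett, with the P01-P04 series defined earlier):

import scipy.stats as stats

# Bartlett's test of equal variances across the four distance zones;
# like Levene's test, the null hypothesis is equal group variances
w, p = stats.bartlett(P01, P02, P03, P04)
print("Value:   ", w)
print("p-value: ", p)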

Shapiro-Wilk test: checking the normality of the residual distribution

Null hypothesis: the residuals are normally distributed.

In [25]:
import scipy.stats as stats
import numpy as np


w, p = stats.shapiro(model_H.resid)
print("Value:   ",w)
print("p-value: ",np.round(p, decimals=2))
Value:    0.9815720915794373
p-value:  0.26

Because the P value is not significant (it is greater than the significance level of 0.05, p > 0.05), there are no grounds to reject the null hypothesis – the residuals have a normal distribution.

Let us take a closer look at the model residuals.

In [26]:
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

qqplot(model_H.resid, line='s')
pyplot.show()

Indeed, the residuals follow an almost perfectly normal distribution.

 PL080120201039 

A graphical example of two-way ANOVA: interaction_plot
https://sigmaquality.pl/anova/graficzny-przyklad-anova-two-way-interaction_plot-pl202001081306/ (Wed, 08 Jan 2020)

PL202001081306

Source of data: http://faculty.webster.edu/woolflm/8canswer.html

The two-way (two-factor) ANOVA model

In this analysis we compare two factors in the same populations.

For the hypotheses H01 and H02 the analysis amounts to two one-way ANOVAs; for H03, however, it combines those two one-way ANOVAs, i.e. it becomes a two-way ANOVA.

A study was conducted to determine the level of life satisfaction. The following null hypotheses were formulated:

  • H0a – the mean life satisfaction is the same for women and men
  • H0b – the mean life satisfaction is the same across the three groups: 'Young Adult', 'Middle Adult', 'Older Adult'
  • H0c – there is no interaction between the groups 'Young Adult', 'Middle Adult', 'Older Adult' and the sex of the subjects that affects the level of life satisfaction

We encode the group names:

  • 'Young Adult' : 1
  • 'Middle Adult': 2
  • 'Older Adult' : 3
In [5]:
import pandas as pd

cc =['Male', 'Male', 'Male', 'Male','Male','Female','Female','Female','Female','Female']
aaa = [4,2,3,4,2,7,4,3,6,5]
bbb = [7,5,7,5,6,8,10,7,7,8]
ccc = [10,7,9,8,11,10,9,12,11,13]
df = pd.DataFrame({'Group': cc, 1:aaa, 2:bbb, 3:ccc })
df
Out[5]:
Group 1 2 3
0 Male 4 7 10
1 Male 2 5 7
2 Male 3 7 9
3 Male 4 5 8
4 Male 2 6 11
5 Female 7 8 10
6 Female 4 10 9
7 Female 3 7 12
8 Female 6 7 11
9 Female 5 8 13

We have already examined this data set for its suitability for ANOVA:

  • the samples were drawn independently from each of the r populations
  • in each of the r populations the distribution is normal with the same variance; the means in the r populations may, but need not, be equal

The ANOVA of this sample is described at the link below:
http://sigmaquality.pl/python/praktyczny-przyklad-wykorzystania-anova-two-way-w-pythonie-pl080120201039/

We transform the dataframe into a form compatible with statsmodels.

In [12]:
df_melt = pd.melt(df.reset_index(),  id_vars=['Group'], value_vars=[1, 2, 3])
df_melt.columns = ['Sex','Age','value']
df_melt.sample(5)
Out[12]:
Sex Age value
9 Female 1 5
3 Male 1 4
0 Male 1 4
20 Male 3 10
28 Female 3 11
In [34]:
df_melt['Age'] = df_melt['Age'].astype('int64')
df_melt.dtypes
Out[34]:
Sex      object
Age       int64
value     int64
dtype: object

Now we create two OLS multiple regression models.

In [49]:
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

Reg1 = ols(formula = "value ~ Age + Sex", data = df_melt)
Fit1 = Reg1.fit()

Reg2 = ols(formula = "value ~ Age*Sex", data = df_melt)
Fit2 = Reg2.fit()
In [50]:
print(df_melt['value'].dtype)
print(df_melt['Age'].dtype)
print(df_melt['Sex'].dtype)
int64
int64
object
In [63]:
from statsmodels.graphics.factorplots import interaction_plot


fig = interaction_plot(df_melt['Age'], df_melt['Sex'],df_melt['value'],
             colors=['black','darkgreen'], markers=['D','x'], linestyles=["-", "--"], ylabel='radość życia', xlabel='Grupa wiekowa')
fig = interaction_plot(df_melt['Age'],df_melt['Sex'], Fit1.fittedvalues,
            colors=['black','darkgreen'], markers=['D','x'],linestyles=["-", "--"], ylabel='radość życia - OLS', xlabel='Grupa wiekowa')
fig = interaction_plot(df_melt['Age'],df_melt['Sex'], Fit2.fittedvalues,
             colors=['black','darkgreen'], markers=['D','x'], linestyles=["-", "--"], ylabel='radość życia - OLS interakcja', xlabel='Grupa wiekowa')

import matplotlib.pyplot as plt
plt.show()

Interpreting the interaction plot

The interaction plot shows how the relationship between one categorical factor (sex) and a continuous response depends on the level of a second categorical factor (age group). The plot displays the means of the response on the y-axis, with a separate line for each level of one factor.

  • Parallel lines indicate no interaction.
  • Non-parallel lines indicate an interaction.

The less parallel the lines are, the stronger the interaction.
Although this plot can be used to display the effects, a proper ANOVA test should be performed to assess their statistical significance. If the interaction effects are significant, the main effects cannot be interpreted without taking the interaction effects into account.
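For completeness, a minimal sketch of that test on the models fitted above (statsmodels' anova_lm applied to Fit1 and Fit2 from the earlier cells; a significant Age:Sex row would indicate an interaction):

from statsmodels.stats.anova import anova_lm

# Type-2 ANOVA table for the additive model (no interaction term)
print(anova_lm(Fit1, typ=2))

# Type-2 ANOVA table for the model with the Age:Sex interaction
print(anova_lm(Fit2, typ=2))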

 PL080120201039 

The Kruskal-Wallis H-test – one-way nonparametric ANOVA
https://sigmaquality.pl/anova/the-one-way-nonparametric-anova-030120201045/ (Fri, 03 Jan 2020)


EN030120201045

In the last post we performed an ANOVA comparison of variance across groups. It turned out, unexpectedly, that one of the conditions of a correct OLS ANOVA model was not met.

ANOVA the one-factor models compatibility of variance 020120200957

The ANOVA model has three basic conditions:

  1. residual values have a normal distribution (Shapiro-Wilk test)
  2. variances in groups are homogeneous (Levene or Bartlett test)
  3. observations are carried out independently of each other.

The OLS ANOVA failed the Shapiro-Wilk test for normality of the residuals as well as the Levene test.
Therefore, parametric ANOVA cannot be used.
Now I will repeat the whole procedure from the previous entry. I will then perform the analysis of variance using a nonparametric test in place of ANOVA.

Analysis of air pollution by PT08.S1 (CO) using ANOVA

  • H0: Air pollution PT08.S1 (CO) does NOT differ significantly over four daily periods
  • H1: Air pollution PT08.S1 (CO) significantly differs over four daily periods

We analyze only one factor, the daily period, so we choose the one-factor model of ANOVA.

data source:

In [1]:
import pandas as pd
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502
In [2]:
df2.dtypes
Out[2]:
Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

We convert the Time column to a numeric format.

In [3]:
df2['Time'] = df2.Time.str.slice(0,2)
In [4]:
df2['Time'] = pd.to_numeric(df2['Time'])   # convert_objects is deprecated; pd.to_numeric is the current equivalent
df2['Time'].dtype
Out[4]:
dtype('int64')

We check the completeness of the data.

In [5]:
df2.isnull().sum()
Out[5]:
Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

I separate four daily periods.

In [6]:
df2['Periods'] = pd.qcut(df2.Time,4,labels=["0-5","6-12","12-18","18-24"])
In [7]:
df2.dtypes
Out[7]:
Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

I check daily times.

In [8]:
pd.pivot_table(df2, index='Periods', values='Time', aggfunc=['min', 'max'])
Out[8]:
min max
Time Time
Periods
0-5 0 5
6-12 6 11
12-18 12 18
18-24 19 23

Now we create a pivot table, where the columns will have four daily periods.

In [9]:
PKS = pd.pivot_table(df2, index = 'Date', columns = 'Periods', values='PT08.S1(CO)')
PKS.head(4)
Out[9]:
Periods 0-5 6-12 12-18 18-24
Date
01/01/2005 1119.166667 968.833333 1186.142857 1182.2
01/02/2005 1018.833333 1372.000000 1413.857143 1152.4
01/03/2005 737.666667 827.833333 826.285714 901.4
01/04/2004 1013.666667 1375.000000 1209.428571 1341.8

We adopt the research hypotheses for pollution with the substance PT08.S1 (CO):

  • H0: Air pollution PT08.S1 (CO) does NOT differ significantly over four daily periods
  • H1: Air pollution PT08.S1 (CO) significantly differs over four daily periods

ANOVA test

I change the name PT08.S1(CO) to PT08S1CO.

In [10]:
df2.rename(columns={'PT08.S1(CO)':'PT08S1CO'},inplace=True)
In [11]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model3 = ols('PT08S1CO ~ C(Periods)', data=df2).fit()

anova_table = sm.stats.anova_lm(model3, typ=2)
print(anova_table)
                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

Interpretation: The P value obtained from ANOVA is significant (P <0.05), so we conclude that there are statistically significant differences in air pollution PT08.S1 (CO) over four daily periods.

Checking that the ANOVA assumptions are met

Conditions:

  1. residual values have a normal distribution (Shapiro-Wilk test)
  2. variances in groups are homogeneous (Levene or Bartlett test)
  3. observations are carried out independently of each other

Levene's test: checking variance homogeneity

Null hypothesis: the population groups have equal variances.

Now we create a pivot table, where there will be 4 daily periods in columns.

In [12]:
PKS = pd.pivot_table(df2, index = 'Date', columns = 'Periods', values='PT08S1CO')
PKS.head(4)
Out[12]:
Periods 0-5 6-12 12-18 18-24
Date
01/01/2005 1119.166667 968.833333 1186.142857 1182.2
01/02/2005 1018.833333 1372.000000 1413.857143 1152.4
01/03/2005 737.666667 827.833333 826.285714 901.4
01/04/2004 1013.666667 1375.000000 1209.428571 1341.8
In [13]:
P01=PKS['0-5'].dropna(how='any')
P02=PKS['6-12'].dropna(how='any')
P03=PKS['12-18'].dropna(how='any')
P04=PKS['18-24'].dropna(how='any')
In [14]:
import scipy.stats as stats
w,p = stats.levene(P01,P02,P03,P04)
print("Value:   ",w)
print("p-value: ",p)
Value:    30.276341071035155
p-value:  5.0608857455897e-19

Because the P value is significant (p < 0.05), we reject the null hypothesis – i.e. the daily periods do not have homogeneous variances!

Shapiro-Wilk test: checking the normality of the residual distribution

Null hypothesis: residual values are normally distributed.
In [15]:
import numpy as np
import scipy.stats as stats

w, p = stats.shapiro(model3.resid)
print("Value:   ",w)
print("p-value: ",np.round(p, decimals=2))
Value:    0.967054009437561
p-value:  0.0
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Because the P value of zero is significant (it is smaller than the significance level of 0.05, p < 0.05), we reject the null hypothesis – the residuals do not have a normal distribution.

We'll take a closer look at the model residuals. Let's see how they look on a chart.

In [16]:
x=model3.resid
title = "Residuals"
x_label = "level"
y_label = "probability"
In [17]:
def Dist1(x, ax, title, x_label, y_label):
    x.plot.kde(ax=ax, legend=False)
    ax.set_title(title, color='darkred', alpha=1)
    ax.set_ylabel(y_label, color='grey', alpha=0.6)
    ax.set_xlabel(x_label, color='grey', alpha=0.6)
In [18]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
Dist1(x, ax, title, x_label, y_label)
In [19]:
import scipy
scipy.stats.anderson(model3.resid, dist='norm')
Out[19]:
AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
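The AndersonResult above is easier to read when the statistic is compared with each critical value explicitly (a short sketch, using model3 from the cells above):

import scipy.stats as stats

res = stats.anderson(model3.resid, dist='norm')
for crit, sig in zip(res.critical_values, res.significance_level):
    # normality is rejected at a given level when the statistic exceeds the critical value
    verdict = 'reject normality' if res.statistic > crit else 'cannot reject'
    print('%5.1f%% level: critical value %.3f -> %s' % (sig, crit, verdict))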

This time we check the normality of the residuals using a Q-Q plot.

Q-Q plot

In [20]:
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

qqplot(model3.resid, line='s')
pyplot.show()

The Shapiro-Wilk test again

This time from the scipy library.

In [21]:
from scipy.stats import shapiro

stat, p = shapiro(model3.resid)
print('Statistics=%.3f, p=%.3f' % (stat, p))
Statistics=0.967, p=0.000
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

The D'Agostino K^2 test

The D'Agostino K^2 test (Ralph D'Agostino) calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the distribution of the data deviates from the normal distribution. It is a simple and commonly used statistical test of normality.

– Skewness quantifies how far the distribution is shifted left or right; it is a measure of asymmetry in the distribution.
– Kurtosis quantifies how much of the distribution is in the tails.

In [22]:
from scipy.stats import normaltest

stat, p = normaltest(model3.resid)
print('Statistics=%.3f, p=%.3f' % (stat, p))
Statistics=727.597, p=0.000

The Shapiro-Wilk test and D'Agostino's K^2 test indicated that the probability distribution of the residual values is not normal.

If ANOVA does not meet the condition for the normal distribution of residual values, a nonparametric test should be performed.

The Kruskal-Wallis H-test

The Kruskal-Wallis H test (sometimes called nonparametric ANOVA) tests the null hypothesis that the population medians of all groups are equal. It is a nonparametric version of ANOVA. The test works on 2 or more independent samples, which may have different sizes. The test does not assume normality of the distributions. It is sometimes considered a nonparametric alternative to one-way analysis of variance between groups. Note that rejecting the null hypothesis does not indicate which group differs; post hoc comparisons between groups are required to determine which groups are different. The function accepts array structures, but not DataFrames.

The null hypothesis is the equality of the distribution functions in the compared populations.

In [23]:
PKS.head(3)
Out[23]:
Periods 0-5 6-12 12-18 18-24
Date
01/01/2005 1119.166667 968.833333 1186.142857 1182.2
01/02/2005 1018.833333 1372.000000 1413.857143 1152.4
01/03/2005 737.666667 827.833333 826.285714 901.4

We delete empty values from the data, otherwise the test will not work.

In [24]:
PKS = PKS.dropna(how='any')
PKS.isnull().sum()
Out[24]:
Periods
0-5      0
6-12     0
12-18    0
18-24    0
dtype: int64
In [25]:
import scipy.stats as ss

H, p = ss.kruskal(PKS['0-5'], PKS['6-12'], PKS['12-18'], PKS['18-24'])
print('p-value: ',p)
p-value:  6.9004301985820776e-77

Interpretation: the p-value is significant (p < 0.05), which means we must reject the null hypothesis that the population medians of all groups are equal.

The medians are not equal!

https://scikit-posthocs.readthedocs.io/en/latest/tutorial/

To find out which time periods (data groups) differ in terms of the median, we need to perform post hoc tests, e.g. Conover's test.

Conover’s test

Conover's test (also known as the Conover-Iman test) is a nonparametric post hoc pairwise test of mean rank sums, used after a Kruskal-Wallis test to determine which groups differ.

In [26]:
import scikit_posthocs as sp
FF = PKS.columns
x = [PKS['0-5'], PKS['6-12'], PKS['12-18'], PKS['18-24']]
CT = sp.posthoc_conover(x, p_adjust = 'holm')
CT
Out[26]:
1 2 3 4
1 -1.000000e+00 2.135208e-58 3.193252e-56 2.384163e-70
2 2.135208e-58 -1.000000e+00 7.132277e-01 1.247363e-01
3 3.193252e-56 7.132277e-01 -1.000000e+00 7.715147e-02
4 2.384163e-70 1.247363e-01 7.715147e-02 -1.000000e+00
In [27]:
sp.sign_table(CT)
Out[27]:
1 2 3 4
1 *** *** ***
2 *** NS NS
3 *** NS NS
4 *** NS NS

P-values are replaced with asterisks: * p < 0.05, ** p < 0.01, *** p < 0.001.

In [28]:
pc = sp.posthoc_conover(x, val_col='values', group_col='groups')
heatmap_args = {'linewidths': 0.25, 'linecolor': '0.5', 'clip_on': False, 'square': True, 'cbar_ax_bbox': [0.80, 0.35, 0.04, 0.3]}
sp.sign_plot(pc, **heatmap_args)
Out[28]:
(<matplotlib.axes._subplots.AxesSubplot at 0x153984ad470>,
 <matplotlib.colorbar.ColorbarBase at 0x1539847bfd0>)

From this it can be seen that only the night period, from 0 to 5 am, differs in PT08.S1 (CO) pollution (it is lower). This can also be seen on the graph.

In [29]:
import matplotlib.pyplot as plt
PKS.boxplot(column=['0-5', '6-12', '12-18', '18-24'], grid=False)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x15398656da0>

ANOVA the one-factor models: compatibility of variance
https://sigmaquality.pl/anova/anova-the-one-factor-models-compatibility-of-variance-020120200957/ (Fri, 03 Jan 2020)

 EN2020120200957 

Analysis of variance, ANOVA (statistical analysis of variance) – a statistical method used to study observations that depend on one or many factors acting simultaneously. This method explains with what probability the extracted factors may be the reason for differences between the observed group means. Analysis of variance was created in the 1920s by Ronald Fisher.

Variance analysis models can be divided into:

  • one-factor models – the impact of each factor is considered separately, this class of issues is dealt with by one-way analysis of variance
  • multifactorial models – the impact of various factors is considered together, this class of issues is dealt with by multifactorial analysis of variance.

Source: https://pl.wikipedia.org/wiki/Analiza_wariancji
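Before turning to the data, a minimal illustration of a one-factor model (a sketch with made-up numbers, using scipy's f_oneway; the samples g1-g3 are hypothetical):

import scipy.stats as stats

# Three hypothetical samples, one per level of a single factor
g1 = [5.1, 4.9, 5.4, 5.0]
g2 = [5.8, 6.1, 5.9, 6.3]
g3 = [5.0, 5.2, 4.8, 5.1]

# One-way ANOVA: does at least one group mean differ?
F, p = stats.f_oneway(g1, g2, g3)
print('F = %.3f, p = %.4f' % (F, p))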

We check whether the blood insulin level of patients differs depending on age.

We analyze only one factor, the age group, so we choose the one-factor model of ANOVA.

Data source: https://www.kaggle.com/saurabh00007/diabetescsv

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('c:/1/diabetes.csv', usecols=['Insulin','Age'])
df.head(3)
Out[1]:
Insulin Age
0 0 50
1 0 31
2 0 32

We delete the missing data. The level of insulin in the blood cannot be zero, so these records are removed as unsuitable for our tests.

In [2]:
import numpy as np

df['Insulin']=df['Insulin'].replace(0,np.nan)
df = df.dropna(how ='any')
df.isnull().sum()
Out[2]:
Insulin    0
Age        0
dtype: int64

We divide patients into four groups depending on their age.

In [3]:
df['Age_G']=pd.qcut(df['Age'], 4,labels=["young","adults","middle-aged","old"])
In [4]:
df.head()
Out[4]:
Insulin Age Age_G
3 94.0 21 young
4 168.0 33 middle-aged
6 88.0 26 adults
8 543.0 53 old
13 846.0 59 old

ANOVA test

In [5]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model2 = ols('Insulin ~ C(Age_G)', data=df).fit()

anova_table = sm.stats.anova_lm(model2, typ=2)
print(anova_table)
                sum_sq     df         F    PR(>F)
C(Age_G)  2.065342e+05    3.0  5.030065  0.001973
Residual  5.337793e+06  390.0       NaN       NaN

Interpretation: The P value obtained from ANOVA is significant (P <0.05), so we conclude that there is a statistically significant difference between blood insulin levels between the four age groups listed.

ANOVA have shown that there is a statistically significant difference in insulin levels between the age groups of patients. Unfortunately, ANOVA does not indicate which age groups are significantly different from each other.

To find the age-group pairs with significantly different blood insulin levels, a multiple pairwise comparison analysis (post hoc comparison) should be performed using Tukey's HSD test.

Post hoc tests

Post hoc tests should only be performed if the ANOVA p-value is significant (i.e. less than 0.05).
Post hoc tests (post hoc comparisons) are statistical tests performed after obtaining a significant F value from the analysis of variance. They show which means differ statistically significantly. If the analysis of variance shows significant differences between the groups under consideration, further tests should be carried out to answer the question of which of the compared populations are responsible for rejecting the null hypothesis. We want to know which of the n means are different and which are equal. For this purpose we use special post hoc ("after the fact") tests, also called multiple comparison tests. The name reflects the fact that we use them only after finding (by analysis of variance) a lack of equality between the means. These tests are also sometimes called homogeneous-grouping tests, because after applying them we can obtain homogeneous groups of means: means belonging to the same group do not differ significantly, while means belonging to different groups differ significantly (http://home.agh.edu.pl/~bartus/index.php?action=dydaktyka&subaction=statystyka&item=testy_post_hoc).

Tukey HSD test

In [40]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

m_comp = pairwise_tukeyhsd(endog=df['Insulin'], groups=df['Age_G'], alpha=0.05)
print(m_comp)
   Multiple Comparison of Means - Tukey HSD,FWER=0.05   
========================================================
   group1      group2   meandiff  lower    upper  reject
--------------------------------------------------------
   adults   middle-aged 38.6753  -4.6784   82.029 False 
   adults       old     54.9994  11.5281  98.4706  True 
   adults      young     4.8819  -37.3082 47.0721 False 
middle-aged     old     16.3241  -27.5906 60.2388 False 
middle-aged    young    -33.7933 -76.4403  8.8536 False 
    old        young    -50.1174 -92.8839  -7.351  True 
--------------------------------------------------------

The Tukey HSD test does not give us information about the location of the means and the variance. Fortunately, we have other ways to inspect this. The position of the red vertical line is indicative only. It confirms the results of the Tukey HSD test.

In [41]:
m_comp.plot_simultaneous()    # Plot group confidence intervals
plt.vlines(x=160,ymin=-0.5,ymax=3.5, color="red")
Out[41]:
<matplotlib.collections.LineCollection at 0x1b8e6ec5208>

The Tukey HSD test showed a statistically significant difference in blood insulin content between the "old" and "young" groups as well as between "adults" and "old".
The above Tukey HSD results suggest that, apart from the old-young and adults-old comparisons, all other pairwise comparisons retain the null hypothesis, i.e. in the remaining groups there are no significant statistical differences in blood insulin content.

To better understand the mutual similarities of the group means (or their absence), we can perform a t-test. Inside the matrix are the p-values.

T-test

Pair T-test for many comparisons of independent groups. It can be used after parametric ANOVA for pairwise comparisons.

In [53]:
import scikit_posthocs as sp

# Pairwise t-tests on insulin by age group, with Holm-adjusted p-values
ttest = sp.posthoc_ttest(df, val_col='Insulin', group_col='Age_G', p_adjust='holm')
ttest
Out[53]:
young adults middle-aged old
young -1.000000 0.747592 0.138593 0.023841
adults 0.747592 -1.000000 0.058815 0.006205
middle-aged 0.138593 0.058815 -1.000000 0.746077
old 0.023841 0.006205 0.746077 -1.000000

As you can easily see, there is a significant difference between the young-old and adults-old groups (p < 0.05).

Asterisks indicate which pairs do not have the same mean. P-values are replaced with asterisks: * p < 0.05, ** p < 0.01, *** p < 0.001.

In [54]:
sp.sign_table(ttest)
Out[54]:
young adults middle-aged old
young NS NS *
adults NS NS **
middle-aged NS NS NS
old * ** NS

Checking that the ANOVA assumptions are met

Conditions:

  1. residual values have a normal distribution (Shapiro-Wilk test)
  2. variances in groups are homogeneous (Levene or Bartlett test)
  3. observations are carried out independently of each other

Levene's test: checking the homogeneity of variance

Null hypothesis: the population groups have equal variances.

In [7]:
df.head()
Out[7]:
Insulin Age Age_G
3 94.0 21 young
4 168.0 33 middle-aged
6 88.0 26 adults
8 543.0 53 old
13 846.0 59 old
In [8]:
DFF = df.reset_index()  
In [9]:
KOT = pd.pivot_table(DFF,index='index',columns='Age_G', values='Insulin')
In [10]:
young=KOT['young'].dropna(how='any')
old=KOT['old'].dropna(how='any')
middleaged=KOT['middle-aged'].dropna(how='any')
adults=KOT['adults'].dropna(how='any')
In [11]:
import scipy.stats as stats
w,p = stats.levene(young,old,middleaged,adults)
print("Value:   ",w)
print("p-value: ",p)
Value:    0.8514514983093309
p-value:  0.46645004212147656

Because the P value is not significant (p> 0.05), we do not reject the null hypothesis – i.e. age groups have uniform variances.

Shapiro-Wilk test: checking the normality of the residual distribution

Null hypothesis: residual values are normally distributed.
In [12]:
import scipy.stats as stats
w, p = stats.shapiro(model2.resid)
print("Value:   ",w)
print("p-value: ",p)
Value:    0.8003953695297241
p-value:  1.1085818629731166e-21

Because the P value is significant (p < 0.05), we reject the null hypothesis – that is, the residuals do not have a normal distribution.

At the end we can make a chart comparing age groups:

In [13]:
import matplotlib.pyplot as plt
KOT.boxplot(column=['young', 'old', 'middle-aged', 'adults'], grid=False)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b8e56b8400>

Analysis of air pollution by PT08.S1 (CO) using ANOVA

  • H0: Air pollution PT08.S1 (CO) does NOT differ significantly over four daily periods
  • H1: Air pollution PT08.S1 (CO) significantly differs over four daily periods

We analyze only one factor, the daily period, so we choose the one-factor model of ANOVA.

data source:

In [14]:
import pandas as pd
df2 = pd.read_csv('c:/TF/AirQ_filled.csv')
df2.head(3)
Out[14]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502
In [15]:
df2.dtypes
Out[15]:
Unnamed: 0         int64
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

We convert the Time column to a numeric format.

In [16]:
df2['Time'] = df2.Time.str.slice(0,2)
In [17]:
df2['Time'] = pd.to_numeric(df2['Time'])   # convert_objects is deprecated; pd.to_numeric is the current equivalent
df2['Time'].dtype
Out[17]:
dtype('int64')

We check the completeness of the data.

In [18]:
df2.isnull().sum()
Out[18]:
Unnamed: 0       0
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64

I separate four daily periods.

In [19]:
df2['Periods'] = pd.qcut(df2.Time,4,labels=["0-5","6-12","12-18","18-24"])
In [20]:
df2.dtypes
Out[20]:
Unnamed: 0          int64
Date               object
Time                int64
CO(GT)            float64
PT08.S1(CO)       float64
C6H6(GT)          float64
PT08.S2(NMHC)     float64
NOx(GT)           float64
PT08.S3(NOx)      float64
NO2(GT)           float64
PT08.S4(NO2)      float64
PT08.S5(O3)       float64
T                 float64
RH                float64
AH                float64
Periods          category
dtype: object

I check daily times.

In [21]:
pd.pivot_table(df2, index='Periods', values='Time', aggfunc=['min', 'max'])
Out[21]:
min max
Time Time
Periods
0-5 0 5
6-12 6 11
12-18 12 18
18-24 19 23

Now we create a pivot table, where the columns will have four daily periods.

In [22]:
PKS = pd.pivot_table(df2, index = 'Date', columns = 'Periods', values='PT08.S1(CO)')
PKS.head(4)
Out[22]:
Periods 0-5 6-12 12-18 18-24
Date
01/01/2005 1119.166667 968.833333 1186.142857 1182.2
01/02/2005 1018.833333 1372.000000 1413.857143 1152.4
01/03/2005 737.666667 827.833333 826.285714 901.4
01/04/2004 1013.666667 1375.000000 1209.428571 1341.8

We adopt the research hypotheses for pollution with the substance PT08.S1 (CO):

  • H0: Air pollution PT08.S1 (CO) does NOT differ significantly over four daily periods
  • H1: Air pollution PT08.S1 (CO) significantly differs over four daily periods

ANOVA test

I change the name PT08.S1(CO) to PT08S1CO.

In [23]:
df2.rename(columns={'PT08.S1(CO)':'PT08S1CO'},inplace=True)
In [24]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

model3 = ols('PT08S1CO ~ C(Periods)', data=df2).fit()

anova_table = sm.stats.anova_lm(model3, typ=2)
print(anova_table)
                  sum_sq      df           F  PR(>F)
C(Periods)  6.715672e+07     3.0  545.269232     0.0
Residual    3.839796e+08  9353.0         NaN     NaN

Interpretation: The P value obtained from ANOVA is significant (P <0.05), so we conclude that there are statistically significant differences in air pollution PT08.S1 (CO) over four daily periods.

ANOVA has shown that there is a statistically significant difference in the level of contamination, and the null hypothesis has been rejected in favor of the alternative hypothesis. Unfortunately, ANOVA does not indicate which periods differ from each other.
To detect the statistically significant differences, a multiple pairwise comparison analysis (post hoc comparison) should be performed using Tukey's HSD test.

Tukey HSD test

In [25]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

m_comp = pairwise_tukeyhsd(endog=df2['PT08S1CO'], groups=df2['Periods'], alpha=0.05)
print(m_comp)
Multiple Comparison of Means - Tukey HSD,FWER=0.05
===============================================
group1 group2 meandiff  lower    upper   reject
-----------------------------------------------
 0-5   12-18  188.375  173.7043 203.0456  True 
 0-5   18-24  213.6785 197.715  229.6419  True 
 0-5    6-12  184.4017 169.1811 199.6223  True 
12-18  18-24  25.3035   9.8635  40.7435   True 
12-18   6-12  -3.9733  -18.6439 10.6974  False 
18-24   6-12  -29.2768 -45.2402 -13.3133  True 
-----------------------------------------------

Tukey's HSD test showed that only the comparison of groups 6-12 and 12-18 did not give statistically significant differences in PT08.S1 (CO) contamination.

Checking that the ANOVA assumptions are met

Conditions:

  1. residual values have a normal distribution (Shapiro-Wilk test)
  2. variances in groups are homogeneous (Levene or Bartlett test)
  3. observations are carried out independently of each other

Levene's test: checking variance homogeneity

Null hypothesis: the population groups have equal variances.

Now we create a pivot table, where there will be 4 daily periods in columns.

In [26]:
PKS = pd.pivot_table(df2, index = 'Date', columns = 'Periods', values='PT08S1CO')
PKS.head(4)
Out[26]:
Periods 0-5 6-12 12-18 18-24
Date
01/01/2005 1119.166667 968.833333 1186.142857 1182.2
01/02/2005 1018.833333 1372.000000 1413.857143 1152.4
01/03/2005 737.666667 827.833333 826.285714 901.4
01/04/2004 1013.666667 1375.000000 1209.428571 1341.8
In [27]:
P01=PKS['0-5'].dropna(how='any')
P02=PKS['6-12'].dropna(how='any')
P03=PKS['12-18'].dropna(how='any')
P04=PKS['18-24'].dropna(how='any')
In [28]:
import scipy.stats as stats
w,p = stats.levene(P01,P02,P03,P04)
print("Value:   ",w)
print("p-value: ",p)
Value:    30.276341071035155
p-value:  5.0608857455897e-19

Because the P value is significant (p < 0.05), we reject the null hypothesis – i.e. the daily periods do not have homogeneous variances!

Shapiro-Wilk test: checking the normality of the residual distribution

Null hypothesis: residual values are normally distributed.
In [29]:
import scipy.stats as stats
w, p = stats.shapiro(model3.resid)
print("Value:   ",w)
print("p-value: ",np.round(p, decimals=2))
Value:    0.967054009437561
p-value:  0.0
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

Because the P value of zero is significant (it is smaller than the significance level of 0.05, p < 0.05), we reject the null hypothesis – the residuals do not have a normal distribution.

We'll take a closer look at the model residuals. Let's see how they look on a chart.

In [30]:
x=model3.resid
title = "Residuals"
x_label = "level"
y_label = "probability"
In [31]:
def Dist1(x, ax, title, x_label, y_label):
    x.plot.kde(ax=ax, legend=False)
    ax.set_title(title, color='darkred', alpha=1)
    ax.set_ylabel(y_label, color='grey', alpha=0.6)
    ax.set_xlabel(x_label, color='grey', alpha=0.6)
In [32]:
fig, ax = plt.subplots(figsize=(6, 3))
Dist1(x, ax, title, x_label, y_label)
In [33]:
import scipy
scipy.stats.anderson(model3.resid, dist='norm')
Out[33]:
AndersonResult(statistic=85.34820578111066, critical_values=array([0.576, 0.656, 0.787, 0.918, 1.092]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))

This time we check the normality of the residuals using a Q-Q plot.

Q-Q plot

In [34]:
from statsmodels.graphics.gofplots import qqplot
from matplotlib import pyplot

qqplot(model3.resid, line='s')
pyplot.show()

The Shapiro-Wilk test again

This time from the scipy library.

In [35]:
from scipy.stats import shapiro

stat, p = shapiro(model3.resid)
print('Statistics=%.3f, p=%.3f' % (stat, p))
Statistics=0.967, p=0.000
C:\ProgramData\Anaconda3\lib\site-packages\scipy\stats\morestats.py:1653: UserWarning: p-value may not be accurate for N > 5000.
  warnings.warn("p-value may not be accurate for N > 5000.")

The D'Agostino K^2 test

The D'Agostino K^2 test (Ralph D'Agostino) calculates summary statistics from the data, namely kurtosis and skewness, to determine whether the distribution of the data deviates from the normal distribution. It is a simple and commonly used statistical test of normality.

– Skewness quantifies how far the distribution is shifted left or right; it is a measure of asymmetry in the distribution.
– Kurtosis quantifies how much of the distribution is in the tails.
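These two summary statistics can also be computed directly from the model residuals (a sketch using scipy.stats, with model3 from the cells above):

from scipy.stats import skew, kurtosis

# Skewness and (excess) kurtosis of the ANOVA residuals;
# values near zero are expected for a normal distribution
print('skew:     %.3f' % skew(model3.resid))
print('kurtosis: %.3f' % kurtosis(model3.resid))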

In [36]:
from scipy.stats import normaltest

stat, p = normaltest(model3.resid)
print('Statistics=%.3f, p=%.3f' % (stat, p))
Statistics=727.597, p=0.000

The Shapiro-Wilk test and D'Agostino's K^2 test indicated that the probability distribution of the residual values is not normal.

There is no getting around this.

It turned out that the distribution of the residuals from the OLS ANOVA model is not normal, which means that we cannot use parametric ANOVA. We therefore have to use nonparametric methods of analysis of variance. I will deal with this in the next post.

Finally a chart comparing the groups.

In [37]:
import matplotlib.pyplot as plt
PKS.boxplot(column=['0-5', '6-12', '12-18', '18-24'], grid=False)
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x1b8e6a87668>
