Kruskal-Wallis tests

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html

Some statisticians hold that unless you have a large sample size and can clearly demonstrate that your data are normal, you should routinely use the Kruskal-Wallis test; they consider it risky to use one-way ANOVA, which assumes normality, when you do not know for sure that your data are normal.
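
The choice described above can be sketched in code. The helper `compare_groups` below is hypothetical (it is not part of SciPy): it runs a Shapiro-Wilk test on every group and falls back to Kruskal-Wallis as soon as one group fails it. Keep in mind that with small samples a non-significant Shapiro-Wilk result does not demonstrate normality, so this is only an illustration of the reasoning, not a recommended decision rule. The sample lists are taken from Example 1.

```python
import scipy.stats as ss

def compare_groups(*groups, alpha=0.05):
    """Hypothetical helper: use one-way ANOVA only if no group fails Shapiro-Wilk."""
    looks_normal = True
    for g in groups:
        # Shapiro-Wilk tests H0: "the sample comes from a normal distribution"
        _, p_norm = ss.shapiro(g)
        if p_norm <= alpha:
            looks_normal = False
    if looks_normal:
        stat, p = ss.f_oneway(*groups)   # parametric test, assumes normality
        return 'one-way ANOVA', stat, p
    stat, p = ss.kruskal(*groups)        # rank-based test, no normality assumption
    return 'Kruskal-Wallis', stat, p

GrupA = [57, 65, 50, 45, 70, 62, 48]
GrupB = [72, 81, 64, 55, 90, 38, 75]

name, stat, p = compare_groups(GrupA, GrupB)
print(name, stat, p)
```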

Example 1

There are four cost groups. Test whether the groups differ statistically.

H0: there are no differences between cost groups
H1: there are differences between cost groups

In [1]:
import pandas as pd

GrupA = [57, 65, 50, 45, 70, 62, 48]
GrupB = [72, 81, 64, 55, 90, 38, 75]
GrupC = [35, 42, 58, 59, 46, 60, 61]
GrupD = [73, 85, 92, 68, 82, 94, 66]

df = pd.DataFrame({ 'GrupA': GrupA, 'GrupB':GrupB, 'GrupC':GrupC, 'GrupD':GrupD })
df
Out[1]:
GrupA GrupB GrupC GrupD
0 57 72 35 73
1 65 81 42 85
2 50 64 58 92
3 45 55 59 68
4 70 90 46 82
5 62 38 60 94
6 48 75 61 66
In [2]:
import scipy.stats as ss

H, p = ss.kruskal(df['GrupA'], df['GrupB'], df['GrupC'], df['GrupD'], nan_policy='omit')
print('p-value:     ',p)
print('H statistic: ',H)
p-value:      0.003317738567191764
H statistic:  13.716396903589015

We reject H0 because p ≈ 0.003 is below the significance level of 0.05 (p < 0.05).
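
To make the H statistic less of a black box, here is a minimal sketch that recomputes it by hand for the four cost groups: pool the values, rank them, and apply the standard formula H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), followed by the tie correction and a chi-square approximation with k - 1 degrees of freedom. The variable names are illustrative only.

```python
import numpy as np
import scipy.stats as ss

groups = [
    [57, 65, 50, 45, 70, 62, 48],   # GrupA
    [72, 81, 64, 55, 90, 38, 75],   # GrupB
    [35, 42, 58, 59, 46, 60, 61],   # GrupC
    [73, 85, 92, 68, 82, 94, 66],   # GrupD
]

pooled = np.concatenate(groups)
ranks = ss.rankdata(pooled)          # ranks over the pooled sample (ties get average ranks)
n_total = len(pooled)

# Sum over groups of (rank sum)^2 / group size
term = 0.0
start = 0
for g in groups:
    group_ranks = ranks[start:start + len(g)]
    term += group_ranks.sum() ** 2 / len(g)
    start += len(g)

H_manual = 12.0 / (n_total * (n_total + 1)) * term - 3 * (n_total + 1)
H_manual /= ss.tiecorrect(ranks)     # equals 1.0 here because there are no ties
p_manual = ss.chi2.sf(H_manual, df=len(groups) - 1)

H_scipy, p_scipy = ss.kruskal(*groups)
print(H_manual, H_scipy)
print(p_manual, p_scipy)
```

Both routes should agree with the values printed in cell In [2].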

Example 2

H0: the influenza measurements do not differ statistically between the groups

H1: the influenza measurements differ statistically between the groups

In [6]:
import numpy as np

A = [900, 1200,850, 1320,1400, 1150, 975,np.nan ]
B = [625, 640, 775, 1000,690,  550,  840,750]
C = [415, 400, 420, 560, 780,  620,  800,390]
D = [410, 310, 320, 280, 500,  385,  440,np.nan ]
E = [340, 425, 275, 210, 575,  360, np.nan  ,np.nan]

df = pd.DataFrame({ 'A': A, 'B':B, 'C':C, 'D':D, 'E':E })
df
Out[6]:
A B C D E
0 900.0 625 415 410.0 340.0
1 1200.0 640 400 310.0 425.0
2 850.0 775 420 320.0 275.0
3 1320.0 1000 560 280.0 210.0
4 1400.0 690 780 500.0 575.0
5 1150.0 550 620 385.0 360.0
6 975.0 840 800 440.0 NaN
7 NaN 750 390 NaN NaN

The nan_policy parameter of scipy.stats.kruskal defines how to handle input that contains nan. The following options are available (the default is ‘propagate’):

‘propagate’: returns nan

‘raise’:     throws an error

‘omit’:      performs the calculations ignoring nan values
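
The three options listed above can be compared side by side. This is a small sketch using groups A and D from this example, which both contain one nan; the variable names are illustrative.

```python
import math
import numpy as np
import scipy.stats as ss

a = [900, 1200, 850, 1320, 1400, 1150, 975, np.nan]
d = [410, 310, 320, 280, 500, 385, 440, np.nan]

# 'propagate' (the default): the result itself is nan
H_prop, p_prop = ss.kruskal(a, d, nan_policy='propagate')
print('propagate:', H_prop, p_prop)

# 'raise': a ValueError is raised as soon as a nan is found
try:
    ss.kruskal(a, d, nan_policy='raise')
    raised = False
except ValueError:
    raised = True
print('raise triggered an error:', raised)

# 'omit': nan values are dropped before ranking
H_omit, p_omit = ss.kruskal(a, d, nan_policy='omit')

# Same result as removing the trailing nan by hand
H_clean, p_clean = ss.kruskal(a[:-1], d[:-1])
print('omit:', H_omit, p_omit)
```
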
In [7]:
import scipy.stats as ss

H, p = ss.kruskal(df['B'], df['E'], nan_policy='omit')
print('p-value:     ',p)
print('H statistic: ',H)
p-value:      0.002984914427615507
H statistic:  8.816666666666663

Groups B and E differ statistically because p ≈ 0.003 < 0.05. Note that this call compares only groups B and E, not all five groups.
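
Since the hypotheses refer to all groups, a natural complement is to pass every column to the test at once. The sketch below rebuilds the same DataFrame and unpacks its columns into `ss.kruskal`; the names `H_all` and `p_all` are illustrative. A significant overall result only says that at least one group differs, so pairwise comparisons like the one for B and E are still needed to locate the difference.

```python
import numpy as np
import pandas as pd
import scipy.stats as ss

df = pd.DataFrame({
    'A': [900, 1200, 850, 1320, 1400, 1150, 975, np.nan],
    'B': [625, 640, 775, 1000, 690, 550, 840, 750],
    'C': [415, 400, 420, 560, 780, 620, 800, 390],
    'D': [410, 310, 320, 280, 500, 385, 440, np.nan],
    'E': [340, 425, 275, 210, 575, 360, np.nan, np.nan],
})

# One sample per column; nan padding is ignored thanks to nan_policy='omit'
H_all, p_all = ss.kruskal(*(df[col] for col in df.columns), nan_policy='omit')
print('p-value:     ', p_all)
print('H statistic: ', H_all)
```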

Example 3

Cafazzo et al. (2010) observed a group of free-ranging domestic dogs on the outskirts of Rome. Based on 1815 observations of submissive behavior, they were able to place the dogs in a dominance hierarchy, from the most dominant (Merlino) to the most submissive (Pisola). Because this is a true ranked variable, the Kruskal-Wallis test is the appropriate choice. The mean rank for males (11.1) is lower than the mean rank for females (17.7), and the difference is significant (H = 4.61, 1 df, P = 0.032). In the data below, sex is coded as M for male and K for female.

In [10]:
dog= ['Merlino','Gastone','Pippo','Leon','Golia','Lancillotto','Mamy','Nanà','Isotta','Diana','Simba','Pongo','Semola','Kimba','Morgana','Stella','Jaś','Cucciola','Mammolo','Dotto','Gongolo','Małgosia','Brontolo','Eolo','Mag','Emy','Pisola']
Rang= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]
Seks = ['M','M','M','M','M','M','K','K','K','K','M','M','M','M','K','K','M','M','M','M','M','K','K','K','K','K','K']

H0: males and females have the same distribution of dominance ranks

H1: the distributions of dominance ranks differ between males and females

In [12]:
df2 = pd.DataFrame({ 'Name': dog, 'Sex':Seks, 'Rang':Rang })
df2
Out[12]:
Name Sex Rang
0 Merlino M 1
1 Gastone M 2
2 Pippo M 3
3 Leon M 4
4 Golia M 5
5 Lancillotto M 6
6 Mamy K 7
7 Nanà K 8
8 Isotta K 9
9 Diana K 10
10 Simba M 11
11 Pongo M 12
12 Semola M 13
13 Kimba M 14
14 Morgana K 15
15 Stella K 16
16 Jaś M 17
17 Cucciola M 18
18 Mammolo M 19
19 Dotto M 20
20 Gongolo M 21
21 Małgosia K 22
22 Brontolo K 23
23 Eolo K 24
24 Mag K 25
25 Emy K 26
26 Pisola K 27
In [19]:
K = df2[df2['Sex']=='K']['Rang'].to_list()
M = df2[df2['Sex']=='M']['Rang'].to_list()
In [21]:
H, p = ss.kruskal(K, M, nan_policy='omit')
print('p-value:     ',p)
print('H statistic: ',H)
p-value:      0.03179486110380625
H statistic:  4.609523809523793

We reject H0 in favor of H1: the groups differ from each other because p ≈ 0.032 < 0.05.
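
The mean ranks quoted in the description (11.1 for males, 17.7 for females) can be checked directly from the data. This is a small sketch that rebuilds only the Sex and Rang columns, in the same order as in cell In [10]; the lowercase variable names are illustrative.

```python
import pandas as pd

rang = list(range(1, 28))
# Same sequence of sexes as in the Seks list: M = male, K = female
seks = ['M'] * 6 + ['K'] * 4 + ['M'] * 4 + ['K'] * 2 + ['M'] * 5 + ['K'] * 6

ranks_df = pd.DataFrame({'Sex': seks, 'Rang': rang})

# Mean dominance rank per sex (a lower rank means a more dominant dog)
mean_rank = ranks_df.groupby('Sex')['Rang'].mean()
print(mean_rank.round(1))
```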