https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kruskal.html
Some analysts argue that unless you have a large sample size and can clearly demonstrate that your data are normal, you should routinely use the Kruskal–Wallis test; in their view, it is dangerous to use one-way ANOVA, which assumes normality, when you do not know for sure that your data are normal.
Example 1
There are four cost groups. Test whether the groups differ statistically.
H0: there are no differences between cost groups
H1: there are differences between cost groups
import pandas as pd
GrupA = [57, 65, 50, 45, 70, 62, 48]
GrupB = [72, 81, 64, 55, 90, 38, 75]
GrupC = [35, 42, 58, 59, 46, 60, 61]
GrupD = [73, 85, 92, 68, 82, 94, 66]
df = pd.DataFrame({ 'GrupA': GrupA, 'GrupB':GrupB, 'GrupC':GrupC, 'GrupD':GrupD })
df
import scipy.stats as ss
H, p = ss.kruskal(df['GrupA'], df['GrupB'], df['GrupC'], df['GrupD'], nan_policy='omit')
print('p-value: ',p)
print('H statistic: ', H)
We reject H0 because p ≈ 0.003, which is below 0.005 (and therefore also below the conventional 0.05 significance level).
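To see where the H statistic comes from, the following sketch recomputes it by hand from the pooled ranks and compares the result with `ss.kruskal`. It assumes there are no tied values (which holds for these four groups), so no tie correction is applied. The names `groups`, `rank_term`, `H_manual`, and `p_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
import scipy.stats as ss

groups = {
    'GrupA': [57, 65, 50, 45, 70, 62, 48],
    'GrupB': [72, 81, 64, 55, 90, 38, 75],
    'GrupC': [35, 42, 58, 59, 46, 60, 61],
    'GrupD': [73, 85, 92, 68, 82, 94, 66],
}

# Pool all observations and rank them from smallest (rank 1) to largest.
pooled = np.concatenate(list(groups.values()))
ranks = ss.rankdata(pooled)
N = len(pooled)

# Accumulate R_i^2 / n_i, where R_i is the rank sum of group i.
rank_term = 0.0
start = 0
for values in groups.values():
    n_i = len(values)
    R_i = ranks[start:start + n_i].sum()
    rank_term += R_i ** 2 / n_i
    start += n_i

# Kruskal-Wallis statistic without tie correction (valid here: no ties).
H_manual = 12.0 / (N * (N + 1)) * rank_term - 3 * (N + 1)

# Under H0, H is approximately chi-square distributed with k - 1 degrees of freedom.
p_manual = ss.chi2.sf(H_manual, df=len(groups) - 1)

H_scipy, p_scipy = ss.kruskal(*groups.values())
print('manual:', H_manual, p_manual)
print('scipy: ', H_scipy, p_scipy)
```

Both lines should print the same statistic and p-value, which shows that the test depends only on the ranks of the observations and not on their raw values.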
Example 2
H0: influenza does not differ statistically between the groups
H1: influenza differs statistically between the groups
import numpy as np
A = [900, 1200,850, 1320,1400, 1150, 975,np.nan ]
B = [625, 640, 775, 1000,690, 550, 840,750]
C = [415, 400, 420, 560, 780, 620, 800,390]
D = [410, 310, 320, 280, 500, 385, 440,np.nan ]
E = [340, 425, 275, 210, 575, 360, np.nan ,np.nan]
df = pd.DataFrame({ 'A': A, 'B':B, 'C':C, 'D':D, 'E':E })
df
The nan_policy parameter of kruskal defines how to handle input that contains NaN. The following options are available (default is ‘propagate’):
‘propagate’: returns nan
‘raise’: throws an error
‘omit’: performs the calculations ignoring nan values
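The three options can be compared on a small pair of samples. This is a minimal sketch: the values in `x` and `y` are made up for illustration and are not taken from the examples in this document.

```python
import numpy as np
import scipy.stats as ss

# Illustrative data (assumed, not from the examples); x contains one NaN.
x = [1.2, 3.4, 2.2, np.nan, 5.1]
y = [4.8, 6.0, 7.3, 5.9]

# 'propagate' (the default): a NaN in the input yields NaN results.
H_prop, p_prop = ss.kruskal(x, y, nan_policy='propagate')
print('propagate:', H_prop, p_prop)

# 'raise': a NaN in the input raises a ValueError.
try:
    ss.kruskal(x, y, nan_policy='raise')
    raised = False
except ValueError as err:
    raised = True
    print('raise:', err)

# 'omit': NaN values are dropped before the ranks are computed.
H_omit, p_omit = ss.kruskal(x, y, nan_policy='omit')
print('omit:', H_omit, p_omit)

# Dropping the NaN by hand should give the same result as 'omit'.
x_clean = [v for v in x if not np.isnan(v)]
H_clean, p_clean = ss.kruskal(x_clean, y)
print('manual drop:', H_clean, p_clean)
```

With the default 'propagate', a single missing value silently turns the whole result into NaN, which is why the examples here pass nan_policy='omit' explicitly.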
import scipy.stats as ss
H, p = ss.kruskal(df['B'], df['E'], nan_policy='omit')
print('p-value: ',p)
print('H statistic: ', H)
We reject H0: groups B and E differ significantly because p < 0.005.
Example 3
Cafazzo et al. (2010) observed a group of free-ranging domestic dogs on the outskirts of Rome. Based on 1815 observations, they were able to place the dogs in a dominance hierarchy, from the most dominant (Merlino) to the most submissive (Pisola). Because this is a true ranked variable, it is necessary to use the Kruskal–Wallis test. The mean rank for males (11.1) is lower than the mean rank for females (17.7), and the difference is significant (H = 4.61, 1 df, P = 0.032).
dog= ['Merlino','Gastone','Pippo','Leon','Golia','Lancillotto','Mamy','Nanà','Isotta','Diana','Simba','Pongo','Semola','Kimba','Morgana','Stella','Jaś','Cucciola','Mammolo','Dotto','Gongolo','Małgosia','Brontolo','Eolo','Mag','Emy','Pisola']
Rang= [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27]
Seks = ['M','M','M','M','M','M','K','K','K','K','M','M','M','M','K','K','M','M','M','M','M','K','K','K','K','K','K']
H0: the two groups (males, coded 'M', and females, coded 'K') have the same distribution of ranks
H1: the two groups differ from each other
df2 = pd.DataFrame({ 'Name': dog, 'Sex':Seks, 'Rang':Rang })
df2
K = df2[df2['Sex']=='K']['Rang'].to_list()
M = df2[df2['Sex']=='M']['Rang'].to_list()
H, p = ss.kruskal(K, M, nan_policy='omit')
print('p-value: ',p)
print('H statistic: ', H)
We reject H0 in favor of H1: the groups differ from each other because p < 0.05 (p ≈ 0.032).
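The mean ranks quoted in the description above (11.1 for males, 17.7 for females) can be checked directly from the data. This minimal sketch rebuilds only the sex and rank columns; `mean_rank` is an illustrative name, and 'K' is taken to be the notebook's code for female.

```python
import pandas as pd

# Ranks 1..27 in dominance order, with the sex of each dog ('M' male, 'K' female).
Rang = list(range(1, 28))
Seks = ['M', 'M', 'M', 'M', 'M', 'M', 'K', 'K', 'K', 'K', 'M', 'M', 'M', 'M',
        'K', 'K', 'M', 'M', 'M', 'M', 'M', 'K', 'K', 'K', 'K', 'K', 'K']
df2 = pd.DataFrame({'Sex': Seks, 'Rang': Rang})

# Mean dominance rank per sex; a lower mean rank means more dominant on average.
mean_rank = df2.groupby('Sex')['Rang'].mean()
print(mean_rank.round(1))
```

Because rank 1 is the most dominant dog, the lower mean rank for males indicates that males tend to sit higher in the hierarchy.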
