PL170120201440

Guerry, Essay on the Moral Statistics of France (1833)

Source: http://datavis.ca/gallery/guerry/guerrydat.html

dept wydział Num ID działu Standardowe numery dla działów, z wyjątkiem Korsyki (200).
Region Region Char (1) Region Region Francji („N” = „Północ”, „S” = „Południe”, „E” = „Wschód”, „W” = „Zachód”, „C” = „Środek”). Korsyka ma kod „” (brak, NA)
Department Departament Char (25) Nazwa departamentu Nazwy departamentów są nazywane zgodnie z użyciem w 1830 roku. Zobacz depts.txt, aby uzyskać alternatywne kodowanie nazw departamentów.
Crime_pers Crime_pers Num Muzyka pop. za przestępstwa przeciwko osobom Źródło: A2 (Compte général, 1825-1830)
Crime_prop Crime_prop Num Muzyka pop. za przestępstwo przeciwko mieniu Źródło: A2 (Compte général, 1825-1830)
Literacy Alfabetyzacja Num Procent odczytu i zapisu Procent poborowych wojskowych, którzy potrafią czytać i pisać Źródło: A2
Donations Darowizny Num Darowizny dla biednych Źródło: A2 (Bulletin des lois)
Infants Niemowlęta Num Muzyka pop. na nielegalne urodzenie Źródło: A2 (Bureaau des Longitudes, 1817-1821)
Suicides Samobójstwa Num Muzyka pop. na samobójstwo Źródło: A2 (Compte général, 1827-1830)
MainCity Główne Miasto Char (5) Rozmiar głównego miasta Rozmiar głównego miasta („1: Sm”, „2: Med”, „3: Lg”), stosowany jako surogat dla gęstości zalewania. Duże odnosi się do górnej 10, małe do dolnej 10; wszystkie pozostałe są sklasyfikowane jako średnie. Źródło: A1
Wealth Bogactwo Ranga Podatek per capita od nieruchomości osobistych Indeks rankingowy oparty na podatkach od nieruchomości osobistych i ruchomych na mieszkańca Źródło: A1
Commerce Handel Ranga Handel i przemysł Handel i przemysł mierzony rangą liczby patentów / populacji Źródło: A1
Clergy Kler Ranga Dystrybucja duchowieństwa Dystrybucja duchowieństwa mierzona stopniem liczby księży katolickich w czynnej służbie / ludności Źródło: A1 (Almanach officiel du clergy, 1829)
Crime_parents Kryminalni rodzice Ranga Przestępstwa przeciwko rodzicom Przestępstwa przeciwko rodzicom, mierzone stopniem stosunku przestępstw przeciwko rodzicom do wszystkich przestępstw – Średnia z lat 1825–1830 Źródło: A1 (Compte général)
Infanticide Dzieciobójca Ranga Liczba dzieciobójstw na jednego mieszkańca Wskaźnik wskaźnikowy liczby dzieciobójstw do liczby ludności – Średnia z lat 1825–1830 Źródło: A1 (Compte général)
Donation_clergy Duchowieństwo Ranga Darowizny dla duchowieństwa Wskaźniki rankingowe liczby zapisów i darowizn między żywymi dla ludności – Średnia dla lat 1815–1824 Źródło: A1 (Bull. Des lois, ordunn. D’autorization)
Lottery Loteria Ranga Zakład per capita w Royal Lottery Ranking rankingowy wpływów z zakładów loterii królewskiej do liczby ludności — Średnia z lat 1822–1826 Źródło: A1 (Compte rendus par le ministre des finances)
Desertion Dezercja Ranga Dezercja wojskowa Desercja wojskowa, stosunek liczby młodych żołnierzy oskarżonych o dezercję do siły kontyngentu wojskowego, minus deficyt spowodowany niewystarczalnością dostępnych polan – średnia z lat 1825–1827. Źródło: A1 (Compte du ministere du guerre, 1829 etat V)
Instruction Instrukcja Ranga Rankingi instrukcji zapisane z mapy instrukcji Guerry’ego. Uwaga: jest to odwrotnie związane z umiejętnością czytania i pisania (zgodnie z definicją tutaj).
Prostitutes Prostytutki Num Prostytutki w Paryżu Liczba prostytutek zarejestrowanych w Paryżu w latach 1816–1834, sklasyfikowanych według departamentu ich urodzenia Źródło: Parent-Duchatelet (1836), De la prostitution en Paris
Distance Dystans Num Odległość do Paryża (km) Odległość każdego centroidu departamentu do centroidu Sekwany (Paryż) Źródło: obliczone na podstawie centroidów departamentów
Area Powierzchnia Num Obszar (1000 km^2) Źródło: Angeville (1836)
Pop1831 Pop1831 Num 1831 populacja ludności w 1831 roku, pochodzi z Angeville (1836), Essai sur la Statistique de la Populacja français w 1000s.

import pandas as pd

df = pd.read_csv('c:/1/Guerry.csv', index_col='Department')
del df['Unnamed: 0']
df.head(4)

import matplotlib.pyplot as plt

CORREL = df.corr().sort_values('Wealth')

plt.figure(figsize=(10,6))
CORREL['Wealth'].plot(kind='barh', color='red')
plt.title('Korelacja ze zmienną wynikową', fontsize=20)
plt.xlabel('Poziom korelacji')
plt.ylabel('Zmienne nezależne ciągłe')

Text(0, 0.5, 'Zmienne nezależne ciągłe')

Sprawdziliśmy już w części drugiej kompletność danych oraz format zmiennych dla zbioru: Guerry.csv.

Teraz będę budował regresjię wieloczynnikową.

Zmienną zależną będzie poziom lokalnej zamożności mieszkańców (’Wealth’)
zmiennymi opisującymi będą: prostytucja (Prostitutes) – wskazująca na istnienie lokalnego nadmiaru siły roboczej i braku zajęcia, poziom darowizn na kościół (’Donation_clergy’), oraz poziom lokalnego przemysłu i handlu (’Commerce’).

from statsmodels.formula.api import ols
import statsmodels.api as sm

X = df[['Commerce','Prostitutes','Donation_clergy']]
y = df['Wealth']

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

model_fit.summary()

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

Przykład dobrej odpowiedzi modelu: zmienna 'Donation_clergy’¶

fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(model_fit, "Donation_clergy", ax=ax)

import seaborn as sns

sns.regplot(x='Donation_clergy', y='Wealth', data=df)

Przykład złej odpowiedzi modelu: zmienna 'Prostitutes’

fig, ax = plt.subplots(figsize=(12, 8))
fig = sm.graphics.plot_fit(model_fit, "Prostitutes", ax=ax)

sns.regplot(x='Prostitutes', y='Wealth', data=df)

Guerry, Essay on the Moral Statistics of France (1833)

Source: http://datavis.ca/gallery/guerry/guerrydat.html

dept wydział Num ID działu Standardowe numery dla działów, z wyjątkiem Korsyki (200).
Region Region Char (1) Region Region Francji („N” = „Północ”, „S” = „Południe”, „E” = „Wschód”, „W” = „Zachód”, „C” = „Środek”). Korsyka ma kod „” (brak, NA)
Department Departament Char (25) Nazwa departamentu Nazwy departamentów są nazywane zgodnie z użyciem w 1830 roku. Zobacz depts.txt, aby uzyskać alternatywne kodowanie nazw departamentów.
Crime_pers Crime_pers Num Muzyka pop. za przestępstwa przeciwko osobom Źródło: A2 (Compte général, 1825-1830)
Crime_prop Crime_prop Num Muzyka pop. za przestępstwo przeciwko mieniu Źródło: A2 (Compte général, 1825-1830)
Literacy Alfabetyzacja Num Procent odczytu i zapisu Procent poborowych wojskowych, którzy potrafią czytać i pisać Źródło: A2
Donations Darowizny Num Darowizny dla biednych Źródło: A2 (Bulletin des lois)
Infants Niemowlęta Num Muzyka pop. na nielegalne urodzenie Źródło: A2 (Bureaau des Longitudes, 1817-1821)
Suicides Samobójstwa Num Muzyka pop. na samobójstwo Źródło: A2 (Compte général, 1827-1830)
MainCity Główne Miasto Char (5) Rozmiar głównego miasta Rozmiar głównego miasta („1: Sm”, „2: Med”, „3: Lg”), stosowany jako surogat dla gęstości zalewania. Duże odnosi się do górnej 10, małe do dolnej 10; wszystkie pozostałe są sklasyfikowane jako średnie. Źródło: A1
Wealth Bogactwo Ranga Podatek per capita od nieruchomości osobistych Indeks rankingowy oparty na podatkach od nieruchomości osobistych i ruchomych na mieszkańca Źródło: A1
Commerce Handel Ranga Handel i przemysł Handel i przemysł mierzony rangą liczby patentów / populacji Źródło: A1
Clergy Kler Ranga Dystrybucja duchowieństwa Dystrybucja duchowieństwa mierzona stopniem liczby księży katolickich w czynnej służbie / ludności Źródło: A1 (Almanach officiel du clergy, 1829)
Crime_parents Kryminalni rodzice Ranga Przestępstwa przeciwko rodzicom Przestępstwa przeciwko rodzicom, mierzone stopniem stosunku przestępstw przeciwko rodzicom do wszystkich przestępstw – Średnia z lat 1825–1830 Źródło: A1 (Compte général)
Infanticide Dzieciobójca Ranga Liczba dzieciobójstw na jednego mieszkańca Wskaźnik wskaźnikowy liczby dzieciobójstw do liczby ludności – Średnia z lat 1825–1830 Źródło: A1 (Compte général)
Donation_clergy Duchowieństwo Ranga Darowizny dla duchowieństwa Wskaźniki rankingowe liczby zapisów i darowizn między żywymi dla ludności – Średnia dla lat 1815–1824 Źródło: A1 (Bull. Des lois, ordunn. D’autorization)
Lottery Loteria Ranga Zakład per capita w Royal Lottery Ranking rankingowy wpływów z zakładów loterii królewskiej do liczby ludności — Średnia z lat 1822–1826 Źródło: A1 (Compte rendus par le ministre des finances)
Desertion Dezercja Ranga Dezercja wojskowa Desercja wojskowa, stosunek liczby młodych żołnierzy oskarżonych o dezercję do siły kontyngentu wojskowego, minus deficyt spowodowany niewystarczalnością dostępnych polan – średnia z lat 1825–1827. Źródło: A1 (Compte du ministere du guerre, 1829 etat V)
Instruction Instrukcja Ranga Rankingi instrukcji zapisane z mapy instrukcji Guerry’ego. Uwaga: jest to odwrotnie związane z umiejętnością czytania i pisania (zgodnie z definicją tutaj).
Prostitutes Prostytutki Num Prostytutki w Paryżu Liczba prostytutek zarejestrowanych w Paryżu w latach 1816–1834, sklasyfikowanych według departamentu ich urodzenia Źródło: Parent-Duchatelet (1836), De la prostitution en Paris
Distance Dystans Num Odległość do Paryża (km) Odległość każdego centroidu departamentu do centroidu Sekwany (Paryż) Źródło: obliczone na podstawie centroidów departamentów
Area Powierzchnia Num Obszar (1000 km^2) Źródło: Angeville (1836)
Pop1831 Pop1831 Num 1831 populacja ludności w 1831 roku, pochodzi z Angeville (1836), Essai sur la Statistique de la Populacja français w 1000s.

import pandas as pd

df = pd.read_csv('c:/1/Guerry.csv', index_col='Department')
del df['Unnamed: 0']
df.head(4)

import matplotlib.pyplot as plt

CORREL = df.corr().sort_values('Wealth')

plt.figure(figsize=(10,6))
CORREL['Wealth'].plot(kind='barh', color='red')
plt.title('Korelacja ze zmienną wynikową', fontsize=20)
plt.xlabel('Poziom korelacji')
plt.ylabel('Zmienne nezależne ciągłe')

Text(0, 0.5, 'Zmienne nezależne ciągłe')

Sprawdziliśmy już w części drugiej kompletność danych oraz format zmiennych dla zbioru: Guerry.csv.

Teraz będę budował regresjię wieloczynnikową.¶

Zmienną zależną będzie poziom lokalnej zamożności mieszkańców (’Wealth’)
zmiennymi opisującymi będą: prostytucja (Prostitutes) – wskazująca na istnienie lokalnego nadmiaru siły roboczej i braku zajęcia, poziom darowizn na kościół (’Donation_clergy’), oraz poziom lokalnego przemysłu i handlu (’Commerce’).

from statsmodels.formula.api import ols
import statsmodels.api as sm

X = df[['Commerce','Prostitutes','Donation_clergy']]
y = df['Wealth']

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

model_fit.summary()

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

Teraz analizujemy wartości odstające:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(18,13))
fig = sm.graphics.influence_plot(model_fit, ax=ax, criterion="cooks")
plt.axhline(y=0, color='green', linestyle='-', lw=0.6)
plt.axhline(y=2, color='r', linestyle='--', lw=0.6)
plt.axhline(y=-2, color='r', linestyle='--', lw=0.6)
plt.axvline(x= 0.2, color = 'r', linestyle='--', lw=0.6)
plt.axvline(x= 0.014, color = 'r', linestyle='--', lw=0.6)

Eliminujemy Seine jako miasto bardzo zniekształacające model. Problem tego miasta można zbadać przy okazji innych badań. Wykres wpływów jest bardzo dobrym narzędziem do wyszukiwania anomalii w danych.
Teraz eliminujemy miasto Seine z danych i robimy model od nowa.

df=df.drop(['Seine','Seine-et-Oise'],axis=0)
X = df[['Commerce','Prostitutes','Donation_clergy']]
y = df['Wealth']

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

model_fit.summary()

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(18,13))
fig = sm.graphics.influence_plot(model_fit, ax=ax, criterion="cooks")
plt.axhline(y=0, color='green', linestyle='-', lw=0.6)
plt.axhline(y=2, color='r', linestyle='--', lw=0.6)
plt.axhline(y=-2, color='r', linestyle='--', lw=0.6)
plt.axvline(x= 0.2, color = 'r', linestyle='--', lw=0.6)
plt.axvline(x= 0.014, color = 'r', linestyle='--', lw=0.6)

Partial Regression Plots (Duncan plot)¶

Ponieważ wykonujemy regresje wielowymiarowe, nie możemy po prostu patrzeć na poszczególne wykresy dwuwymiarowe, aby rozróżnić relacje. Zamiast tego chcemy spojrzeć na zależność zmiennej zależnej i zmiennych niezależnych od innych zmiennych niezależnych. Możemy to zrobić za pomocą wykresów Partial Regression Plots zwanych również added variable plots.

fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.plot_partregress("Wealth", "Commerce", ["Commerce", "Donation_clergy"], data=df, ax=ax)

Na powyższym rysunku pokazano jaka jest zależność jakości gospodarki (na osi x) do poziomu bogactwa (na osi y) w kontekście darowizn na kościół.
Zauważmy, że odstają te miejscowości, które mają dużą odległość na wykresie wpływów: Lot-et-Garonne, Bouches-du-Rhone, Ardache, Vosges.

fig, ax = plt.subplots(figsize=(12,8))
fig = sm.graphics.plot_partregress("Wealth", "Commerce", ["Commerce", "Prostitutes"], data=df, ax=ax)

Tym razem analizujemy zależność jakości gospodarki (na osi x) do poziomu bogactwa (na osi y) w kontekście prostytucji (czyli poziomu emigracji kobiet do Paryża). Zauważmy, że odstają znowu te miejscowości, które mają dużą odległość na wykresie wpływów Cooka: Lot-et-Garonne, Bouches-du-Rhone (jako biedne), Ardache, Vosges (jako bogate).
Wykres Partial Regression Plots jest doskonałym narzędziem wyszukującym anomalie w danych.

Warto zauważyć, że mający dużą dzwignię na wykresie wpływów: Seine-inferieure na wykresie Duncan plot w kontekście darowizn na kościuł był jako bardzo nisko (biedny) w kontekście prostytucji ma zupełnie inne położenie. Miejscowość ta jest napewno kolejną anomalią.

Jak wyglądał by wykres Partial Regression Plots gdyby analizować tylko poziom 'Prostitutes’ bez kontekstu na 'Commerce’?¶

fix, ax = plt.subplots(figsize=(12,14))
fig = sm.graphics.plot_partregress("Wealth", "Commerce", ["Prostitutes"], data=df, ax=ax)

PL170120201220

In [1]:

import pandas as pd
df = pd.read_csv('c:/1/WorldHappinessReport.csv', index_col='Country')
df.head(3)
# 19:20 - 15 minut -21:55

print(df.index)

Index([         'Afghanistan',              'Albania',              'Algeria',
                     'Angola',            'Argentina',              'Armenia',
                  'Australia',              'Austria',           'Azerbaijan',
                    'Bahrain',
       ...
       'United Arab Emirates',       'United Kingdom',        'United States',
                    'Uruguay',           'Uzbekistan',            'Venezuela',
                    'Vietnam',                'Yemen',               'Zambia',
                   'Zimbabwe'],
      dtype='object', name='Country', length=495)

Region                           0
Happiness Rank                   0
Happiness Score                  0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
Year                             0
dtype: int64

Region                            object
Happiness Rank                   float64
Happiness Score                  float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
Year                             float64
dtype: object

Happiness Rank                  -0.791931
Generosity                      -0.017458
Dystopia Residual                0.039211
Year                             0.133030
Trust (Government Corruption)    0.298720
Freedom                          0.344037
Family                           0.584572
Happiness Score                  0.787068
Health (Life Expectancy)         0.791212
Economy (GDP per Capita)         1.000000
Name: Economy (GDP per Capita), dtype: float64

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         HappinessScore   R-squared:                       0.619
Model:                            OLS   Adj. R-squared:                  0.619
Method:                 Least Squares   F-statistic:                     760.3
Date:                Fri, 17 Jan 2020   Prob (F-statistic):          4.90e-100
Time:                        11:56:55   Log-Likelihood:                -499.12
No. Observations:                 469   AIC:                             1002.
Df Residuals:                     467   BIC:                             1011.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    3.3706      0.079     42.419      0.000       3.214       3.527
Economy_GDPperCapita     2.1585      0.078     27.573      0.000       2.005       2.312
==============================================================================
Omnibus:                        1.916   Durbin-Watson:                   2.080
Prob(Omnibus):                  0.384   Jarque-Bera (JB):                1.718
Skew:                           0.037   Prob(JB):                        0.423
Kurtosis:                       2.713   Cond. No.                         4.68
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       ...
       'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='Country', length=469)

3.370636826209402

Text(0.5, 1.0, 'Linear Regression Plot')

Stawiamy hipotezę zerową, że bogactwo (Economy (GDP per Capita)) jest bezpośrednią przyczyną szczęścia (Happiness Score).

1. Sprawdzamy czy dane są kompletne i jaki mają format

del df['Unnamed: 0']
df = df.dropna(how ='any')
df.isnull().sum()

Region                           0
Happiness Rank                   0
Happiness Score                  0
Economy (GDP per Capita)         0
Family                           0
Health (Life Expectancy)         0
Freedom                          0
Trust (Government Corruption)    0
Generosity                       0
Dystopia Residual                0
Year                             0
dtype: int64

df.dtypes

Region                            object
Happiness Rank                   float64
Happiness Score                  float64
Economy (GDP per Capita)         float64
Family                           float64
Health (Life Expectancy)         float64
Freedom                          float64
Trust (Government Corruption)    float64
Generosity                       float64
Dystopia Residual                float64
Year                             float64
dtype: object

2. Sprawdzamy korelację zmiennych niezależnych ze zmienną zależną¶

CORREL = df.corr().sort_values('Economy (GDP per Capita)')
CORREL['Economy (GDP per Capita)']

Happiness Rank                  -0.791931
Generosity                      -0.017458
Dystopia Residual                0.039211
Year                             0.133030
Trust (Government Corruption)    0.298720
Freedom                          0.344037
Family                           0.584572
Happiness Score                  0.787068
Health (Life Expectancy)         0.791212
Economy (GDP per Capita)         1.000000
Name: Economy (GDP per Capita), dtype: float64

df.columns = [ 'Region', 'HappinessRank', 'HappinessScore',
       'Economy_GDPperCapita', 'Family', 'Health_LifeExpectancy_',
       'Freedom', 'Trust_GovernmentCorruption_', 'Generosity',
       'Dystopia_Residual', 'Year']

3. Robimy model OLS¶

from statsmodels.formula.api import ols
import statsmodels.api as sm

X = df['Economy_GDPperCapita']
y = df['HappinessScore']

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

print(model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:         HappinessScore   R-squared:                       0.619
Model:                            OLS   Adj. R-squared:                  0.619
Method:                 Least Squares   F-statistic:                     760.3
Date:                Fri, 17 Jan 2020   Prob (F-statistic):          4.90e-100
Time:                        11:56:55   Log-Likelihood:                -499.12
No. Observations:                 469   AIC:                             1002.
Df Residuals:                     467   BIC:                             1011.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                    3.3706      0.079     42.419      0.000       3.214       3.527
Economy_GDPperCapita     2.1585      0.078     27.573      0.000       2.005       2.312
==============================================================================
Omnibus:                        1.916   Durbin-Watson:                   2.080
Prob(Omnibus):                  0.384   Jarque-Bera (JB):                1.718
Skew:                           0.037   Prob(JB):                        0.423
Kurtosis:                       2.713   Cond. No.                         4.68
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       ...
       'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='Country', length=469)

3.370636826209402

Text(0.5, 1.0, 'Linear Regression Plot')

Text(0.5, 0, 'Normalized residuals')

Index(['Ain', 'Aisne', 'Allier', 'Basses-Alpes', 'Hautes-Alpes', 'Ardeche',
       'Ardennes', 'Ariege', 'Aube', 'Aude', 'Aveyron', 'Bouches-du-Rhone',
       'Calvados', 'Cantal', 'Charente', 'Charente-Inferieure', 'Cher',
       'Correze', 'Cote-d'Or', 'Cotes-du-Nord', 'Creuse', 'Dordogne', 'Doubs',
       'Drome', 'Eure', 'Eure-et-Loir', 'Finistere', 'Gard', 'Haute-Garonne',
       'Gers', 'Gironde', 'Herault', 'Ille-et-Vilaine', 'Indre',
       'Indre-et-Loire', 'Isere', 'Jura', 'Landes', 'Loir-et-Cher', 'Loire',
       'Haute-Loire', 'Loire-Inferieure', 'Loiret', 'Lot', 'Lot-et-Garonne',
       'Lozere', 'Maine-et-Loire', 'Manche', 'Marne', 'Haute-Marne', 'Mayenne',
       'Meurthe', 'Meuse', 'Morbihan', 'Moselle', 'Nievre', 'Nord', 'Oise',
       'Orne', 'Pas-de-Calais', 'Puy-de-Dome', 'Basses-Pyrenees',
       'Hautes-Pyrenees', 'Pyrenees-Orientales', 'Bas-Rhin', 'Haut-Rhin',
       'Rhone', 'Haute-Saone', 'Saone-et-Loire', 'Sarthe', 'Seine',
       'Seine-Inferieure', 'Seine-et-Marne', 'Seine-et-Oise', 'Deux-Sevres',
       'Somme', 'Tarn', 'Tarn-et-Garonne', 'Var', 'Vaucluse', 'Vendee',
       'Vienne', 'Haute-Vienne', 'Vosges', 'Yonne', 'Corse'],
      dtype='object', name='Department')

dept               0
Region             1
Crime_pers         0
Crime_prop         0
Literacy           0
Donations          0
Infants            0
Suicides           0
MainCity           0
Wealth             0
Commerce           0
Clergy             0
Crime_parents      0
Infanticide        0
Donation_clergy    0
Lottery            0
Desertion          0
Instruction        0
Prostitutes        0
Distance           0
Area               0
Pop1831            0
dtype: int64

df.index

Index(['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Argentina', 'Armenia',
       'Australia', 'Austria', 'Azerbaijan', 'Bahrain',
       ...
       'United Arab Emirates', 'United Kingdom', 'United States', 'Uruguay',
       'Uzbekistan', 'Venezuela', 'Vietnam', 'Yemen', 'Zambia', 'Zimbabwe'],
      dtype='object', name='Country', length=469)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(18,13))
fig = sm.graphics.influence_plot(model_fit, ax=ax, criterion="cooks")
plt.axhline(y=0, color='green', linestyle='-', lw=0.6)
plt.axhline(y=2, color='r', linestyle='--', lw=0.6)
plt.axhline(y=-2, color='r', linestyle='--', lw=0.6)
plt.axvline(x= 0.01, color = 'r', linestyle='--', lw=0.6)
plt.axvline(x= 0.014, color = 'r', linestyle='--', lw=0.6)

3.370636826209402

Text(0.5, 1.0, 'Linear Regression Plot')

Text(0.5, 0, 'Normalized residuals')

Index(['Ain', 'Aisne', 'Allier', 'Basses-Alpes', 'Hautes-Alpes', 'Ardeche',
       'Ardennes', 'Ariege', 'Aube', 'Aude', 'Aveyron', 'Bouches-du-Rhone',
       'Calvados', 'Cantal', 'Charente', 'Charente-Inferieure', 'Cher',
       'Correze', 'Cote-d'Or', 'Cotes-du-Nord', 'Creuse', 'Dordogne', 'Doubs',
       'Drome', 'Eure', 'Eure-et-Loir', 'Finistere', 'Gard', 'Haute-Garonne',
       'Gers', 'Gironde', 'Herault', 'Ille-et-Vilaine', 'Indre',
       'Indre-et-Loire', 'Isere', 'Jura', 'Landes', 'Loir-et-Cher', 'Loire',
       'Haute-Loire', 'Loire-Inferieure', 'Loiret', 'Lot', 'Lot-et-Garonne',
       'Lozere', 'Maine-et-Loire', 'Manche', 'Marne', 'Haute-Marne', 'Mayenne',
       'Meurthe', 'Meuse', 'Morbihan', 'Moselle', 'Nievre', 'Nord', 'Oise',
       'Orne', 'Pas-de-Calais', 'Puy-de-Dome', 'Basses-Pyrenees',
       'Hautes-Pyrenees', 'Pyrenees-Orientales', 'Bas-Rhin', 'Haut-Rhin',
       'Rhone', 'Haute-Saone', 'Saone-et-Loire', 'Sarthe', 'Seine',
       'Seine-Inferieure', 'Seine-et-Marne', 'Seine-et-Oise', 'Deux-Sevres',
       'Somme', 'Tarn', 'Tarn-et-Garonne', 'Var', 'Vaucluse', 'Vendee',
       'Vienne', 'Haute-Vienne', 'Vosges', 'Yonne', 'Corse'],
      dtype='object', name='Department')

dept               0
Region             1
Crime_pers         0
Crime_prop         0
Literacy           0
Donations          0
Infants            0
Suicides           0
MainCity           0
Wealth             0
Commerce           0
Clergy             0
Crime_parents      0
Infanticide        0
Donation_clergy    0
Lottery            0
Desertion          0
Instruction        0
Prostitutes        0
Distance           0
Area               0
Pop1831            0
dtype: int64

dept                 int64
Region              object
Crime_pers           int64
Crime_prop           int64
Literacy             int64
Donations            int64
Infants              int64
Suicides             int64
MainCity            object
Wealth               int64
Commerce             int64
Clergy               int64
Crime_parents        int64
Infanticide          int64
Donation_clergy      int64
Lottery              int64
Desertion            int64
Instruction          int64
Prostitutes          int64
Distance           float64
Area                 int64
Pop1831            float64
dtype: object

Text(0, 0.5, 'Zmienne nezależne ciągłe')

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:57   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpretacja:
To jest wykres resztek modelu (wartości rezydualnych). Wielkość bąbli oznacza odległość Cooka.

Odległość Cooka

Jest powszechnie stosowanym oszacowaniem wpływu punktu danych podczas przeprowadzania analizy regresji metodą najmniejszych kwadratów. [1] W praktycznej analizie zwykłych metodą najmniejszych kwadratów odległość Cooka można wykorzystać na kilka sposobów:

do identyfikacji wpływowych punktów danych, które są szczególnie warte sprawdzenia pod kątem ważności, lub
wskazania obszaru przestrzeni projektowej, w których dobrze byłoby uzyskać więcej punktów danych.

Oznacza to, że odległość Cooksa mierzy wpływ każdej obserwacji w modelu lub odpowiada na pytanie: „co by się stało, gdyby określonej obserwacji nie było w modelu”, jest to jeden ze sposobów wykrywania wartości odstających, które szczególnie wpływają na linię regresji.
Wartości odstające w modelu mogą nie być najbardziej reprezentatywne lub odpowiednie, co może prowadzić do nieprawidłowych wniosków.

H Leverage oraz dystanst Cooka mierzy wpływ poszczególnych obserwacji na model OLS.
Wpływ obserwacji można rozpatrywać w kategoriach, jak bardzo przewidywane wyniki dla innych obserwacji byłyby różne, gdyby nie uwzględniono danej obserwacji. Jeśli przewidywania są takie same z daną obserwacją lub bez niej, obserwacja nie ma wpływu na model regresji.
Powszechną powszechną zasadą jest to, że obserwacja o wartości D Cooka powyżej 1,0 ma zbyt duży wpływ.

Wpływ obserwacji jest funkcją dwóch czynników:

dźwignią obserwacji (leverage) jak bardzo wartość obserwacji na zmiennej predyktora różni się od średniej zmiennej predyktora i
odległością obserwacji różnica między przewidywanym wynikiem dla obserwacji a jej rzeczywistym wynikiem.
http://onlinestatbook.com/2/regression/influential.html

INTERPRETACJA: Somalia, Mozambique oraz SomaliandRegion to bardzo wpływowe dane ponieważ bomble są wielki (czym większe ty większe odstawanie od danych wg Cooka). Na wykresie widać, że wartości rezydualne dla tych krajów bardzo odstają od średniej.

Wymienione kraje mają również wysoką dźwignię. Czyli można wysnuć wniosek że kraje te destabilizują model. To tak zwane wyjątki potwierdzające regułę.

Obserwacja z dużej odległości nie będzie miała tak dużego wpływu, jeśli jej dźwignia będzie niska. To połączenie dźwigni obserwacji i odległości determinuje jej wpływ.

Leverage

W statystyce, a zwłaszcza w analizie regresji, dźwignia jest miarą tego, jak daleko wartości zmiennych niezależnych w obserwacji są od wartości innych obserwacji.
Punkty o dużej dźwigni to ewentualne obserwacje dokonywane przy skrajnych lub odległych wartościach zmiennych niezależnych, tak że brak obserwacji sąsiednich oznacza, że dopasowany model regresji przejdzie blisko tej konkretnej obserwacji.
https://en.wikipedia.org/wiki/Leverage_(statistics)

Innymi słowy mówiąc, punkt o wysokiej dźwigni odsuwa linię regresji OLS od jej w łaściwego dopasowania. Czym Wyższy poziom levarage dla danej obserwacji tym obserwacja ta bardziej destabilizuje model OLS.

model_fit.params[0]

3.370636826209402

X = df['Economy_GDPperCapita']
y = df['HappinessScore']

## wykres--------------------------------------###
plt.figure()
plt.scatter(df.Economy_GDPperCapita, df.HappinessScore, c = 'grey')
plt.plot(df.Economy_GDPperCapita, model_fit.params[0] + model_fit.params[1] * df.Economy_GDPperCapita, c = 'r')
plt.xlabel('Economy_GDPperCapita')
plt.ylabel('HappinessScore')
plt.title("Linear Regression Plot")

Text(0.5, 1.0, 'Linear Regression Plot')

Text(0.5, 0, 'Normalized residuals')

Index(['Ain', 'Aisne', 'Allier', 'Basses-Alpes', 'Hautes-Alpes', 'Ardeche',
       'Ardennes', 'Ariege', 'Aube', 'Aude', 'Aveyron', 'Bouches-du-Rhone',
       'Calvados', 'Cantal', 'Charente', 'Charente-Inferieure', 'Cher',
       'Correze', 'Cote-d'Or', 'Cotes-du-Nord', 'Creuse', 'Dordogne', 'Doubs',
       'Drome', 'Eure', 'Eure-et-Loir', 'Finistere', 'Gard', 'Haute-Garonne',
       'Gers', 'Gironde', 'Herault', 'Ille-et-Vilaine', 'Indre',
       'Indre-et-Loire', 'Isere', 'Jura', 'Landes', 'Loir-et-Cher', 'Loire',
       'Haute-Loire', 'Loire-Inferieure', 'Loiret', 'Lot', 'Lot-et-Garonne',
       'Lozere', 'Maine-et-Loire', 'Manche', 'Marne', 'Haute-Marne', 'Mayenne',
       'Meurthe', 'Meuse', 'Morbihan', 'Moselle', 'Nievre', 'Nord', 'Oise',
       'Orne', 'Pas-de-Calais', 'Puy-de-Dome', 'Basses-Pyrenees',
       'Hautes-Pyrenees', 'Pyrenees-Orientales', 'Bas-Rhin', 'Haut-Rhin',
       'Rhone', 'Haute-Saone', 'Saone-et-Loire', 'Sarthe', 'Seine',
       'Seine-Inferieure', 'Seine-et-Marne', 'Seine-et-Oise', 'Deux-Sevres',
       'Somme', 'Tarn', 'Tarn-et-Garonne', 'Var', 'Vaucluse', 'Vendee',
       'Vienne', 'Haute-Vienne', 'Vosges', 'Yonne', 'Corse'],
      dtype='object', name='Department')

dept               0
Region             1
Crime_pers         0
Crime_prop         0
Literacy           0
Donations          0
Infants            0
Suicides           0
MainCity           0
Wealth             0
Commerce           0
Clergy             0
Crime_parents      0
Infanticide        0
Donation_clergy    0
Lottery            0
Desertion          0
Instruction        0
Prostitutes        0
Distance           0
Area               0
Pop1831            0
dtype: int64

dept                 int64
Region              object
Crime_pers           int64
Crime_prop           int64
Literacy             int64
Donations            int64
Infants              int64
Suicides             int64
MainCity            object
Wealth               int64
Commerce             int64
Clergy               int64
Crime_parents        int64
Infanticide          int64
Donation_clergy      int64
Lottery              int64
Desertion            int64
Instruction          int64
Prostitutes          int64
Distance           float64
Area                 int64
Pop1831            float64
dtype: object

Text(0, 0.5, 'Zmienne nezależne ciągłe')

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:57   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

plt.hist(model_fit.resid_pearson)
plt.ylabel('Count')
plt.xlabel('Normalized residuals')

Text(0.5, 0, 'Normalized residuals')

Index(['Ain', 'Aisne', 'Allier', 'Basses-Alpes', 'Hautes-Alpes', 'Ardeche',
       'Ardennes', 'Ariege', 'Aube', 'Aude', 'Aveyron', 'Bouches-du-Rhone',
       'Calvados', 'Cantal', 'Charente', 'Charente-Inferieure', 'Cher',
       'Correze', 'Cote-d'Or', 'Cotes-du-Nord', 'Creuse', 'Dordogne', 'Doubs',
       'Drome', 'Eure', 'Eure-et-Loir', 'Finistere', 'Gard', 'Haute-Garonne',
       'Gers', 'Gironde', 'Herault', 'Ille-et-Vilaine', 'Indre',
       'Indre-et-Loire', 'Isere', 'Jura', 'Landes', 'Loir-et-Cher', 'Loire',
       'Haute-Loire', 'Loire-Inferieure', 'Loiret', 'Lot', 'Lot-et-Garonne',
       'Lozere', 'Maine-et-Loire', 'Manche', 'Marne', 'Haute-Marne', 'Mayenne',
       'Meurthe', 'Meuse', 'Morbihan', 'Moselle', 'Nievre', 'Nord', 'Oise',
       'Orne', 'Pas-de-Calais', 'Puy-de-Dome', 'Basses-Pyrenees',
       'Hautes-Pyrenees', 'Pyrenees-Orientales', 'Bas-Rhin', 'Haut-Rhin',
       'Rhone', 'Haute-Saone', 'Saone-et-Loire', 'Sarthe', 'Seine',
       'Seine-Inferieure', 'Seine-et-Marne', 'Seine-et-Oise', 'Deux-Sevres',
       'Somme', 'Tarn', 'Tarn-et-Garonne', 'Var', 'Vaucluse', 'Vendee',
       'Vienne', 'Haute-Vienne', 'Vosges', 'Yonne', 'Corse'],
      dtype='object', name='Department')

dept               0
Region             1
Crime_pers         0
Crime_prop         0
Literacy           0
Donations          0
Infants            0
Suicides           0
MainCity           0
Wealth             0
Commerce           0
Clergy             0
Crime_parents      0
Infanticide        0
Donation_clergy    0
Lottery            0
Desertion          0
Instruction        0
Prostitutes        0
Distance           0
Area               0
Pop1831            0
dtype: int64

dept                 int64
Region              object
Crime_pers           int64
Crime_prop           int64
Literacy             int64
Donations            int64
Infants              int64
Suicides             int64
MainCity            object
Wealth               int64
Commerce             int64
Clergy               int64
Crime_parents        int64
Infanticide          int64
Donation_clergy      int64
Lottery              int64
Desertion            int64
Instruction          int64
Prostitutes          int64
Distance           float64
Area                 int64
Pop1831            float64
dtype: object

Text(0, 0.5, 'Zmienne nezależne ciągłe')

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:57   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

from statsmodels.graphics.regressionplots import *

plot_leverage_resid2(model_fit)
plt.axvline(x= 4, color = 'r', linestyle='--', lw=0.9)

Index(['Ain', 'Aisne', 'Allier', 'Basses-Alpes', 'Hautes-Alpes', 'Ardeche',
       'Ardennes', 'Ariege', 'Aube', 'Aude', 'Aveyron', 'Bouches-du-Rhone',
       'Calvados', 'Cantal', 'Charente', 'Charente-Inferieure', 'Cher',
       'Correze', 'Cote-d'Or', 'Cotes-du-Nord', 'Creuse', 'Dordogne', 'Doubs',
       'Drome', 'Eure', 'Eure-et-Loir', 'Finistere', 'Gard', 'Haute-Garonne',
       'Gers', 'Gironde', 'Herault', 'Ille-et-Vilaine', 'Indre',
       'Indre-et-Loire', 'Isere', 'Jura', 'Landes', 'Loir-et-Cher', 'Loire',
       'Haute-Loire', 'Loire-Inferieure', 'Loiret', 'Lot', 'Lot-et-Garonne',
       'Lozere', 'Maine-et-Loire', 'Manche', 'Marne', 'Haute-Marne', 'Mayenne',
       'Meurthe', 'Meuse', 'Morbihan', 'Moselle', 'Nievre', 'Nord', 'Oise',
       'Orne', 'Pas-de-Calais', 'Puy-de-Dome', 'Basses-Pyrenees',
       'Hautes-Pyrenees', 'Pyrenees-Orientales', 'Bas-Rhin', 'Haut-Rhin',
       'Rhone', 'Haute-Saone', 'Saone-et-Loire', 'Sarthe', 'Seine',
       'Seine-Inferieure', 'Seine-et-Marne', 'Seine-et-Oise', 'Deux-Sevres',
       'Somme', 'Tarn', 'Tarn-et-Garonne', 'Var', 'Vaucluse', 'Vendee',
       'Vienne', 'Haute-Vienne', 'Vosges', 'Yonne', 'Corse'],
      dtype='object', name='Department')

dept               0
Region             1
Crime_pers         0
Crime_prop         0
Literacy           0
Donations          0
Infants            0
Suicides           0
MainCity           0
Wealth             0
Commerce           0
Clergy             0
Crime_parents      0
Infanticide        0
Donation_clergy    0
Lottery            0
Desertion          0
Instruction        0
Prostitutes        0
Distance           0
Area               0
Pop1831            0
dtype: int64

dept                 int64
Region              object
Crime_pers           int64
Crime_prop           int64
Literacy             int64
Donations            int64
Infants              int64
Suicides             int64
MainCity            object
Wealth               int64
Commerce             int64
Clergy               int64
Crime_parents        int64
Infanticide          int64
Donation_clergy      int64
Lottery              int64
Desertion            int64
Instruction          int64
Prostitutes          int64
Distance           float64
Area                 int64
Pop1831            float64
dtype: object

Text(0, 0.5, 'Zmienne nezależne ciągłe')

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:57   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:58   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Guerry, Essay on the Moral Statistics of France (1833)

Source: http://datavis.ca/gallery/guerry/guerrydat.html

Num Zmienna Rodzaj Etykieta / opis / źródło

dept wydział Num ID działu Standardowe numery dla działów, z wyjątkiem Korsyki (200).
Region Region Char (1) Region Region Francji („N” = „Północ”, „S” = „Południe”, „E” = „Wschód”, „W” = „Zachód”, „C” = „Środek”). Korsyka ma kod „” (brak, NA)
Department Departament Char (25) Nazwa departamentu Nazwy departamentów są nazywane zgodnie z użyciem w 1830 roku. Zobacz depts.txt, aby uzyskać alternatywne kodowanie nazw departamentów.
Crime_pers Crime_pers Num Muzyka pop. za przestępstwa przeciwko osobom Źródło: A2 (Compte général, 1825-1830)
Crime_prop Crime_prop Num Muzyka pop. za przestępstwo przeciwko mieniu Źródło: A2 (Compte général, 1825-1830)
Literacy Alfabetyzacja Num Procent odczytu i zapisu Procent poborowych wojskowych, którzy potrafią czytać i pisać Źródło: A2
Donations Darowizny Num Darowizny dla biednych Źródło: A2 (Bulletin des lois)
Infants Niemowlęta Num Muzyka pop. na nielegalne urodzenie Źródło: A2 (Bureaau des Longitudes, 1817-1821)
Suicides Samobójstwa Num Muzyka pop. na samobójstwo Źródło: A2 (Compte général, 1827-1830)
MainCity Główne Miasto Char (5) Rozmiar głównego miasta Rozmiar głównego miasta („1: Sm”, „2: Med”, „3: Lg”), stosowany jako surogat dla gęstości zalewania. Duże odnosi się do górnej 10, małe do dolnej 10; wszystkie pozostałe są sklasyfikowane jako średnie. Źródło: A1
Wealth Bogactwo Ranga Podatek per capita od nieruchomości osobistych Indeks rankingowy oparty na podatkach od nieruchomości osobistych i ruchomych na mieszkańca Źródło: A1
Commerce Handel Ranga Handel i przemysł Handel i przemysł mierzony rangą liczby patentów / populacji Źródło: A1
Clergy Kler Ranga Dystrybucja duchowieństwa Dystrybucja duchowieństwa mierzona stopniem liczby księży katolickich w czynnej służbie / ludności Źródło: A1 (Almanach officiel du clergy, 1829)
Crime_parents Kryminalni rodzice Ranga Przestępstwa przeciwko rodzicom Przestępstwa przeciwko rodzicom, mierzone stopniem stosunku przestępstw przeciwko rodzicom do wszystkich przestępstw – Średnia z lat 1825–1830 Źródło: A1 (Compte général)
Infanticide Dzieciobójca Ranga Liczba dzieciobójstw na jednego mieszkańca Wskaźnik wskaźnikowy liczby dzieciobójstw do liczby ludności – Średnia z lat 1825–1830 Źródło: A1 (Compte général)
Donation_clergy Duchowieństwo Ranga Darowizny dla duchowieństwa Wskaźniki rankingowe liczby zapisów i darowizn między żywymi dla ludności – Średnia dla lat 1815–1824 Źródło: A1 (Bull. Des lois, ordunn. D’autorization)
Lottery Loteria Ranga Zakład per capita w Royal Lottery Ranking rankingowy wpływów z zakładów loterii królewskiej do liczby ludności — Średnia z lat 1822–1826 Źródło: A1 (Compte rendus par le ministre des finances)
Desertion Dezercja Ranga Dezercja wojskowa Desercja wojskowa, stosunek liczby młodych żołnierzy oskarżonych o dezercję do siły kontyngentu wojskowego, minus deficyt spowodowany niewystarczalnością dostępnych polan – średnia z lat 1825–1827. Źródło: A1 (Compte du ministere du guerre, 1829 etat V)
Instruction Instrukcja Ranga Rankingi instrukcji zapisane z mapy instrukcji Guerry’ego. Uwaga: jest to odwrotnie związane z umiejętnością czytania i pisania (zgodnie z definicją tutaj).
Prostitutes Prostytutki Num Prostytutki w Paryżu Liczba prostytutek zarejestrowanych w Paryżu w latach 1816–1834, sklasyfikowanych według departamentu ich urodzenia Źródło: Parent-Duchatelet (1836), De la prostitution en Paris
Distance Dystans Num Odległość do Paryża (km) Odległość każdego centroidu departamentu do centroidu Sekwany (Paryż) Źródło: obliczone na podstawie centroidów departamentów
Area Powierzchnia Num Obszar (1000 km^2) Źródło: Angeville (1836)
Pop1831 Pop1831 Num 1831 populacja ludności w 1831 roku, pochodzi z Angeville (1836), Essai sur la Statistique de la Populacja français w 1000s.

Tworzymy model analizujący ilość czynów kriminalnych (Crime_pers + Crime_prop) na podstawie poziomu wykształcenia (Literacy) oraz liczbą księży katolickich w czynnej służbie (Clergy) oraz poziomu prostytucji (Prostitutes).

df2 = pd.read_csv('c:/1/Guerry.csv', index_col='Department')
del df2['Unnamed: 0']
print(df2.index)
df2.head(4)

Index(['Ain', 'Aisne', 'Allier', 'Basses-Alpes', 'Hautes-Alpes', 'Ardeche',
       'Ardennes', 'Ariege', 'Aube', 'Aude', 'Aveyron', 'Bouches-du-Rhone',
       'Calvados', 'Cantal', 'Charente', 'Charente-Inferieure', 'Cher',
       'Correze', 'Cote-d'Or', 'Cotes-du-Nord', 'Creuse', 'Dordogne', 'Doubs',
       'Drome', 'Eure', 'Eure-et-Loir', 'Finistere', 'Gard', 'Haute-Garonne',
       'Gers', 'Gironde', 'Herault', 'Ille-et-Vilaine', 'Indre',
       'Indre-et-Loire', 'Isere', 'Jura', 'Landes', 'Loir-et-Cher', 'Loire',
       'Haute-Loire', 'Loire-Inferieure', 'Loiret', 'Lot', 'Lot-et-Garonne',
       'Lozere', 'Maine-et-Loire', 'Manche', 'Marne', 'Haute-Marne', 'Mayenne',
       'Meurthe', 'Meuse', 'Morbihan', 'Moselle', 'Nievre', 'Nord', 'Oise',
       'Orne', 'Pas-de-Calais', 'Puy-de-Dome', 'Basses-Pyrenees',
       'Hautes-Pyrenees', 'Pyrenees-Orientales', 'Bas-Rhin', 'Haut-Rhin',
       'Rhone', 'Haute-Saone', 'Saone-et-Loire', 'Sarthe', 'Seine',
       'Seine-Inferieure', 'Seine-et-Marne', 'Seine-et-Oise', 'Deux-Sevres',
       'Somme', 'Tarn', 'Tarn-et-Garonne', 'Var', 'Vaucluse', 'Vendee',
       'Vienne', 'Haute-Vienne', 'Vosges', 'Yonne', 'Corse'],
      dtype='object', name='Department')

dept               0
Region             1
Crime_pers         0
Crime_prop         0
Literacy           0
Donations          0
Infants            0
Suicides           0
MainCity           0
Wealth             0
Commerce           0
Clergy             0
Crime_parents      0
Infanticide        0
Donation_clergy    0
Lottery            0
Desertion          0
Instruction        0
Prostitutes        0
Distance           0
Area               0
Pop1831            0
dtype: int64

dept                 int64
Region              object
Crime_pers           int64
Crime_prop           int64
Literacy             int64
Donations            int64
Infants              int64
Suicides             int64
MainCity            object
Wealth               int64
Commerce             int64
Clergy               int64
Crime_parents        int64
Infanticide          int64
Donation_clergy      int64
Lottery              int64
Desertion            int64
Instruction          int64
Prostitutes          int64
Distance           float64
Area                 int64
Pop1831            float64
dtype: object

Text(0, 0.5, 'Zmienne nezależne ciągłe')

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:57   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:58   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

1. Sprawdzamy kompletność i format danych

df2.isnull().sum()

dept               0
Region             1
Crime_pers         0
Crime_prop         0
Literacy           0
Donations          0
Infants            0
Suicides           0
MainCity           0
Wealth             0
Commerce           0
Clergy             0
Crime_parents      0
Infanticide        0
Donation_clergy    0
Lottery            0
Desertion          0
Instruction        0
Prostitutes        0
Distance           0
Area               0
Pop1831            0
dtype: int64

df2.dtypes

dept                 int64
Region              object
Crime_pers           int64
Crime_prop           int64
Literacy             int64
Donations            int64
Infants              int64
Suicides             int64
MainCity            object
Wealth               int64
Commerce             int64
Clergy               int64
Crime_parents        int64
Infanticide          int64
Donation_clergy      int64
Lottery              int64
Desertion            int64
Instruction          int64
Prostitutes          int64
Distance           float64
Area                 int64
Pop1831            float64
dtype: object

Dane nadają się do wykorzystania w procesie budowy modelu regresji liniowej.

2. Tworzymy połączoną zmienną wynikową Crime = Crime_pers + Crime_prop jako wskaźnik skorygowany wielkością lokalnej populacji

df2['Crime'] = ((df2['Crime_pers'] + df2['Crime_prop'])/df2['Pop1831'])
df2[['Crime','Crime_pers', 'Crime_prop','Pop1831']].sample(3)

3. Analizujemy poziom korelacji zmiennych

CORREL = df2.corr().sort_values('Crime')

plt.figure(figsize=(10,6))
CORREL['Crime'].plot(kind='barh', color='red')
plt.title('Korelacja ze zmienną wynikową', fontsize=20)
plt.xlabel('Poziom korelacji')
plt.ylabel('Zmienne nezależne ciągłe')

Text(0, 0.5, 'Zmienne nezależne ciągłe')

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:57   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:58   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Text(0.5, 1.0, 'Regression')

4. Budujemy model regresji liniowej

X = df2['Pop1831']
y = df2['Crime']

model = sm.OLS(y, sm.add_constant(X))
model_fit = model.fit()

print(model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:57   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

C:ProgramDataAnaconda3libsite-packagesnumpycorefromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:58   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Text(0.5, 1.0, 'Regression')

5.

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(18,13))
fig = sm.graphics.influence_plot(model_fit, ax=ax, criterion="cooks")
plt.axhline(y=0, color='green', linestyle='-', lw=0.6)
plt.axhline(y=2, color='r', linestyle='--', lw=0.6)
plt.axhline(y=-2, color='r', linestyle='--', lw=0.6)
plt.axvline(x= 0.05, color = 'r', linestyle='--', lw=0.6)
plt.axvline(x= 0.014, color = 'r', linestyle='--', lw=0.6)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:58   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Text(0.5, 1.0, 'Regression')

from statsmodels.graphics.regressionplots import *

plot_leverage_resid2(model_fit)
plt.axvline(x= 3.5, color = 'r', linestyle='--', lw=0.9)
plt.axhline(y=0.15, color='r', linestyle='--', lw=0.6)

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:58   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Text(0.5, 1.0, 'Regression')

Interpretacja: na wykresach widać, że kilka miast mocno destabilizuje model OLS (wysoka dźwignia i wysoka odległość Cooka).

5. Usuwamy trzy obserwacje destabilizujące proces: 'Nord’, 'Creuse’, 'Hautes-Alpes’, 'Corse’, 'Seine’, 'Indre’, 'Ardennes’

df3=df2.drop(['Nord', 'Creuse', 'Hautes-Alpes', 'Corse', 'Seine', 'Indre', 'Ardennes'],axis=0)

X = df3['Pop1831']
y = df3['Crime']

model2 = sm.OLS(y, sm.add_constant(X))
model_fit2 = model2.fit()

print(model_fit.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Crime   R-squared:                       0.348
Model:                            OLS   Adj. R-squared:                  0.340
Method:                 Least Squares   F-statistic:                     44.83
Date:                Fri, 17 Jan 2020   Prob (F-statistic):           2.28e-09
Time:                        11:56:58   Log-Likelihood:                -409.35
No. Observations:                  86   AIC:                             822.7
Df Residuals:                      84   BIC:                             827.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        133.7714      8.470     15.794      0.000     116.928     150.614
Pop1831       -0.1395      0.021     -6.696      0.000      -0.181      -0.098
==============================================================================
Omnibus:                       25.716   Durbin-Watson:                   1.734
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               53.099
Skew:                           1.080   Prob(JB):                     2.95e-12
Kurtosis:                       6.187   Cond. No.                     1.12e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.12e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Text(0.5, 1.0, 'Regression')

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(18,13))
fig = sm.graphics.influence_plot(model_fit2, ax=ax, criterion="cooks")
plt.axhline(y=0, color='green', linestyle='-', lw=0.6)
plt.axhline(y=2, color='r', linestyle='--', lw=0.6)
plt.axhline(y=-2, color='r', linestyle='--', lw=0.6)
plt.axvline(x= 0.05, color = 'r', linestyle='--', lw=0.6)
plt.axvline(x= 0.014, color = 'r', linestyle='--', lw=0.6)

Text(0.5, 1.0, 'Regression')

Model w zakresie R^2 nieznacznie się poprawił.

X = df2['Pop1831']
y = df2['Crime']

## wykres--------------------------------------###
plt.figure()
plt.scatter(df2.Pop1831, df2.Crime, c = 'r')
plt.plot(df2.Pop1831, model_fit.params[0] + model_fit.params[1] * df2.Pop1831, c = 'r')
plt.scatter(df3.Pop1831, df3.Crime, c = 'green')
plt.plot(df3.Pop1831, model_fit2.params[0] + model_fit2.params[1] * df3.Pop1831, c = 'green')
plt.xlabel('Local population')
plt.ylabel('Crime')
plt.title("Regression")

Text(0.5, 1.0, 'Regression')

Na wykresie

czerwona linia i czerwone punkty symbolizują zbiór danych przed eliminacją obserwacji odstających.
punkty i linia zielona pokazuje zbiór danych i linie regresji po eliminacji odstających punktów.

Source of data: http://faculty.webster.edu/woolflm/8canswer.html

Model ANOVA two-way (two-factor)

W tej analizie robimy porównie dwóch czyników w tych samych populacjach.

Analiza ANOVA jest dla H01 i H02 dwiema ANOVA jednoczynnikowymi jednak przy H03 jest już połączeniem tych dwóch ANOVA one-way czyli ANOVA two-way.

Przeprowadzono badania, których celem było określenie poziomu radości życia. Postawiono następujące hipotezy zerowe:

h0a – średnia radość życia jest taka sama dla kobiet i mężczyzn
H0b – średnia radość życia jest taka sama wśród trzech grup: 'Young Adult’, 'Middle Adult’, 'Older Adult’.
H0c – nie istnieje interakcji pomiędzy grupami: 'Young Adult’, 'Middle Adult’, 'Older Adult’ a płcią badanych osób mającej wpływ na poziom radość życia.

Kodujemy bazwy grup:

’Young Adult’ :1
’Middle Adult’:2
’Older Adult’ :3

import pandas as pd

cc =['Male', 'Male', 'Male', 'Male','Male','Female','Female','Female','Female','Female']
aaa	 = [4,2,3,4,2,7,4,3,6,5]
bbb = [7,5,7,5,6,8,10,7,7,8]
ccc = [10,7,9,8,11,10,9,12,11,13]
df = pd.DataFrame({'Group': cc, 1:aaa, 2:bbb, 3:ccc })
df

Badaliśy już ten zbiór danych pod kontem możliwości wykorzystania go w analizie ANOVA.

Próby zostały pobrane niezależnie od siebie z każdej z r populacji
W każdej z r badanych populacji rozkłąd jest normalny o tej samej wariancji. Średnie w r populacjach mogą być równę, lecz nie muszą*

Analiza ANOVA tej próby jest opisana w linku poniżej:
http://sigmaquality.pl/python/praktyczny-przyklad-wykorzystania-anova-two-way-w-pythonie-pl080120201039/

Przekształacamy dataframe do postaci zgodniej ze statmodels.

df_melt = pd.melt(df.reset_index(),  id_vars=['Group'], value_vars=[1, 2, 3])
df_melt.columns = ['Sex','Age','value']
df_melt.sample(5)

df_melt['Age'] = df_melt['Age'].astype('int64')
df_melt.dtypes

Sex      object
Age       int64
value     int64
dtype: object

Teraz tworzymy dwa modele regresji wielorakiej OLS.

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

Reg1 = ols(formula = "value ~ Age + Sex", data = df_melt)
Fit1 = Reg1.fit()

Reg2 = ols(formula = "value ~ Age*Sex", data = df_melt)
Fit2 = Reg2.fit()

print(df_melt['value'].dtype)
print(df_melt['Age'].dtype)
print(df_melt['Sex'].dtype)

int64
int64
object

from statsmodels.graphics.factorplots import interaction_plot


fig = interaction_plot(df_melt['Age'], df_melt['Sex'],df_melt['value'],
             colors=['black','darkgreen'], markers=['D','x'], linestyles=["-", "--"], ylabel='radość życia', xlabel='Grupa wiekowa')
fig = interaction_plot(df_melt['Age'],df_melt['Sex'], Fit1.fittedvalues,
            colors=['black','darkgreen'], markers=['D','x'],linestyles=["-", "--"], ylabel='radość życia - OLS', xlabel='Grupa wiekowa')
fig = interaction_plot(df_melt['Age'],df_melt['Sex'], Fit2.fittedvalues,
             colors=['black','darkgreen'], markers=['D','x'], linestyles=["-", "--"], ylabel='radość życia - OLS interakcja', xlabel='Grupa wiekowa')

import matplotlib.pyplot as plt
plt.show()

Interpretacja wykresu interakcji

Wykres interakcji pokazuje w jaki sposób związek między jednym czynnikiem kategorycznym (płeć) a ciągłą reakcją zależy od wartości drugiego czynnika kategorycznego (grupa wiekowa). Ten wykres wyświetla średnie dla poziomów jednego czynnika na osi y oraz osobną linię dla każdego czynnika.

Równoległe linie oznaczają brak interakcji.
Linie nierównoległe – występuje interakcja.

Im bardziej nierównoległe są linie, tym większa jest siła interakcji.
Chociaż można użyć tego wykresu do wyświetlenia efektów, należy wykonać odpowiedni test ANOVA i ocenić istotność statystyczną efektów. Jeśli efekty interakcji są znaczące, nie można interpretować głównych efektów bez uwzględnienia efektów interakcji.

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	17.3567	5.772	3.007	0.003	5.875	28.838
Commerce	0.3788	0.095	4.001	0.000	0.190	0.567
Prostitutes	-0.0099	0.004	-2.204	0.030	-0.019	-0.001
Donation_clergy	0.2605	0.092	2.846	0.006	0.078	0.443

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	17.3567	5.772	3.007	0.003	5.875	28.838
Commerce	0.3788	0.095	4.001	0.000	0.190	0.567
Prostitutes	-0.0099	0.004	-2.204	0.030	-0.019	-0.001
Donation_clergy	0.2605	0.092	2.846	0.006	0.078	0.443

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	32.2955	7.206	4.482	0.000	17.956	46.635
Commerce	0.2385	0.100	2.379	0.020	0.039	0.438
Prostitutes	-0.0808	0.023	-3.497	0.001	-0.127	-0.035
Donation_clergy	0.1857	0.090	2.057	0.043	0.006	0.365

	Unnamed: 0	Region	Happiness Rank	Happiness Score	Economy (GDP per Capita)	Family	Health (Life Expectancy)	Freedom	Trust (Government Corruption)	Generosity	Dystopia Residual	Year
Country
Afghanistan	0	Southern Asia	153.0	3.575	0.31982	0.30285	0.30335	0.23414	0.09719	0.36510	1.95210	2015.0
Albania	1	Central and Eastern Europe	95.0	4.959	0.87867	0.80434	0.81325	0.35733	0.06413	0.14272	1.89894	2015.0
Algeria	2	Middle East and Northern Africa	68.0	5.605	0.93929	1.07772	0.61766	0.28579	0.17383	0.07822	2.43209	2015.0

	Crime	Crime_pers	Crime_prop	Pop1831
Department
Correze	95.685649	15262	12949	294.83
Hautes-Alpes	198.776143	17488	8174	129.10
Lozere	97.613110	7710	5990	140.35

STATMODELS - THE DATA SCIENCE LIBRARY

DIAGNOSTYKA REGRESJI LINIOWEJ część 4. Fit Plot

Guerry, Essay on the Moral Statistics of France (1833)

Teraz będę budował regresjię wieloczynnikową.

Przykład dobrej odpowiedzi modelu: zmienna 'Donation_clergy’¶

Przykład złej odpowiedzi modelu: zmienna 'Prostitutes’

DIAGNOSTYKA REGRESJI LINIOWEJ część 3. Partial Regression Plots

Guerry, Essay on the Moral Statistics of France (1833)

Teraz będę budował regresjię wieloczynnikową.¶

Partial Regression Plots (Duncan plot)¶

Jak wyglądał by wykres Partial Regression Plots gdyby analizować tylko poziom 'Prostitutes’ bez kontekstu na 'Commerce’?¶

DIAGNOSTYKA REGRESJI LINIOWEJ część 2. Wykres wpływu

In [1]:

1. Sprawdzamy czy dane są kompletne i jaki mają format

2. Sprawdzamy korelację zmiennych niezależnych ze zmienną zależną¶

3. Robimy model OLS¶

Odległość Cooka

INTERPRETACJA: Somalia, Mozambique oraz SomaliandRegion to bardzo wpływowe dane ponieważ bomble są wielki (czym większe ty większe odstawanie od danych wg Cooka). Na wykresie widać, że wartości rezydualne dla tych krajów bardzo odstają od średniej.

Wymienione kraje mają również wysoką dźwignię. Czyli można wysnuć wniosek że kraje te destabilizują model. To tak zwane wyjątki potwierdzające regułę.

Obserwacja z dużej odległości nie będzie miała tak dużego wpływu, jeśli jej dźwignia będzie niska. To połączenie dźwigni obserwacji i odległości determinuje jej wpływ.

Leverage

Guerry, Essay on the Moral Statistics of France (1833)

Tworzymy model analizujący ilość czynów kriminalnych (Crime_pers + Crime_prop) na podstawie poziomu wykształcenia (Literacy) oraz liczbą księży katolickich w czynnej służbie (Clergy) oraz poziomu prostytucji (Prostitutes).

1. Sprawdzamy kompletność i format danych

2. Tworzymy połączoną zmienną wynikową Crime = Crime_pers + Crime_prop jako wskaźnik skorygowany wielkością lokalnej populacji

3. Analizujemy poziom korelacji zmiennych

4. Budujemy model regresji liniowej

Interpretacja: na wykresach widać, że kilka miast mocno destabilizuje model OLS (wysoka dźwignia i wysoka odległość Cooka).

5. Usuwamy trzy obserwacje destabilizujące proces: 'Nord’, 'Creuse’, 'Hautes-Alpes’, 'Corse’, 'Seine’, 'Indre’, 'Ardennes’

Graficzny przykład ANOVA two-way – interaction_plot

Model ANOVA two-way (two-factor)

Interpretacja wykresu interakcji

	dept	Region	Crime_pers	Crime_prop	Literacy	Donations	Infants	Suicides	MainCity	Wealth	…	Crime_parents	Infanticide	Donation_clergy	Lottery	Desertion	Instruction	Prostitutes	Distance	Area	Pop1831
Department
Ain	1	E	28870	15890	37	5098	33120	35039	2:Med	73	…	71	60	69	41	55	46	13	218.372	5762	346.03
Aisne	2	N	26226	5521	51	8901	14572	12831	2:Med	22	…	4	82	36	38	82	24	327	65.945	7369	513.00
Allier	3	C	26747	7925	13	10973	17044	114121	2:Med	61	…	46	42	76	66	16	85	34	161.927	7340	298.26
Basses-Alpes	4	E	12935	7289	46	2733	23018	14238	1:Sm	76	…	70	12	37	80	32	29	2	351.399	6925	155.90

Dep. Variable:	Wealth	R-squared:	0.335
Model:	OLS	Adj. R-squared:	0.311
Method:	Least Squares	F-statistic:	13.78
Date:	Fri, 17 Jan 2020	Prob (F-statistic):	2.33e-07
Time:	14:39:43	Log-Likelihood:	-380.69
No. Observations:	86	AIC:	769.4
Df Residuals:	82	BIC:	779.2
Df Model:	3
Covariance Type:	nonrobust

Omnibus:	2.237	Durbin-Watson:	1.588
Prob(Omnibus):	0.327	Jarque-Bera (JB):	1.776
Skew:	0.194	Prob(JB):	0.412
Kurtosis:	2.412	Cond. No.	1.39e+03

Omnibus:	0.823	Durbin-Watson:	1.630
Prob(Omnibus):	0.663	Jarque-Bera (JB):	0.901
Skew:	-0.131	Prob(JB):	0.637
Kurtosis:	2.566	Cond. No.	461.

	Group	1	2	3
0	Male	4	7	10
1	Male	2	5	7
2	Male	3	7	9
3	Male	4	5	8
4	Male	2	6	11
5	Female	7	8	10
6	Female	4	10	9
7	Female	3	7	12
8	Female	6	7	11
9	Female	5	8	13