Example of classification performed with logistic regression. Eliminating multicollinearity of independent variables with VIF (Przykład klasyfikacji wykonanej za pomocą regresji logistycznej. Eliminacja współliniowości zmiennych niezależnych za pomocą VIF)

PL140120202018

The theoretical considerations on the significance (and harmfulness) of multicollinearity among independent variables were explained in the previous part.

The problem of multicollinearity among independent variables in logistic regression. An example of using the VIF test.

This time we will analyze the problem of urban air pollution.
The data source can be found here.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

First, we check the completeness of the data and its format.

In [2]:
del df['Unnamed: 0']
df.isnull().sum()
Out[2]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
In [3]:
df.dtypes
Out[3]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

The data are complete and in a format suitable for further analysis.

Task

Our task is to split the pollution measure PT08.S5(O3) into two classes. Then we build a logistic classification model based on the remaining variables, i.e. the other pollution readings.

Correlation of the independent variables with the dependent variable.

In [4]:
CORREL = df.corr()

import seaborn as sns
sns.heatmap(CORREL, annot=True, cbar=False, cmap="coolwarm")
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x243da06fb38>

The matrix above shows that multicollinearity is present, which will negatively affect the quality of the classification. For now we ignore it and continue with the task. Below I analyze the correlation of the dependent variable 'PT08.S5(O3)' with the independent variables. Most of them show a high level of correlation with it, which is a positive sign.

In [5]:
CORREL2 = df.corr().sort_values('PT08.S5(O3)')
CORREL2['PT08.S5(O3)'].plot(kind='bar')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x243da5f1400>
In [6]:
df['Categores_PT08.S5(O3)'] = pd.qcut(df['PT08.S5(O3)'],2)
df['Categores_PT08.S5(O3)'].value_counts().to_frame()
Out[6]:
Categores_PT08.S5(O3)
(220.999, 964.0] 4683
(964.0, 2523.0] 4674

The examined variable PT08.S5(O3) has been split into two equal-sized bins. Whether this split is chemically justified is beside the point; what matters is that we can now build a classification model based on logistic regression.
Splitting the data into equal classes keeps us away from the negative phenomenon of class imbalance and the need for oversampling. Today, however, the main goal is not that, but practicing how to deal with multicollinearity.

We will build the classification model as if the multicollinearity problem did not concern us.

We select the independent variables:
In [7]:
KOT = df[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 
       'T', 'RH', 'AH','Categores_PT08.S5(O3)']]
KOT.columns = ['CO_GT', 'PT08_S1_CO', 'C6H6_GT', 'PT08_S2_NMHC',
       'NOx_GT', 'PT08_S3_NOx', 'NO2_GT', 'PT08_S4_NO2', 
       'T', 'RH', 'AH','Categores_PT08_S5_O3']

We convert the dependent variable Categores_PT08.S5(O3) to a string (categorical) type.

In [8]:
KOT['Categores_PT08_S5_O3'] = KOT['Categores_PT08_S5_O3'].astype(str)
KOT['Categores_PT08_S5_O3'].dtypes
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
Out[8]:
dtype('O')
In [9]:
from sklearn.model_selection import train_test_split 


y = KOT['Categores_PT08_S5_O3']
X = KOT.drop('Categores_PT08_S5_O3', axis=1)


Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)
In [10]:
KOT.dtypes
Out[10]:
CO_GT                   float64
PT08_S1_CO              float64
C6H6_GT                 float64
PT08_S2_NMHC            float64
NOx_GT                  float64
PT08_S3_NOx             float64
NO2_GT                  float64
PT08_S4_NO2             float64
T                       float64
RH                      float64
AH                      float64
Categores_PT08_S5_O3     object
dtype: object
In [11]:
print('Training set X: ', Xtrain.shape)
print('Test set X:     ', Xtest.shape)
print('Training set y: ', ytrain.shape)
print('Test set y:     ', ytest.shape)
Training set X:  (6269, 11)
Test set X:      (3088, 11)
Training set y:  (6269,)
Test set y:      (3088,)
In [12]:
import numpy as np
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

LR_Grid.fit(Xtrain, ytrain)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[12]:
GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

Evaluation of the logistic regression model without reducing multicollinearity

In [13]:
ypred = LR_Grid.predict(Xtest)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix
Out[13]:
array([[1414,  131],
       [ 176, 1367]], dtype=int64)
In [14]:
print(classification_report(ytest, ypred))
                  precision    recall  f1-score   support

(220.999, 964.0]       0.89      0.92      0.90      1545
 (964.0, 2523.0]       0.91      0.89      0.90      1543

        accuracy                           0.90      3088
       macro avg       0.90      0.90      0.90      3088
    weighted avg       0.90      0.90      0.90      3088

The model shows excellent predictive performance in classifying the pollution level. This does not mean, however, that everything is fine: the explanatory variables exhibit high multicollinearity. This is an unfavorable phenomenon, which we will now eliminate.

Elimination of multicollinearity

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) is a measure of multicollinearity among the predictor variables in a multiple regression. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis and provides an index that measures how much the variance (the square of the standard error) of an estimated regression coefficient is inflated because of collinearity. For the i-th predictor, VIF_i = 1 / (1 - R_i^2), where R_i^2 is the coefficient of determination obtained by regressing that predictor on all the remaining predictors.

Steps for applying VIF

  1. Run a multiple regression.
  2. Calculate the VIF factors.
  3. Inspect the factor for each predictor variable; if the VIF is between 5 and 10, multicollinearity is likely present and you should consider dropping that variable.

In other words, before starting work on the logistic regression model we should fit a multiple (linear) regression model and use VIF to select the variables that go into the logistic regression model.

As mentioned here: https://www.ibm.com/support/pages/multicollinearity-diagnostics-logistic-regression-nomreg-or-plum

Regression procedures for categorical dependent variables do not have collinearity diagnostics. You can, however, use the linear Regression procedure for this purpose. The collinearity statistics in regression concern the relationships among the predictors, ignoring the dependent variable. So you can run REGRESSION with the same list of predictors and the same dependent variable that you want to use in LOGISTIC REGRESSION (for example) and request the collinearity diagnostics. Then run the logistic regression to obtain the proper coefficients, predicted probabilities, etc., after making whatever decisions (dropping predictors, etc.) follow from the collinearity analysis.
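
Before reproducing the article's manual selection below, here is a small sketch (my own helper, not part of the original post) of the iterative variant of step 3: compute the VIFs, drop the predictor with the highest value, and repeat until every VIF falls below a chosen threshold.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Return a copy of X with predictors removed until all VIFs are <= threshold."""
    X = X.copy()
    while True:
        design = sm.add_constant(X)          # add the intercept column, as the patsy design matrix does
        vifs = pd.Series(
            [variance_inflation_factor(design.values, i) for i in range(design.shape[1])],
            index=design.columns,
        ).drop('const')                      # the intercept's VIF is not informative
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            return X
        X = X.drop(columns=worst)            # drop the most collinear predictor and repeat

# hypothetical usage on the numeric predictors built above:
# X_reduced = drop_high_vif(KOT.drop(columns='Categores_PT08_S5_O3'))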

In [16]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf

lm = smf.ols(formula = 'Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+C6H6_GT+PT08_S2_NMHC+NOx_GT+PT08_S3_NOx+NO2_GT+PT08_S4_NO2+T+RH+AH', data = KOT).fit()
y, X = dmatrices('Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+C6H6_GT+PT08_S2_NMHC+NOx_GT+PT08_S3_NOx+NO2_GT+PT08_S4_NO2+T+RH+AH', data = KOT, return_type = "dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
[648.230997747899, 11.43831319456862, 7.939972023928728, 46.44356640618427, 66.47134191824114, 5.220873142678388, 5.43267527445646, 3.714154563031008, 14.52279194616348, 14.359778937836424, 8.176237173633366, 10.619321015188527]
In [17]:
vif  =  np.round(vif, decimals=2) 
vif = list(map(float, vif))
name = list(X)

s1=pd.Series(name,name='name')
s2=pd.Series( vif,name='vif')

RFE_list = pd.concat([s1,s2], axis=1)

RFE_list
Out[17]:
name vif
0 Intercept 648.23
1 CO_GT 11.44
2 PT08_S1_CO 7.94
3 C6H6_GT 46.44
4 PT08_S2_NMHC 66.47
5 NOx_GT 5.22
6 PT08_S3_NOx 5.43
7 NO2_GT 3.71
8 PT08_S4_NO2 14.52
9 T 14.36
10 RH 8.18
11 AH 10.62

Interpretation: the result is a vector whose entries correspond to the variables in the order they appear in the model. The VIF guideline states that if the factor assigned to a variable is greater than 5, the variable is highly correlated with the other variables and should be eliminated from the model.
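
As a quick check, a one-line sketch (using the RFE_list frame built above) lists the predictors whose VIF exceeds the threshold of 5, skipping the intercept row:

high_vif = RFE_list[(RFE_list['vif'] > 5) & (RFE_list['name'] != 'Intercept')]
print(high_vif)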

The test showed that the variables with the highest VIF values should be removed from the model.

We compute the VIF table once more and check the variables after eliminating:

  • C6H6_GT
  • PT08_S2_NMHC
  • PT08_S4_NO2

We also eliminate the variables whose correlation with the dependent variable was very low and whose VIF was above 5:

  • T
  • RH
  • AH
In [18]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf

lm = smf.ols(formula = 'Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+NOx_GT+PT08_S3_NOx+NO2_GT', data = KOT).fit()
y, X = dmatrices('Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+NOx_GT+PT08_S3_NOx+NO2_GT', data = KOT, return_type = "dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
[200.1579775895654, 5.895905610386816, 5.807909737315139, 3.1029777937073924, 2.7926706710126528, 2.4675790616762594]
In [19]:
vif  =  np.round(vif, decimals=2) 
vif = list(map(float, vif))
name = list(X)

s1=pd.Series(name,name='name')
s2=pd.Series( vif,name='vif')

RFE_list = pd.concat([s1,s2], axis=1)

RFE_list
Out[19]:
name vif
0 Intercept 200.16
1 CO_GT 5.90
2 PT08_S1_CO 5.81
3 NOx_GT 3.10
4 PT08_S3_NOx 2.79
5 NO2_GT 2.47

Now, even though CO_GT and PT08_S1_CO still slightly exceed 5, their level of multicollinearity is acceptable. I build a new classification model on the reduced set of explanatory variables.

In [20]:
KOT2 = KOT[['Categores_PT08_S5_O3', 'CO_GT', 'PT08_S1_CO','NOx_GT', 'PT08_S3_NOx', 'NO2_GT']]

KOT2.sample(3)
Out[20]:
Categores_PT08_S5_O3 CO_GT PT08_S1_CO NOx_GT PT08_S3_NOx NO2_GT
790 (220.999, 964.0] 1.4 1035.0 96.0 1222.0 82.0
2484 (220.999, 964.0] 0.9 890.0 93.0 922.0 67.0
4306 (220.999, 964.0] 0.5 828.0 72.0 1281.0 48.0

We build the classification model after the VIF selection

In [22]:
#from sklearn.model_selection import train_test_split 


y = KOT2['Categores_PT08_S5_O3']
X = KOT2.drop('Categores_PT08_S5_O3', axis=1)


Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)
In [23]:
#import numpy as np
#from sklearn import model_selection
#from sklearn.pipeline import make_pipeline
#from sklearn.linear_model import LogisticRegression
#from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_VIF = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

LR_VIF.fit(Xtrain, ytrain)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[23]:
GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

Evaluation of the logistic regression model after reducing multicollinearity

In [24]:
ypred = LR_VIF.predict(Xtest)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix
Out[24]:
array([[1387,  158],
       [ 182, 1361]], dtype=int64)
In [26]:
print(classification_report(ytest, ypred))
                  precision    recall  f1-score   support

(220.999, 964.0]       0.88      0.90      0.89      1545
 (964.0, 2523.0]       0.90      0.88      0.89      1543

        accuracy                           0.89      3088
       macro avg       0.89      0.89      0.89      3088
    weighted avg       0.89      0.89      0.89      3088

The new model, built on five variables largely free of multicollinearity, is only slightly worse (on the order of 1-2 percentage points) than the original one.

Logistic regression model (bank cards operations)


Introduction

Logistic regression is a classification algorithm in machine learning. The model predicts the binary state of a dependent variable, and the predicted value (a probability) ranges from 0 to 1.
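
To make the mechanics concrete, here is a minimal sketch (with toy numbers of my own, not taken from the article) of how a fitted logistic model turns a linear score into a probability and then into a class label:

import numpy as np

def sigmoid(z):
    # the logistic function squashes any real score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -1.2]), 0.3       # hypothetical fitted weights and intercept
x = np.array([2.0, 1.5])                # a single observation

p = sigmoid(w @ x + b)                  # predicted probability of class 1
label = int(p >= 0.5)                   # default decision threshold of 0.5
print(p, label)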

Main assumptions of the logistic regression model

  1. The dependent variable has a binary form.
  2. The model contains only variables that have a significant influence on the result.
  3. There is no collinearity among the independent variables (no correlation among predictors).
  4. A logistic model needs a large number of observations.

Let's load the data and the required libraries. We will be working on a registry of bank card operations. In the 'Class' column the value 0 means that a transaction is free of fraud, while the value 1 indicates embezzlement. We assume that the main goal is to classify the fraudulent transactions correctly; unfortunately, we may sometimes flag a good transaction as fraudulent.

In [1]:
## Logistic Regression procedure for continuous explanatory variables

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, Normalizer, scale
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split, KFold, StratifiedShuffleSplit, GridSearchCV
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv('c:/1/creditcard.csv')
df.head(3)
Out[1]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0.0
1 0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0.0
2 1 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0.0

3 rows × 31 columns

In [2]:
df.columns
Out[2]:
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

Analysis of the balance of the dependent variable

A target variable is balanced when the values 1 and 0 occur with a similar frequency. Our target is unbalanced, because the subject of the investigation is a rare phenomenon (marked as 1): the value 1 appears only a few times per thousand transactions marked as 0. When the target is unbalanced, the model may simply ignore the minority class (marked as 1). Such a model can look very good on overall metrics while missing even all of the 1s.

Let's check the level of imbalance in the registry.

In [3]:
df.Class.value_counts(dropna = False)
Out[3]:
0.0    17400
1.0       80
NaN        1
Name: Class, dtype: int64
In [4]:
sns.countplot(x='Class',data=df, palette='GnBu_d')
plt.show()

The plot is not very readable, so we compute the percentage structure instead.

Analysis of the independent variables in the logistic regression model

Let's recall two important assumptions of logistic regression.

  1. The model should include only independent variables that have a significant influence on the result variable.
  2. The independent variables should not be correlated with each other.

In the first stage we check how the independent variables influence the result variable.

In [5]:
df.Class.value_counts(dropna = False, normalize=True)
Out[5]:
0.0    0.995366
1.0    0.004576
NaN    0.000057
Name: Class, dtype: float64

I check the average amount of fraudulent transactions versus regular transactions.

In [6]:
df.groupby('Class').Amount.mean() 
Out[6]:
Class
0.0    66.927028
1.0    98.082375
Name: Amount, dtype: float64
In [7]:
df.Amount.agg(['min','max','mean','std']).astype(int)
Out[7]:
min        0
max     7712
mean      67
std      188
Name: Amount, dtype: int32

As we can see, frauds, represented as 1 in the 'Class' column, make up only about 0.46% of all transactions.

In the first stage we check how the independent variables influence the result variable.

In [8]:
CORREL = df.corr().sort_values('Class')
CORREL['Class']
Out[8]:
V3       -0.524607
V14      -0.482336
V17      -0.474386
V7       -0.441136
V10      -0.393678
V16      -0.369733
V12      -0.349516
V1       -0.315801
V5       -0.297279
V18      -0.235021
V9       -0.204471
V6       -0.136283
V23      -0.044445
V24      -0.034291
V22      -0.026568
V13      -0.016051
V15      -0.006326
Amount    0.011165
V26       0.018321
V28       0.021741
Time      0.022051
V25       0.032994
V19       0.038577
V21       0.048056
V20       0.078527
V27       0.157495
V8        0.233020
V4        0.281417
V2        0.287273
V11       0.309185
Class     1.000000
Name: Class, dtype: float64

The correlation vector shows that some of the independent variables have little or no influence on the result.

Statistical analysis of independent variables

We check how classes 1 and 0 differ in the 'Amount' and 'Time' of transactions.

In [9]:
pd.pivot_table(df, index='Class', values = 'Amount', aggfunc= [np.mean, np.median, min, max, np.std] )
Out[9]:
mean median min max std
Amount Amount Amount Amount Amount
Class
0.0 66.927028 15.95 0.0 7712.43 187.89952
1.0 98.082375 1.00 0.0 1809.68 269.15147
In [10]:
pd.pivot_table(df, index='Class', values = 'Time', aggfunc= [np.mean, np.median, min, max, np.std])
Out[10]:
mean median min max std
Time Time Time Time Time
Class
0.0 13515.67069 11854.0 0 28753 9702.669336
1.0 16684.05000 17203.5 406 28726 8184.483331

We select the columns whose correlation with the dependent variable is above 0.4 or below -0.3.

In [11]:
kot = CORREL[(CORREL['Class']>0.4)|(CORREL['Class']<-0.3)][['Class']]
kot.index
Out[11]:
Index(['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class'], dtype='object')

We keep only the variables strongly correlated with the result variable.

In [12]:
CORREL = df[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class']].corr().sort_values('Class')
CORREL['Class']
Out[12]:
V3      -0.524607
V14     -0.482336
V17     -0.474386
V7      -0.441136
V10     -0.393678
V16     -0.369733
V12     -0.349516
V1      -0.315801
Class    1.000000
Name: Class, dtype: float64

We will rather not use 'Amount' in the model. For illustration, the standardization of this column is shown below (commented out).

In [13]:
#scaler = StandardScaler()
#df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1))

Let's check the correlations among the independent variables.

In [14]:
sns.heatmap (df.corr (), cmap="coolwarm")
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d6d2ce3c8>

Before we start composing the model, we need to put the data in order:

  1. We remove the columns we will not use.
  2. We remove records with missing (empty) values.
  3. We do not standardize the data.
  4. We check the statistical parameters of the data.
In [15]:
#df.drop(columns = 'Time', inplace = True)    
#del df['Amount']                             

df.isnull().sum()                             
df = df.dropna(how='any')                    

df.agg(['min','max','mean','std'])[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class']]  
Out[15]:
V3 V14 V17 V7 V10 V16 V12 V1 Class
min -30.558697 -19.214325 -18.587366 -26.548144 -14.166795 -12.227189 -17.769143 -29.876366 0.000000
max 4.101716 7.692209 9.253526 34.303177 12.701538 4.816252 3.774837 1.960497 1.000000
mean 0.780300 0.652533 0.324366 -0.148176 -0.256112 -0.024921 -1.254557 -0.251191 0.004577
std 1.758847 1.351563 1.260242 1.340792 1.247107 0.964966 1.583679 1.889182 0.067498
In [48]:
sns.heatmap (df[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1']].corr(), cmap="YlGnBu", annot=True, cbar=False)
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d6ee5ca58>

We do not find any significant mutual correlation between these independent variables. Before creating the model, we ought to remove all records with empty cells.

Creating the logistic regression model

We declare which variables are the independent (descriptive) ones and which is the dependent (result) variable, and we define the split into training and test sets.

In [16]:
feature_cols = ['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1']
X = df[feature_cols] 
y = df.Class

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .33, stratify = y, random_state = 148)

We configure the settings for the grid search.

In [17]:
Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

Explanation for the code:

Parameteres

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])
This is a typical setting for the grid: the regularization strength C is searched over six powers of ten.

warm_start

Using warm_start=True reuses the solution of the previous fit as the initialization of the next fit. Thanks to this, the model can reach convergence faster. The warm_start parameter is useful when fitting the same model repeatedly with different settings.
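
As an illustration (a toy sketch with synthetic data, not part of the original post), refitting the same estimator with a different C reuses the previous coefficients when warm_start=True:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, random_state=0)

clf = LogisticRegression(warm_start=True, solver='lbfgs', C=0.01, max_iter=200)
clf.fit(X_demo, y_demo)                 # first fit starts from scratch
clf.set_params(C=1.0)
clf.fit(X_demo, y_demo)                 # second fit starts from the previous coefficients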

scoring = 'roc_auc'

The ROC curve evaluates the classifier across all decision thresholds. The area under the ROC curve is one of the most popular metrics used by the grid search to compare classification performance.

n_jobs = 5

The number of tasks run in parallel.

cv = 2

The number of cross-validation folds.
Now we fit the model to the training data:

In [50]:
LR_Grid.fit(Xtrain, ytrain) 
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[50]:
GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=True),
       fit_params=None, iid='warn', n_jobs=5,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

We check which hyperparameters the grid search has chosen as the best.

In [19]:
print("The Best parameter:",LR_Grid.best_params_)
print("The Best estimator:",LR_Grid.best_estimator_)
The Best parameter: {'C': 1.0}
The Best estimator: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=True)

Evaluation of the logistic regression classification

We use a diagnostic block and put it into the code.

In [20]:
print("n------Training data---------------------------------------------------")
print("The RECALL Training data:      ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print("The PRECISION Training data:   ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print()
print("------Test data-------------------------------------------------------")

print("The RECALL Test data is:        ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=3))
print("The PRECISION Test data is:     ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=3))
print()
print("The Confusion Matrix Test data :--------------------------------------")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))

# PLOT
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
#plt.axvline(0.5, color = '#00C851', linestyle = '--')
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')
plt.show()
------Training data---------------------------------------------------
The RECALL Training data:       0.648
The PRECISION Training data:    0.686

------Test data-------------------------------------------------------
The RECALL Test data is:         0.615
The PRECISION Test data is:      0.667

The Confusion Matrix Test data :--------------------------------------
[[5735    8]
 [  10   16]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      5743
         1.0       0.67      0.62      0.64        26

   micro avg       1.00      1.00      1.00      5769
   macro avg       0.83      0.81      0.82      5769
weighted avg       1.00      1.00      1.00      5769

On the test data, the 'recall' is 0.62 and the 'precision' is 0.67.

Changing the threshold of the logistic regression


Lowering the decision threshold of the logistic regression increases 'recall' at the expense of the accuracy shown by the 'precision' ratio. Remember that the bank tracks every fraud and embezzlement made on its credit cards. The cost of a false accusation, when the model flags a legitimate transaction as fraud, is relatively small, and the bank is interested in finding fraud almost at any cost.

In the logistic regression model the default threshold is 0.5.
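
Before committing to a particular value, a short sketch (reusing LR_Grid, Xtest and ytest from above) can scan a few candidate thresholds and print the resulting precision/recall trade-off:

import numpy as np
from sklearn.metrics import precision_score, recall_score

proba = LR_Grid.predict_proba(Xtest)[:, 1]
for t in [0.5, 0.3, 0.1, 0.05]:
    pred = (proba >= t).astype(int)     # classify as fraud when P(fraud) >= t
    print(t,
          np.round(precision_score(ytest, pred), 3),
          np.round(recall_score(ytest, pred), 3))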

In [21]:
LR_Grid_ytest = LR_Grid.predict_proba(Xtest)[:, 1]
new_threshold = 0.1 
ytest_pred = (LR_Grid_ytest >= new_threshold).astype(int)
In [22]:
ytest_pred
Out[22]:
array([0, 0, 0, ..., 0, 0, 0])

We launch the diagnostic module.

In [23]:
print("n------Training data---------------------------------------------------")
print("RECALL Training data (new_threshold = 0.1):      ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print("PRECISION Training data (new_threshold = 0.1):   ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))

print("------Test data-------------------------------------------------------")

print("RECALL Test data (new_threshold = 0.1):        ", np.round(recall_score(ytest, ytest_pred), decimals=3))
print("PRECISION Test data (new_threshold = 0.1):     ", np.round(precision_score(ytest, ytest_pred), decimals=3))
print()
print("The Confusion Matrix Test data (new_threshold = 0.1):-----------------")
print(confusion_matrix(ytest, ytest_pred))
print("----------------------------------------------------------------------")
print(classification_report(ytest, ytest_pred))

# PLOT-------------------------------------------
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
plt.axvline(0.1, color = '#00C251', linestyle = '--', label = 'threshold = 0.1')
plt.axvline(0.5, color = 'grey', linestyle = '--', label = 'threshold = 0.5')
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')

plt.show()
------Training data---------------------------------------------------
RECALL Training data (default threshold = 0.5):       0.648
PRECISION Training data (default threshold = 0.5):    0.686
------Test data-------------------------------------------------------
RECALL Test data (new_threshold = 0.1):         0.731
PRECISION Test data (new_threshold = 0.1):      0.576

The Confusion Matrix Test data (new_threshold = 0.1):-----------------
[[5729   14]
 [   7   19]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      5743
         1.0       0.58      0.73      0.64        26

   micro avg       1.00      1.00      1.00      5769
   macro avg       0.79      0.86      0.82      5769
weighted avg       1.00      1.00      1.00      5769

The 'recall' gauge increased to 0.73, while the 'precision' ratio fell to 0.58.

Oversampling


Oversampling is a method of partially removing the effects of an unbalanced target variable. The method is applied to the training set used to fit the model. When the target is not balanced, the model tends to avoid the rare outcomes (1) in favour of the frequent outcomes (0); frankly speaking, all models tend to generalize reality. From the bank's point of view it is more important to catch embezzlement, even if the model flags some proper transactions as fraud, than to miss a few frauds in the registry. The model can be sensitized to the rare class by oversampling and by moving the probability threshold. Sensitizing the model to frauds lowers 'precision' in favour of the 'recall' ratio.

Oversampling by cloning

Random oversampling consists in supplementing the minority class with copies of its own observations. The copying can be repeated more than once (2x, 3x, 5x, 10x, et cetera).
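
For comparison, here is a minimal sketch of the same idea using sklearn.utils.resample (this is not the code used below, which clones the rows manually with pd.concat; it assumes Xtrain and ytrain as defined earlier):

import pandas as pd
from sklearn.utils import resample

minority = (ytrain == 1)
X_min_up, y_min_up = resample(
    Xtrain[minority], ytrain[minority],
    replace=True,                        # draw minority rows with repetition
    n_samples=int((ytrain == 0).sum()),  # grow the minority class to the majority size
    random_state=148,
)

Xtrain_bal = pd.concat([Xtrain[~minority], X_min_up]).reset_index(drop=True)
ytrain_bal = pd.concat([ytrain[~minority], y_min_up]).reset_index(drop=True)
print(ytrain_bal.value_counts())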

Undersampling by elimination

Random undersampling consists in removing samples from the majority class (class 0), with or without replacement. It is one of the earliest techniques for reducing imbalance in datasets. Undersampling can increase the variance of the classifier and may, in theory, throw away useful observations.
Source: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
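
A corresponding sketch of random undersampling (again assuming Xtrain and ytrain from above; the article itself does not undersample, so this is only an illustration):

import pandas as pd

n_pos = int((ytrain == 1).sum())
keep_idx = ytrain[ytrain == 0].sample(n=n_pos, random_state=148).index  # downsampled majority

Xtrain_under = pd.concat([Xtrain.loc[keep_idx], Xtrain[ytrain == 1]]).reset_index(drop=True)
ytrain_under = pd.concat([ytrain.loc[keep_idx], ytrain[ytrain == 1]]).reset_index(drop=True)
print(ytrain_under.value_counts())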

Procedure of oversampling

Let's recall how unbalanced our dataset is.

In [24]:
df.Class.value_counts(dropna = False, normalize=True)
Out[24]:
0.0    0.995423
1.0    0.004577
Name: Class, dtype: float64

The minority class makes up only about 0.46% of the observations.

In [25]:
print("ytrain = 0: ", sum(ytrain == 0))
print("ytrain = 1: ", sum(ytrain == 1))
ytrain = 0:  11657
ytrain = 1:  54
In [26]:
###### OVERSAMPLING  #######################################

OVS_gauge = sum(ytrain == 0) / sum(ytrain == 1)
OVS_gauge = np.round(OVS_gauge, decimals=0)
OVS_gauge = OVS_gauge.astype(int)
OVS_gauge
Out[26]:
216

In the training set there are 216 regular transactions per fraud. As we can see, the training set keeps the same class proportion as the whole dataset, thanks to the stratify=y parameter in the call that defines the split into training and test sets. Now we replicate the dependent variables marked as 1 another 216 times.

In [27]:
ytrain_pos_OVS = pd.concat([ytrain[ytrain==1]] * OVS_gauge, axis = 0) 
ytrain_pos_OVS.count()
Out[27]:
11664

This number is the count of fraud observations in the training set (there are 54 of them) multiplied by 216. Now we have to do the same for the independent variables X wherever the result y was 1.

In [28]:
#Xtrain.loc[ytrain==1, :]
In [29]:
Xtrain.count()
Out[29]:
V3     11711
V14    11711
V17    11711
V7     11711
V10    11711
V16    11711
V12    11711
V1     11711
dtype: int64
In [30]:
Xtrain.loc[ytrain==1, :].count()
Out[30]:
V3     54
V14    54
V17    54
V7     54
V10    54
V16    54
V12    54
V1     54
dtype: int64

These records with y=1 should be multiplied 216 times.

In [31]:
Xtrain_pos_OVS = pd.concat([Xtrain.loc[ytrain==1, :]] * OVS_gauge, axis = 0)

Now we add the new, additional observations to the training set.

In [32]:
# concat the repeated data with the original data together
ytrain_OVS = pd.concat([ytrain, ytrain_pos_OVS], axis = 0).reset_index(drop = True)
Xtrain_OVS = pd.concat([Xtrain, Xtrain_pos_OVS], axis = 0).reset_index(drop = True)

At the beginning of the study we had 11711 records in the training set.

In [33]:
Xtrain_pos_OVS.count()
Out[33]:
V3     11664
V14    11664
V17    11664
V7     11664
V10    11664
V16    11664
V12    11664
V1     11664
dtype: int64
In [34]:
ytrain_OVS.count()
Out[34]:
23375

Now we feed the oversampled dataset to the grid search and run it.

In [35]:
Parametry2 = {'C': np.power(10.0, np.arange(-3, 3))}
OVS_reg = LogisticRegression(warm_start = True, solver='lbfgs')
OVS_grid = GridSearchCV(OVS_reg, param_grid = Parametry2, scoring = 'roc_auc', n_jobs = 5, cv = 6)

OVS_grid.fit(Xtrain_OVS, ytrain_OVS)
Out[35]:
GridSearchCV(cv=6, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=True),
       fit_params=None, iid='warn', n_jobs=5,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

Now we use the diagnostic block.

In [36]:
print()
print("Recall Training data:     ", np.round(recall_score(ytrain_OVS, OVS_grid.predict(Xtrain_OVS)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVS, OVS_grid.predict(Xtrain_OVS)), decimals=4))

print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, OVS_grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, OVS_grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, OVS_grid.predict(Xtest)))


print("----------------------------------------------------------------------")
print(classification_report(ytest, OVS_grid.predict(Xtest)))

y_pred_proba = OVS_grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')
plt.show()
Recall Training data:      0.963
Precision Training data:   0.984
----------------------------------------------------------------------
Recall Test data:          0.9231
Precision Test data:       0.2105
----------------------------------------------------------------------
Confusion Matrix Test data
[[5653   90]
 [   2   24]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99      5743
         1.0       0.21      0.92      0.34        26

   micro avg       0.98      0.98      0.98      5769
   macro avg       0.61      0.95      0.67      5769
weighted avg       1.00      0.98      0.99      5769

The 'recall' gauge increased to 0.92, while the 'precision' ratio fell to 0.21.

Class_weight


Using class_weight is a method of improving the 'recall' ratio on sets that are unbalanced in the result variable. The increase in 'recall' comes at the expense of the 'precision' ratio.

As mentioned, in the training set there are 216 observations marked 0 for every observation marked 1. To limit this disproportion we increase the weight of the dependent variable marked 1 by a factor of 216.

In [37]:
Pw = sum(ytrain == 0) / sum(ytrain == 1)  # size to repeat y == 1
Pw = np.round(Pw, decimals=0)
Pw = Pw.astype(int)
Pw
Out[37]:
216

The weight parameter ('positive weight') is Pw=216.
To up-weight class 1, we pass the dictionary {0: 1, 1: 216} to the model.

In [38]:
Parameters = {'C': np.power(10.0, np.arange(-3, 3))}
LogReg = LogisticRegression(class_weight = {0 : 1, 1 : Pw}, warm_start = True, solver='lbfgs')

Tuning the model with the grid search.

In [39]:
LRV_Reg_grid = GridSearchCV(LogReg, param_grid = Parameters, scoring = 'roc_auc', n_jobs = 5, cv = 6)
LRV_Reg_grid.fit(Xtrain, ytrain)
Out[39]:
GridSearchCV(cv=6, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight={0: 1, 1: 216}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=True),
       fit_params=None, iid='warn', n_jobs=5,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

We check which hyperparameters were chosen.

In [40]:
print("The Best parameter:",LRV_Reg_grid.best_params_)
print("The Best estimator:",LRV_Reg_grid.best_estimator_)
The Best parameter: {'C': 0.001}
The Best estimator: LogisticRegression(C=0.001, class_weight={0: 1, 1: 216}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=True)

As usual, we turn on the diagnostic module.

In [41]:
print()
print("Recall Training data:     ", np.round(recall_score(ytrain, LRV_Reg_grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LRV_Reg_grid.predict(Xtrain)), decimals=4))

print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LRV_Reg_grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LRV_Reg_grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, OVS_grid.predict(Xtest)))
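# note: the confusion matrix above still uses OVS_grid from the oversampling section;
# for the class_weight model it would be confusion_matrix(ytest, LRV_Reg_grid.predict(Xtest))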


print("----------------------------------------------------------------------")
print(classification_report(ytest, LRV_Reg_grid.predict(Xtest)))

y_pred_proba = LRV_Reg_grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')
plt.show()
Recall Training data:      0.9444
Precision Training data:   0.2914
----------------------------------------------------------------------
Recall Test data:          0.8846
Precision Test data:       0.284
----------------------------------------------------------------------
Confusion Matrix Test data
[[5653   90]
 [   2   24]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      5743
         1.0       0.28      0.88      0.43        26

   micro avg       0.99      0.99      0.99      5769
   macro avg       0.64      0.94      0.71      5769
weighted avg       1.00      0.99      0.99      5769

Thanks to class_weight, the weight of the dependent variable marked 1 was increased.
The 'recall' gauge increased to 0.88, while the 'precision' ratio fell to 0.28.

A similar effect can be achieved with automatic balancing, using the parameter class_weight='balanced'.
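
A minimal sketch of that variant (assuming Xtrain and ytrain from above; the solver and max_iter values are my own choices): class_weight='balanced' sets each class weight to n_samples / (n_classes * n_samples_in_class), which here reproduces roughly the 1:216 weighting computed manually above.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

LR_bal = LogisticRegression(class_weight='balanced', solver='lbfgs', max_iter=1000)
LR_bal.fit(Xtrain, ytrain)
print(classification_report(ytest, LR_bal.predict(Xtest)))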
