Example of classification performed with logistic regression. Eliminating multicollinearity of independent variables with VIF (Przykład klasyfikacji wykonanej za pomocą regresji logistycznej. Eliminacja współliniowości zmiennych niezależnych za pomocą VIF)

PL140120202018

The theoretical considerations on the significance (and harmfulness) of multicollinearity among independent variables were explained in the previous part.

The problem of multicollinearity among independent variables in logistic regression. An example of using the VIF test.

This time we will analyze the problem of urban air pollution.
The data source can be found here.

In [1]:
import pandas as pd
df = pd.read_csv('c:/TF/AirQ_filled.csv')
df.head(3)
Out[1]:
Unnamed: 0 Date Time CO(GT) PT08.S1(CO) C6H6(GT) PT08.S2(NMHC) NOx(GT) PT08.S3(NOx) NO2(GT) PT08.S4(NO2) PT08.S5(O3) T RH AH
0 0 10/03/2004 18.00.00 2.6 1360.0 11.9 1046.0 166.0 1056.0 113.0 1692.0 1268.0 13.6 48.9 0.7578
1 1 10/03/2004 19.00.00 2.0 1292.0 9.4 955.0 103.0 1174.0 92.0 1559.0 972.0 13.3 47.7 0.7255
2 2 10/03/2004 20.00.00 2.2 1402.0 9.0 939.0 131.0 1140.0 114.0 1555.0 1074.0 11.9 54.0 0.7502

First, we check the completeness of the data and its format.

In [2]:
del df['Unnamed: 0']
df.isnull().sum()
Out[2]:
Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
dtype: int64
In [3]:
df.dtypes
Out[3]:
Date              object
Time              object
CO(GT)           float64
PT08.S1(CO)      float64
C6H6(GT)         float64
PT08.S2(NMHC)    float64
NOx(GT)          float64
PT08.S3(NOx)     float64
NO2(GT)          float64
PT08.S4(NO2)     float64
PT08.S5(O3)      float64
T                float64
RH               float64
AH               float64
dtype: object

The data are complete and in a format suitable for further analysis.

Task

Our task is to split the pollution measure PT08.S5(O3) into two classes. Then we build a logistic classification model based on the remaining variables, i.e. the other pollution readings.

Correlation of the independent variables with the dependent variable.

In [4]:
CORREL = df.corr()

import seaborn as sns
sns.heatmap(CORREL, annot=True, cbar=False, cmap="coolwarm")
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x243da06fb38>

The matrix above shows that multicollinearity is present, which will negatively affect the quality of the classification. For now we ignore it and continue with the task. Below I analyze the correlation of the dependent variable 'PT08.S5(O3)' with the independent variables. Most of them show a high level of correlation with it, which is a positive sign.

In [5]:
CORREL2 = df.corr().sort_values('PT08.S5(O3)')
CORREL2['PT08.S5(O3)'].plot(kind='bar')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x243da5f1400>
In [6]:
df['Categores_PT08.S5(O3)'] = pd.qcut(df['PT08.S5(O3)'],2)
df['Categores_PT08.S5(O3)'].value_counts().to_frame()
Out[6]:
Categores_PT08.S5(O3)
(220.999, 964.0] 4683
(964.0, 2523.0] 4674

The examined variable PT08.S5(O3) has been split into two equal-sized bins. Whether this split is chemically justified is beside the point; what matters is that we can now build a classification model based on logistic regression.
Splitting the data into equal classes keeps us away from the negative phenomenon of class imbalance and the need for oversampling. Today, however, the main goal is not that, but practicing how to deal with multicollinearity.

We will build the classification model as if the multicollinearity problem did not concern us.

We select the independent variables:
In [7]:
KOT = df[['CO(GT)', 'PT08.S1(CO)', 'C6H6(GT)', 'PT08.S2(NMHC)',
       'NOx(GT)', 'PT08.S3(NOx)', 'NO2(GT)', 'PT08.S4(NO2)', 
       'T', 'RH', 'AH','Categores_PT08.S5(O3)']]
KOT.columns = ['CO_GT', 'PT08_S1_CO', 'C6H6_GT', 'PT08_S2_NMHC',
       'NOx_GT', 'PT08_S3_NOx', 'NO2_GT', 'PT08_S4_NO2', 
       'T', 'RH', 'AH','Categores_PT08_S5_O3']

We convert the dependent variable Categores_PT08.S5(O3) to a string (categorical) type.

In [8]:
KOT['Categores_PT08_S5_O3'] = KOT['Categores_PT08_S5_O3'].astype(str)
KOT['Categores_PT08_S5_O3'].dtypes
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.
Out[8]:
dtype('O')
In [9]:
from sklearn.model_selection import train_test_split 


y = KOT['Categores_PT08_S5_O3']
X = KOT.drop('Categores_PT08_S5_O3', axis=1)


Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)
In [10]:
KOT.dtypes
Out[10]:
CO_GT                   float64
PT08_S1_CO              float64
C6H6_GT                 float64
PT08_S2_NMHC            float64
NOx_GT                  float64
PT08_S3_NOx             float64
NO2_GT                  float64
PT08_S4_NO2             float64
T                       float64
RH                      float64
AH                      float64
Categores_PT08_S5_O3     object
dtype: object
In [11]:
print('Training set X: ', Xtrain.shape)
print('Test set X:     ', Xtest.shape)
print('Training set y: ', ytrain.shape)
print('Test set y:     ', ytest.shape)
Training set X:  (6269, 11)
Test set X:      (3088, 11)
Training set y:  (6269,)
Test set y:      (3088,)
In [12]:
import numpy as np
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

LR_Grid.fit(Xtrain, ytrain)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[12]:
GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

Evaluation of the logistic regression model without reducing multicollinearity

In [13]:
ypred = LR_Grid.predict(Xtest)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix
Out[13]:
array([[1414,  131],
       [ 176, 1367]], dtype=int64)
In [14]:
print(classification_report(ytest, ypred))
                  precision    recall  f1-score   support

(220.999, 964.0]       0.89      0.92      0.90      1545
 (964.0, 2523.0]       0.91      0.89      0.90      1543

        accuracy                           0.90      3088
       macro avg       0.90      0.90      0.90      3088
    weighted avg       0.90      0.90      0.90      3088

The model shows excellent predictive performance in classifying the pollution level. This does not mean, however, that everything is fine: the explanatory variables exhibit high multicollinearity. This is an unfavorable phenomenon, which we will now eliminate.

Elimination of multicollinearity

Variance Inflation Factor (VIF)

The Variance Inflation Factor (VIF) is a measure of multicollinearity among the predictor variables in a multiple regression. It quantifies the severity of multicollinearity in an ordinary least squares regression analysis and provides an index that measures how much the variance (the square of the standard error) of an estimated regression coefficient is inflated because of collinearity. For the i-th predictor, VIF_i = 1 / (1 - R_i^2), where R_i^2 is the coefficient of determination obtained by regressing that predictor on all the remaining predictors.

Steps for applying VIF

  1. Run a multiple regression.
  2. Calculate the VIF factors.
  3. Inspect the factor for each predictor variable; if the VIF is between 5 and 10, multicollinearity is likely present and you should consider dropping that variable.

In other words, before starting work on the logistic regression model we should fit a multiple (linear) regression model and use VIF to select the variables that go into the logistic regression model.

As mentioned here: https://www.ibm.com/support/pages/multicollinearity-diagnostics-logistic-regression-nomreg-or-plum

Regression procedures for categorical dependent variables do not have collinearity diagnostics. You can, however, use the linear Regression procedure for this purpose. The collinearity statistics in regression concern the relationships among the predictors, ignoring the dependent variable. So you can run REGRESSION with the same list of predictors and the same dependent variable that you want to use in LOGISTIC REGRESSION (for example) and request the collinearity diagnostics. Then run the logistic regression to obtain the proper coefficients, predicted probabilities, etc., after making whatever decisions (dropping predictors, etc.) follow from the collinearity analysis.
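
Before reproducing the article's manual selection below, here is a small sketch (my own helper, not part of the original post) of the iterative variant of step 3: compute the VIFs, drop the predictor with the highest value, and repeat until every VIF falls below a chosen threshold.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0) -> pd.DataFrame:
    """Return a copy of X with predictors removed until all VIFs are <= threshold."""
    X = X.copy()
    while True:
        design = sm.add_constant(X)          # add the intercept column, as the patsy design matrix does
        vifs = pd.Series(
            [variance_inflation_factor(design.values, i) for i in range(design.shape[1])],
            index=design.columns,
        ).drop('const')                      # the intercept's VIF is not informative
        worst = vifs.idxmax()
        if vifs[worst] <= threshold:
            return X
        X = X.drop(columns=worst)            # drop the most collinear predictor and repeat

# hypothetical usage on the numeric predictors built above:
# X_reduced = drop_high_vif(KOT.drop(columns='Categores_PT08_S5_O3'))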

In [16]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf

lm = smf.ols(formula = 'Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+C6H6_GT+PT08_S2_NMHC+NOx_GT+PT08_S3_NOx+NO2_GT+PT08_S4_NO2+T+RH+AH', data = KOT).fit()
y, X = dmatrices('Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+C6H6_GT+PT08_S2_NMHC+NOx_GT+PT08_S3_NOx+NO2_GT+PT08_S4_NO2+T+RH+AH', data = KOT, return_type = "dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
[648.230997747899, 11.43831319456862, 7.939972023928728, 46.44356640618427, 66.47134191824114, 5.220873142678388, 5.43267527445646, 3.714154563031008, 14.52279194616348, 14.359778937836424, 8.176237173633366, 10.619321015188527]
In [17]:
vif  =  np.round(vif, decimals=2) 
vif = list(map(float, vif))
name = list(X)

s1=pd.Series(name,name='name')
s2=pd.Series( vif,name='vif')

RFE_list = pd.concat([s1,s2], axis=1)

RFE_list
Out[17]:
name vif
0 Intercept 648.23
1 CO_GT 11.44
2 PT08_S1_CO 7.94
3 C6H6_GT 46.44
4 PT08_S2_NMHC 66.47
5 NOx_GT 5.22
6 PT08_S3_NOx 5.43
7 NO2_GT 3.71
8 PT08_S4_NO2 14.52
9 T 14.36
10 RH 8.18
11 AH 10.62

Interpretation: the result is a vector whose entries correspond to the variables in the order they appear in the model. The VIF guideline states that if the factor assigned to a variable is greater than 5, the variable is highly correlated with the other variables and should be eliminated from the model.
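
As a quick check, a one-line sketch (using the RFE_list frame built above) lists the predictors whose VIF exceeds the threshold of 5, skipping the intercept row:

high_vif = RFE_list[(RFE_list['vif'] > 5) & (RFE_list['name'] != 'Intercept')]
print(high_vif)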

The test showed that the variables with the highest VIF values should be removed from the model.

We compute the VIF table once more and check the variables after eliminating:

  • C6H6_GT
  • PT08_S2_NMHC
  • PT08_S4_NO2

We also eliminate the variables whose correlation with the dependent variable was very low and whose VIF was above 5:

  • T
  • RH
  • AH
In [18]:
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.formula.api as smf

lm = smf.ols(formula = 'Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+NOx_GT+PT08_S3_NOx+NO2_GT', data = KOT).fit()
y, X = dmatrices('Categores_PT08_S5_O3 ~ CO_GT+PT08_S1_CO+NOx_GT+PT08_S3_NOx+NO2_GT', data = KOT, return_type = "dataframe")
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif)
[200.1579775895654, 5.895905610386816, 5.807909737315139, 3.1029777937073924, 2.7926706710126528, 2.4675790616762594]
In [19]:
vif  =  np.round(vif, decimals=2) 
vif = list(map(float, vif))
name = list(X)

s1=pd.Series(name,name='name')
s2=pd.Series( vif,name='vif')

RFE_list = pd.concat([s1,s2], axis=1)

RFE_list
Out[19]:
name vif
0 Intercept 200.16
1 CO_GT 5.90
2 PT08_S1_CO 5.81
3 NOx_GT 3.10
4 PT08_S3_NOx 2.79
5 NO2_GT 2.47

Now, even though CO_GT and PT08_S1_CO still slightly exceed 5, their level of multicollinearity is acceptable. I build a new classification model on the reduced set of explanatory variables.

In [20]:
KOT2 = KOT[['Categores_PT08_S5_O3', 'CO_GT', 'PT08_S1_CO','NOx_GT', 'PT08_S3_NOx', 'NO2_GT']]

KOT2.sample(3)
Out[20]:
Categores_PT08_S5_O3 CO_GT PT08_S1_CO NOx_GT PT08_S3_NOx NO2_GT
790 (220.999, 964.0] 1.4 1035.0 96.0 1222.0 82.0
2484 (220.999, 964.0] 0.9 890.0 93.0 922.0 67.0
4306 (220.999, 964.0] 0.5 828.0 72.0 1281.0 48.0

We build the classification model after the VIF selection

In [22]:
#from sklearn.model_selection import train_test_split 


y = KOT2['Categores_PT08_S5_O3']
X = KOT2.drop('Categores_PT08_S5_O3', axis=1)


Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)
In [23]:
#import numpy as np
#from sklearn import model_selection
#from sklearn.pipeline import make_pipeline
#from sklearn.linear_model import LogisticRegression
#from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_VIF = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

LR_VIF.fit(Xtrain, ytrain)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[23]:
GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

Evaluation of the logistic regression model after reducing multicollinearity

In [24]:
ypred = LR_VIF.predict(Xtest)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix
Out[24]:
array([[1387,  158],
       [ 182, 1361]], dtype=int64)
In [26]:
print(classification_report(ytest, ypred))
                  precision    recall  f1-score   support

(220.999, 964.0]       0.88      0.90      0.89      1545
 (964.0, 2523.0]       0.90      0.88      0.89      1543

        accuracy                           0.89      3088
       macro avg       0.89      0.89      0.89      3088
    weighted avg       0.89      0.89      0.89      3088

The new model, built on five variables largely free of multicollinearity, is only slightly worse (on the order of 1-2 percentage points) than the original one.

Logistic regression model (bank cards operations)


Introduction

Logistic regression is a classification algorithm in machine learning. The model predicts the binary state of a dependent variable, and the predicted value (a probability) ranges from 0 to 1.
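
To make the mechanics concrete, here is a minimal sketch (with toy numbers of my own, not taken from the article) of how a fitted logistic model turns a linear score into a probability and then into a class label:

import numpy as np

def sigmoid(z):
    # the logistic function squashes any real score into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w, b = np.array([0.8, -1.2]), 0.3       # hypothetical fitted weights and intercept
x = np.array([2.0, 1.5])                # a single observation

p = sigmoid(w @ x + b)                  # predicted probability of class 1
label = int(p >= 0.5)                   # default decision threshold of 0.5
print(p, label)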

Main assumptions of the logistic regression model

  1. The dependent variable has a binary form.
  2. The model contains only variables that have a significant influence on the result.
  3. There is no collinearity among the independent variables (no correlation among predictors).
  4. A logistic model needs a large number of observations.

Let's load the data and the required libraries. We will be working on a registry of bank card operations. In the 'Class' column the value 0 means that a transaction is free of fraud, while the value 1 indicates embezzlement. We assume that the main goal is to classify the fraudulent transactions correctly; unfortunately, we may sometimes flag a good transaction as fraudulent.

In [1]:
## Logistic Regression procedure for continuous explanatory variables

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler, Normalizer, scale
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split, KFold, StratifiedShuffleSplit, GridSearchCV
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv('c:/1/creditcard.csv')
df.head(3)
Out[1]:
Time V1 V2 V3 V4 V5 V6 V7 V8 V9 V21 V22 V23 V24 V25 V26 V27 V28 Amount Class
0 0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599 0.098698 0.363787 -0.018307 0.277838 -0.110474 0.066928 0.128539 -0.189115 0.133558 -0.021053 149.62 0.0
1 0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803 0.085102 -0.255425 -0.225775 -0.638672 0.101288 -0.339846 0.167170 0.125895 -0.008983 0.014724 2.69 0.0
2 1 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461 0.247676 -1.514654 0.247998 0.771679 0.909412 -0.689281 -0.327642 -0.139097 -0.055353 -0.059752 378.66 0.0

3 rows × 31 columns

In [2]:
df.columns
Out[2]:
Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')

Analysis of the balance of the dependent variable

A target variable is balanced when the values 1 and 0 occur with a similar frequency. Our target is unbalanced, because the subject of the investigation is a rare phenomenon (marked as 1): the value 1 appears only a few times per thousand transactions marked as 0. When the target is unbalanced, the model may simply ignore the minority class (marked as 1). Such a model can look very good on overall metrics while missing even all of the 1s.

Let's check the level of imbalance in the registry.

In [3]:
df.Class.value_counts(dropna = False)
Out[3]:
0.0    17400
1.0       80
NaN        1
Name: Class, dtype: int64
In [4]:
sns.countplot(x='Class',data=df, palette='GnBu_d')
plt.show()

The plot is not very readable, so we compute the percentage structure instead.

Analysis of the independent variables in the logistic regression model

Let's recall two important assumptions of logistic regression.

  1. The model should include only independent variables that have a significant influence on the result variable.
  2. The independent variables should not be correlated with each other.

In the first stage we check how the independent variables influence the result variable.

In [5]:
df.Class.value_counts(dropna = False, normalize=True)
Out[5]:
0.0    0.995366
1.0    0.004576
NaN    0.000057
Name: Class, dtype: float64

I check the average amount of fraudulent transactions versus regular transactions.

In [6]:
df.groupby('Class').Amount.mean() 
Out[6]:
Class
0.0    66.927028
1.0    98.082375
Name: Amount, dtype: float64
In [7]:
df.Amount.agg(['min','max','mean','std']).astype(int)
Out[7]:
min        0
max     7712
mean      67
std      188
Name: Amount, dtype: int32

As we can see, frauds, represented as 1 in the 'Class' column, make up only about 0.46% of all transactions.

In the first stage we check how the independent variables influence the result variable.

In [8]:
CORREL = df.corr().sort_values('Class')
CORREL['Class']
Out[8]:
V3       -0.524607
V14      -0.482336
V17      -0.474386
V7       -0.441136
V10      -0.393678
V16      -0.369733
V12      -0.349516
V1       -0.315801
V5       -0.297279
V18      -0.235021
V9       -0.204471
V6       -0.136283
V23      -0.044445
V24      -0.034291
V22      -0.026568
V13      -0.016051
V15      -0.006326
Amount    0.011165
V26       0.018321
V28       0.021741
Time      0.022051
V25       0.032994
V19       0.038577
V21       0.048056
V20       0.078527
V27       0.157495
V8        0.233020
V4        0.281417
V2        0.287273
V11       0.309185
Class     1.000000
Name: Class, dtype: float64

The correlation vector shows that some of the independent variables have little or no influence on the result.

Statistical analysis of independent variables

We check how classes 1 and 0 differ in the 'Amount' and 'Time' of transactions.

In [9]:
pd.pivot_table(df, index='Class', values = 'Amount', aggfunc= [np.mean, np.median, min, max, np.std] )
Out[9]:
mean median min max std
Amount Amount Amount Amount Amount
Class
0.0 66.927028 15.95 0.0 7712.43 187.89952
1.0 98.082375 1.00 0.0 1809.68 269.15147
In [10]:
pd.pivot_table(df, index='Class', values = 'Time', aggfunc= [np.mean, np.median, min, max, np.std])
Out[10]:
mean median min max std
Time Time Time Time Time
Class
0.0 13515.67069 11854.0 0 28753 9702.669336
1.0 16684.05000 17203.5 406 28726 8184.483331

We select the columns whose correlation with the dependent variable is above 0.4 or below -0.3.

In [11]:
kot = CORREL[(CORREL['Class']>0.4)|(CORREL['Class']<-0.3)][['Class']]
kot.index
Out[11]:
Index(['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class'], dtype='object')

We keep only the variables strongly correlated with the result variable.

In [12]:
CORREL = df[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class']].corr().sort_values('Class')
CORREL['Class']
Out[12]:
V3      -0.524607
V14     -0.482336
V17     -0.474386
V7      -0.441136
V10     -0.393678
V16     -0.369733
V12     -0.349516
V1      -0.315801
Class    1.000000
Name: Class, dtype: float64

We will rather not use 'Amount' in the model. For illustration, the standardization of this column is shown below (commented out).

In [13]:
#scaler = StandardScaler()
#df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1))

Let's check the correlations among the independent variables.

In [14]:
sns.heatmap (df.corr (), cmap="coolwarm")
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d6d2ce3c8>

Before we start composing the model, we need to put the data in order:

  1. We remove the columns we will not use.
  2. We remove records with missing (empty) values.
  3. We do not standardize the data.
  4. We check the statistical parameters of the data.
In [15]:
#df.drop(columns = 'Time', inplace = True)    
#del df['Amount']                             

df.isnull().sum()                             
df = df.dropna(how='any')                    

df.agg(['min','max','mean','std'])[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class']]  
Out[15]:
V3 V14 V17 V7 V10 V16 V12 V1 Class
min -30.558697 -19.214325 -18.587366 -26.548144 -14.166795 -12.227189 -17.769143 -29.876366 0.000000
max 4.101716 7.692209 9.253526 34.303177 12.701538 4.816252 3.774837 1.960497 1.000000
mean 0.780300 0.652533 0.324366 -0.148176 -0.256112 -0.024921 -1.254557 -0.251191 0.004577
std 1.758847 1.351563 1.260242 1.340792 1.247107 0.964966 1.583679 1.889182 0.067498
In [48]:
sns.heatmap (df[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1']].corr(), cmap="YlGnBu", annot=True, cbar=False)
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x23d6ee5ca58>

We do not find any significant mutual correlation between these independent variables. Before creating the model, we ought to remove all records with empty cells.

Creating the logistic regression model

We declare which variables are the independent (descriptive) ones and which is the dependent (result) variable, and we define the split into training and test sets.

In [16]:
feature_cols = ['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1']
X = df[feature_cols] 
y = df.Class

Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .33, stratify = y, random_state = 148)

We configure the settings for the grid search.

In [17]:
Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

Explanation for the code:

Parameteres

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])
This is a typical setting for the grid: the regularization strength C is searched over six powers of ten.

warm_start

Using warm_start=True reuses the solution of the previous fit as the initialization of the next fit. Thanks to this, the model can reach convergence faster. The warm_start parameter is useful when fitting the same model repeatedly with different settings.
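
As an illustration (a toy sketch with synthetic data, not part of the original post), refitting the same estimator with a different C reuses the previous coefficients when warm_start=True:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=500, random_state=0)

clf = LogisticRegression(warm_start=True, solver='lbfgs', C=0.01, max_iter=200)
clf.fit(X_demo, y_demo)                 # first fit starts from scratch
clf.set_params(C=1.0)
clf.fit(X_demo, y_demo)                 # second fit starts from the previous coefficients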

scoring = 'roc_auc'

The ROC curve evaluates the classifier across all decision thresholds. The area under the ROC curve is one of the most popular metrics used by the grid search to compare classification performance.

n_jobs = 5

The number of tasks run in parallel.

cv = 2

The number of cross-validation folds.
Now we fit the model to the training data:

In [50]:
LR_Grid.fit(Xtrain, ytrain) 
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py:433: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[50]:
GridSearchCV(cv=2, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=True),
       fit_params=None, iid='warn', n_jobs=5,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

We check which hyperparameters the grid search has chosen as the best.

In [19]:
print("The Best parameter:",LR_Grid.best_params_)
print("The Best estimator:",LR_Grid.best_estimator_)
The Best parameter: {'C': 1.0}
The Best estimator: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=True)

Evaluation of the logistic regression classification

We use a diagnostic block and put it into the code.

In [20]:
print("n------Training data---------------------------------------------------")
print("The RECALL Training data:      ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print("The PRECISION Training data:   ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print()
print("------Test data-------------------------------------------------------")

print("The RECALL Test data is:        ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=3))
print("The PRECISION Test data is:     ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=3))
print()
print("The Confusion Matrix Test data :--------------------------------------")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))

# PLOT
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
#plt.axvline(0.5, color = '#00C851', linestyle = '--')
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')
plt.show()
------Training data---------------------------------------------------
The RECALL Training data:       0.648
The PRECISION Training data:    0.686

------Test data-------------------------------------------------------
The RECALL Test data is:         0.615
The PRECISION Test data is:      0.667

The Confusion Matrix Test data :--------------------------------------
[[5735    8]
 [  10   16]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      5743
         1.0       0.67      0.62      0.64        26

   micro avg       1.00      1.00      1.00      5769
   macro avg       0.83      0.81      0.82      5769
weighted avg       1.00      1.00      1.00      5769

On the test data, the 'recall' is 0.62 and the 'precision' is 0.67.

Changing the threshold of the logistic regression


Lowering the decision threshold of the logistic regression increases 'recall' at the expense of the accuracy shown by the 'precision' ratio. Remember that the bank tracks every fraud and embezzlement made on its credit cards. The cost of a false accusation, when the model flags a legitimate transaction as fraud, is relatively small, and the bank is interested in finding fraud almost at any cost.

In the logistic regression model the default threshold is 0.5.
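
Before committing to a particular value, a short sketch (reusing LR_Grid, Xtest and ytest from above) can scan a few candidate thresholds and print the resulting precision/recall trade-off:

import numpy as np
from sklearn.metrics import precision_score, recall_score

proba = LR_Grid.predict_proba(Xtest)[:, 1]
for t in [0.5, 0.3, 0.1, 0.05]:
    pred = (proba >= t).astype(int)     # classify as fraud when P(fraud) >= t
    print(t,
          np.round(precision_score(ytest, pred), 3),
          np.round(recall_score(ytest, pred), 3))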

In [21]:
LR_Grid_ytest = LR_Grid.predict_proba(Xtest)[:, 1]
new_threshold = 0.1 
ytest_pred = (LR_Grid_ytest >= new_threshold).astype(int)
In [22]:
ytest_pred
Out[22]:
array([0, 0, 0, ..., 0, 0, 0])

We launch the diagnostic module.

In [23]:
print("n------Training data---------------------------------------------------")
print("RECALL Training data (new_threshold = 0.1):      ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print("PRECISION Training data (new_threshold = 0.1):   ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))

print("------Test data-------------------------------------------------------")

print("RECALL Test data (new_threshold = 0.1):        ", np.round(recall_score(ytest, ytest_pred), decimals=3))
print("PRECISION Test data (new_threshold = 0.1):     ", np.round(precision_score(ytest, ytest_pred), decimals=3))
print()
print("The Confusion Matrix Test data (new_threshold = 0.1):-----------------")
print(confusion_matrix(ytest, ytest_pred))
print("----------------------------------------------------------------------")
print(classification_report(ytest, ytest_pred))

# PLOT-------------------------------------------
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
plt.axvline(0.1, color = '#00C251', linestyle = '--', label = 'threshold = 0.1')
plt.axvline(0.5, color = 'grey', linestyle = '--', label = 'threshold = 0.5')
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')

plt.show()
------Training data---------------------------------------------------
RECALL Training data (default threshold = 0.5):       0.648
PRECISION Training data (default threshold = 0.5):    0.686
------Test data-------------------------------------------------------
RECALL Test data (new_threshold = 0.1):         0.731
PRECISION Test data (new_threshold = 0.1):      0.576

The Confusion Matrix Test data (new_threshold = 0.1):-----------------
[[5729   14]
 [   7   19]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      5743
         1.0       0.58      0.73      0.64        26

   micro avg       1.00      1.00      1.00      5769
   macro avg       0.79      0.86      0.82      5769
weighted avg       1.00      1.00      1.00      5769

The 'recall' gauge increased to 0.73, while the 'precision' ratio fell to 0.58.

Oversampling


Oversampling is a method of partially removing the effects of an unbalanced target variable. The method is applied to the training set used to fit the model. When the target is not balanced, the model tends to avoid the rare outcomes (1) in favour of the frequent outcomes (0); frankly speaking, all models tend to generalize reality. From the bank's point of view it is more important to catch embezzlement, even if the model flags some proper transactions as fraud, than to miss a few frauds in the registry. The model can be sensitized to the rare class by oversampling and by moving the probability threshold. Sensitizing the model to frauds lowers 'precision' in favour of the 'recall' ratio.

Oversampling by cloning

Random oversampling consists in supplementing the minority class with copies of its own observations. The copying can be repeated more than once (2x, 3x, 5x, 10x, et cetera).
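
For comparison, here is a minimal sketch of the same idea using sklearn.utils.resample (this is not the code used below, which clones the rows manually with pd.concat; it assumes Xtrain and ytrain as defined earlier):

import pandas as pd
from sklearn.utils import resample

minority = (ytrain == 1)
X_min_up, y_min_up = resample(
    Xtrain[minority], ytrain[minority],
    replace=True,                        # draw minority rows with repetition
    n_samples=int((ytrain == 0).sum()),  # grow the minority class to the majority size
    random_state=148,
)

Xtrain_bal = pd.concat([Xtrain[~minority], X_min_up]).reset_index(drop=True)
ytrain_bal = pd.concat([ytrain[~minority], y_min_up]).reset_index(drop=True)
print(ytrain_bal.value_counts())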

Undersampling by elimination

Random undersampling consists in removing samples from the majority class (class 0), with or without replacement. It is one of the earliest techniques for reducing imbalance in datasets. Undersampling can increase the variance of the classifier and may, in theory, throw away useful observations.
Source: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
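
A corresponding sketch of random undersampling (again assuming Xtrain and ytrain from above; the article itself does not undersample, so this is only an illustration):

import pandas as pd

n_pos = int((ytrain == 1).sum())
keep_idx = ytrain[ytrain == 0].sample(n=n_pos, random_state=148).index  # downsampled majority

Xtrain_under = pd.concat([Xtrain.loc[keep_idx], Xtrain[ytrain == 1]]).reset_index(drop=True)
ytrain_under = pd.concat([ytrain.loc[keep_idx], ytrain[ytrain == 1]]).reset_index(drop=True)
print(ytrain_under.value_counts())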

Procedure of oversampling

Let's recall how unbalanced our dataset is.

In [24]:
df.Class.value_counts(dropna = False, normalize=True)
Out[24]:
0.0    0.995423
1.0    0.004577
Name: Class, dtype: float64

The minority class makes up only about 0.46% of the observations.

In [25]:
print("ytrain = 0: ", sum(ytrain == 0))
print("ytrain = 1: ", sum(ytrain == 1))
ytrain = 0:  11657
ytrain = 1:  54
In [26]:
###### OVERSAMPLING  #######################################

OVS_gauge = sum(ytrain == 0) / sum(ytrain == 1)
OVS_gauge = np.round(OVS_gauge, decimals=0)
OVS_gauge = OVS_gauge.astype(int)
OVS_gauge
Out[26]:
216

In the training set there are 216 regular transactions per fraud. As we can see, the training set keeps the same class proportion as the whole dataset, thanks to the stratify=y parameter in the call that defines the split into training and test sets. Now we replicate the dependent variables marked as 1 another 216 times.

In [27]:
ytrain_pos_OVS = pd.concat([ytrain[ytrain==1]] * OVS_gauge, axis = 0) 
ytrain_pos_OVS.count()
Out[27]:
11664

This number is the count of fraud observations in the training set (there are 54 of them) multiplied by 216. Now we have to do the same for the independent variables X wherever the result y was 1.

In [28]:
#Xtrain.loc[ytrain==1, :]
In [29]:
Xtrain.count()
Out[29]:
V3     11711
V14    11711
V17    11711
V7     11711
V10    11711
V16    11711
V12    11711
V1     11711
dtype: int64
In [30]:
Xtrain.loc[ytrain==1, :].count()
Out[30]:
V3     54
V14    54
V17    54
V7     54
V10    54
V16    54
V12    54
V1     54
dtype: int64

These records with y=1 should be multiplied 216 times.

In [31]:
Xtrain_pos_OVS = pd.concat([Xtrain.loc[ytrain==1, :]] * OVS_gauge, axis = 0)

Now we add the new, additional observations to the training set.

In [32]:
# concat the repeated data with the original data together
ytrain_OVS = pd.concat([ytrain, ytrain_pos_OVS], axis = 0).reset_index(drop = True)
Xtrain_OVS = pd.concat([Xtrain, Xtrain_pos_OVS], axis = 0).reset_index(drop = True)

At the beginning of the study we had 11711 records in the training set.

In [33]:
Xtrain_pos_OVS.count()
Out[33]:
V3     11664
V14    11664
V17    11664
V7     11664
V10    11664
V16    11664
V12    11664
V1     11664
dtype: int64
In [34]:
ytrain_OVS.count()
Out[34]:
23375

Now we feed the oversampled dataset to the grid search and run it.

In [35]:
Parametry2 = {'C': np.power(10.0, np.arange(-3, 3))}
OVS_reg = LogisticRegression(warm_start = True, solver='lbfgs')
OVS_grid = GridSearchCV(OVS_reg, param_grid = Parametry2, scoring = 'roc_auc', n_jobs = 5, cv = 6)

OVS_grid.fit(Xtrain_OVS, ytrain_OVS)
Out[35]:
GridSearchCV(cv=6, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=True),
       fit_params=None, iid='warn', n_jobs=5,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

Now we use the diagnostic block.

In [36]:
print()
print("Recall Training data:     ", np.round(recall_score(ytrain_OVS, OVS_grid.predict(Xtrain_OVS)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVS, OVS_grid.predict(Xtrain_OVS)), decimals=4))

print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, OVS_grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, OVS_grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, OVS_grid.predict(Xtest)))


print("----------------------------------------------------------------------")
print(classification_report(ytest, OVS_grid.predict(Xtest)))

y_pred_proba = OVS_grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')
plt.show()
Recall Training data:      0.963
Precision Training data:   0.984
----------------------------------------------------------------------
Recall Test data:          0.9231
Precision Test data:       0.2105
----------------------------------------------------------------------
Confusion Matrix Test data
[[5653   90]
 [   2   24]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      0.98      0.99      5743
         1.0       0.21      0.92      0.34        26

   micro avg       0.98      0.98      0.98      5769
   macro avg       0.61      0.95      0.67      5769
weighted avg       1.00      0.98      0.99      5769

The 'recall' gauge increased to 0.92, while the 'precision' ratio fell to 0.21.

Class_weight


Using class_weight is a method of improving the 'recall' ratio on sets that are unbalanced in the result variable. The increase in 'recall' comes at the expense of the 'precision' ratio.

As mentioned, in the training set there are 216 observations marked 0 for every observation marked 1. To limit this disproportion we increase the weight of the dependent variable marked 1 by a factor of 216.

In [37]:
Pw = sum(ytrain == 0) / sum(ytrain == 1)  # size to repeat y == 1
Pw = np.round(Pw, decimals=0)
Pw = Pw.astype(int)
Pw
Out[37]:
216

The weight parameter ('positive weight') is Pw=216.
To up-weight class 1, we pass the dictionary {0: 1, 1: 216} to the model.

In [38]:
Parameters = {'C': np.power(10.0, np.arange(-3, 3))}
LogReg = LogisticRegression(class_weight = {0 : 1, 1 : Pw}, warm_start = True, solver='lbfgs')

Tuning the model with the grid search.

In [39]:
LRV_Reg_grid = GridSearchCV(LogReg, param_grid = Parameters, scoring = 'roc_auc', n_jobs = 5, cv = 6)
LRV_Reg_grid.fit(Xtrain, ytrain)
Out[39]:
GridSearchCV(cv=6, error_score='raise-deprecating',
       estimator=LogisticRegression(C=1.0, class_weight={0: 1, 1: 216}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=True),
       fit_params=None, iid='warn', n_jobs=5,
       param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring='roc_auc', verbose=0)

We check which hyperparameters were chosen.

In [40]:
print("The Best parameter:",LRV_Reg_grid.best_params_)
print("The Best estimator:",LRV_Reg_grid.best_estimator_)
The Best parameter: {'C': 0.001}
The Best estimator: LogisticRegression(C=0.001, class_weight={0: 1, 1: 216}, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=True)

As usual, we turn on the diagnostic module.

In [41]:
print()
print("Recall Training data:     ", np.round(recall_score(ytrain, LRV_Reg_grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LRV_Reg_grid.predict(Xtrain)), decimals=4))

print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LRV_Reg_grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LRV_Reg_grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, OVS_grid.predict(Xtest)))
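# note: the confusion matrix above still uses OVS_grid from the oversampling section;
# for the class_weight model it would be confusion_matrix(ytest, LRV_Reg_grid.predict(Xtest))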


print("----------------------------------------------------------------------")
print(classification_report(ytest, LRV_Reg_grid.predict(Xtest)))

y_pred_proba = LRV_Reg_grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = {:.3f})'.format(auc))
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.legend(loc=4)
plt.plot([0, 1], [0, 1],'r--')
plt.show()
Recall Training data:      0.9444
Precision Training data:   0.2914
----------------------------------------------------------------------
Recall Test data:          0.8846
Precision Test data:       0.284
----------------------------------------------------------------------
Confusion Matrix Test data
[[5653   90]
 [   2   24]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       1.00      0.99      0.99      5743
         1.0       0.28      0.88      0.43        26

   micro avg       0.99      0.99      0.99      5769
   macro avg       0.64      0.94      0.71      5769
weighted avg       1.00      0.99      0.99      5769

Thanks to class_weight, the weight of the dependent variable marked 1 was increased.
The 'recall' gauge increased to 0.88, while the 'precision' ratio fell to 0.28.

A similar effect can be achieved with automatic balancing, using the parameter class_weight='balanced'.
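
A minimal sketch of that variant (assuming Xtrain and ytrain from above; the solver and max_iter values are my own choices): class_weight='balanced' sets each class weight to n_samples / (n_classes * n_samples_in_class), which here reproduces roughly the 1:216 weighting computed manually above.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

LR_bal = LogisticRegression(class_weight='balanced', solver='lbfgs', max_iter=1000)
LR_bal.fit(Xtrain, ytrain)
print(classification_report(ytest, LR_bal.predict(Xtest)))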
