Procedura oversampling dla Logistic Regression

źródło: http://sigmaquality.pl/machine-learning/model-regresji-logistycznej-czesc-2-oversampling/

źródło danych: https://archive.ics.uci.edu/ml/machine-learning-databases/00222/

import numpy as np
import pandas as pd
#import xgboost as xgb
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
import matplotlib.pylab as plt

from pylab import plot, show, subplot, specgram, imshow, savefig
from sklearn import preprocessing
#from sklearn import cross_validation, metrics
from sklearn.preprocessing import Normalizer
#from sklearn.cross_validation import cross_val_score
from sklearn.preprocessing import Imputer

import matplotlib.pyplot as plote



plt.style.use('ggplot')

df = pd.read_csv('c:/1/bank.csv')
df.head()

MODEL BEZ ZBILANSOWANIA ZBIORÓW

Skalowanie standardowe tylko dla wartości dyskretnychWybieram kolumny tekstowe, dyskretne, do głębszej analizy. Lepsze było to wybieranie dyskretne i ciągłe.

encoding_list = ['job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week','poutcome']

df[encoding_list] = df[encoding_list].apply(LabelEncoder().fit_transform)

df[encoding_list].head()

Tworzymy zestaw treningowy i zestaw testowy, budujemy model

y = df['y']
X = df.drop('y', axis=1)

Złoty podział zioru na testowy i treningowy

from sklearn.model_selection import train_test_split 
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)

Wielkości zbiorów:

print ('Zbiór X treningowy: ',Xtrain.shape)
print ('Zbiór X testowy:    ', Xtest.shape)
print ('Zbiór y treningowy: ', ytrain.shape)
print ('Zbiór y testowy:    ', ytest.shape)

Zbiór X treningowy:  (27595, 22)
Zbiór X testowy:     (13593, 22)
Zbiór y treningowy:  (27595,)
Zbiór y testowy:     (13593,)

C:ProgramDataAnaconda3libsite-packagessklearnlinear_modellogistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

array([[11806,   256],
       [ 1043,   488]], dtype=int64)

              precision    recall  f1-score   support

           0       0.92      0.98      0.95     12062
           1       0.66      0.32      0.43      1531

    accuracy                           0.90     13593
   macro avg       0.79      0.65      0.69     13593
weighted avg       0.89      0.90      0.89     13593

Accuracy:    0.9
Precision:   0.66
Recall:      0.32
F1 score:    0.43

0    0.887346
1    0.112654
Name: y, dtype: float64

ytrain = 0:  24486
ytrain = 1:  3109

8

24872

Dane dyskretne są zdygitalizowane.

Xtrain.head(4)

Logistic Regression

In [10]:

import numpy as np
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

LR_Grid.fit(Xtrain, ytrain)

C:ProgramDataAnaconda3libsite-packagessklearnlinear_modellogistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

array([[11806,   256],
       [ 1043,   488]], dtype=int64)

              precision    recall  f1-score   support

           0       0.92      0.98      0.95     12062
           1       0.66      0.32      0.43      1531

    accuracy                           0.90     13593
   macro avg       0.79      0.65      0.69     13593
weighted avg       0.89      0.90      0.89     13593

Accuracy:    0.9
Precision:   0.66
Recall:      0.32
F1 score:    0.43

0    0.887346
1    0.112654
Name: y, dtype: float64

ytrain = 0:  24486
ytrain = 1:  3109

8

24872

24872

Blok oceny jakości modelu Logistic Regression

Podstawienie do wzoru

# Podstawienie do wzoru
ypred = LR_Grid.predict(Xtest) 

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix

array([[11806,   256],
       [ 1043,   488]], dtype=int64)

print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.92      0.98      0.95     12062
           1       0.66      0.32      0.43      1531

    accuracy                           0.90     13593
   macro avg       0.79      0.65      0.69     13593
weighted avg       0.89      0.90      0.89     13593

Accuracy:    0.9
Precision:   0.66
Recall:      0.32
F1 score:    0.43

0    0.887346
1    0.112654
Name: y, dtype: float64

ytrain = 0:  24486
ytrain = 1:  3109

8

24872

24872

ilość elementów w zbiorze Xtrain:      27595
ilość elementów w zbiorze Xtrain_OVS:  52467
ilość elementów w zbiorze ytrain:      27595
ilość elementów w zbiorze ytrain_OVS:  52467

C:ProgramDataAnaconda3libsite-packagessklearnlinear_modellogistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

print("Accuracy:   ",np.round(metrics.accuracy_score(ytest, ypred), decimals=2))
print("Precision:  ",np.round(metrics.precision_score(ytest, ypred), decimals=2))
print("Recall:     ",np.round(metrics.recall_score(ytest, ypred), decimals=2))
print("F1 score:   ",np.round(metrics.f1_score(ytest, ypred), decimals=2))

Accuracy:    0.9
Precision:   0.66
Recall:      0.32
F1 score:    0.43

0    0.887346
1    0.112654
Name: y, dtype: float64

ytrain = 0:  24486
ytrain = 1:  3109

8

24872

24872

ilość elementów w zbiorze Xtrain:      27595
ilość elementów w zbiorze Xtrain_OVS:  52467
ilość elementów w zbiorze ytrain:      27595
ilość elementów w zbiorze ytrain_OVS:  52467

C:ProgramDataAnaconda3libsite-packagessklearnlinear_modellogistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

array([[10082,  1980],
       [  203,  1328]], dtype=int64)

Analiza poziomu zbilansowania zmiennej wynikowej

df.y.value_counts(dropna = False, normalize=True)

0    0.887346
1    0.112654
Name: y, dtype: float64

print("ytrain = 0: ", sum(ytrain == 0))
print("ytrain = 1: ", sum(ytrain == 1))

ytrain = 0:  24486
ytrain = 1:  3109

8

24872

24872

ilość elementów w zbiorze Xtrain:      27595
ilość elementów w zbiorze Xtrain_OVS:  52467
ilość elementów w zbiorze ytrain:      27595
ilość elementów w zbiorze ytrain_OVS:  52467

C:ProgramDataAnaconda3libsite-packagessklearnlinear_modellogistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

array([[10082,  1980],
       [  203,  1328]], dtype=int64)

              precision    recall  f1-score   support

           0       0.98      0.84      0.90     12062
           1       0.40      0.87      0.55      1531

    accuracy                           0.84     13593
   macro avg       0.69      0.85      0.73     13593
weighted avg       0.92      0.84      0.86     13593

Accuracy:     0.84
Precision:    0.4
Recall:       0.87
F1 score:     0.55

Proporcja = sum(ytrain == 0) / sum(ytrain == 1) 
Proporcja = np.round(Proporcja, decimals=0)
Proporcja = Proporcja.astype(int)
Proporcja

8

Na jedną daną sybskrypcje przypada 8 nieprzedłużonych subskrypcji. Powiększamy liczbę próbek niezależnych.

ytrain_pos_OVS = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0) 
ytrain_pos_OVS.count()

24872

Mamy już wektor zmiennych wynikowych y, teraz trzeba zwiększyć liczbę zmiennych niezależnych

Powiększamy liczbę próbek zmiennych niezależnych X

Xtrain_pos_OVS = pd.concat([Xtrain.loc[ytrain==1, :]] * Proporcja, axis = 0)

Xtrain_pos_OVS.age.count()

24872

Teraz mamy tą samą liczbę wierszy zmiennych wynikowych i zmiennych niezależnych.

Teraz wprowadzamy nowe, dodatkowe zmienne 1 do zbioru treningowego.

ytrain_OVS = pd.concat([ytrain, ytrain_pos_OVS], axis = 0).reset_index(drop = True)
Xtrain_OVS = pd.concat([Xtrain, Xtrain_pos_OVS], axis = 0).reset_index(drop = True)

Sprawdzamy ilość wierszy w zbiorach przed i po oversampling.

print("ilość elementów w zbiorze Xtrain:     ", Xtrain.age.count())
print("ilość elementów w zbiorze Xtrain_OVS: ", Xtrain_OVS.age.count())
print("ilość elementów w zbiorze ytrain:     ", ytrain.count())
print("ilość elementów w zbiorze ytrain_OVS: ", ytrain_OVS.count())

ilość elementów w zbiorze Xtrain:      27595
ilość elementów w zbiorze Xtrain_OVS:  52467
ilość elementów w zbiorze ytrain:      27595
ilość elementów w zbiorze ytrain_OVS:  52467

C:ProgramDataAnaconda3libsite-packagessklearnlinear_modellogistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

array([[10082,  1980],
       [  203,  1328]], dtype=int64)

              precision    recall  f1-score   support

           0       0.98      0.84      0.90     12062
           1       0.40      0.87      0.55      1531

    accuracy                           0.84     13593
   macro avg       0.69      0.85      0.73     13593
weighted avg       0.92      0.84      0.86     13593

Accuracy:     0.84
Precision:    0.4
Recall:       0.87
F1 score:     0.55

Teraz podstawiamy nowy zbiór testowy oversampling do siatki grid według tej same formuły, którą użyliśmy wcześniej.

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)

LR_Grid.fit(Xtrain_OVS, ytrain_OVS)

C:ProgramDataAnaconda3libsite-packagessklearnlinear_modellogistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

GridSearchCV(cv=2, error_score='raise-deprecating',
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='warn',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='warn',
                                          tol=0.0001, verbose=0,
                                          warm_start=True),
             iid='warn', n_jobs=5,
             param_grid={'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

array([[10082,  1980],
       [  203,  1328]], dtype=int64)

              precision    recall  f1-score   support

           0       0.98      0.84      0.90     12062
           1       0.40      0.87      0.55      1531

    accuracy                           0.84     13593
   macro avg       0.69      0.85      0.73     13593
weighted avg       0.92      0.84      0.86     13593

Accuracy:     0.84
Precision:    0.4
Recall:       0.87
F1 score:     0.55

Podstawienie do wzoru

ypred_OVS = LR_Grid.predict(Xtest)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred_OVS)
co_matrix

array([[10082,  1980],
       [  203,  1328]], dtype=int64)

print(classification_report(ytest, ypred_OVS))

              precision    recall  f1-score   support

           0       0.98      0.84      0.90     12062
           1       0.40      0.87      0.55      1531

    accuracy                           0.84     13593
   macro avg       0.69      0.85      0.73     13593
weighted avg       0.92      0.84      0.86     13593

Accuracy:     0.84
Precision:    0.4
Recall:       0.87
F1 score:     0.55

print("Accuracy:    ",np.round(metrics.accuracy_score(ytest, ypred_OVS), decimals=2))
print("Precision:   ",np.round(metrics.precision_score(ytest, ypred_OVS), decimals=2))
print("Recall:      ",np.round(metrics.recall_score(ytest, ypred_OVS), decimals=2))
print("F1 score:    ",np.round(metrics.f1_score(ytest, ypred_OVS), decimals=2))

Accuracy:     0.84
Precision:    0.4
Recall:       0.87
F1 score:     0.55

Wynik modelu Logistic Regression przed Oversampling

Accuracy: 0.9
Precision: 0.66
Recall: 0.32
F1 score: 0.43

Wynik kodelu po Oversampling

Accuracy: 0.84
Precision: 0.4
Recall: 0.87
F1 score: 0.55

	Unnamed: 0	Unnamed: 0.1	age	job	marital	education	default	housing	loan	contact	…	campaign	pdays	previous	poutcome	emp_var_rate	cons_price_idx	cons_conf_idx	euribor3m	nr_employed	y
0	0	0	44	blue-collar	married	basic.4y	unknown	yes	no	cellular	…	1	999	0	nonexistent	1.4	93.444	-36.1	4.963	5228.1	0
1	1	1	53	technician	married	unknown	no	no	no	cellular	…	1	999	0	nonexistent	-0.1	93.200	-42.0	4.021	5195.8	0
2	2	2	28	management	single	university.degree	no	yes	no	cellular	…	3	6	2	success	-1.7	94.055	-39.8	0.729	4991.6	1
3	3	3	39	services	married	high.school	no	no	no	cellular	…	2	999	0	nonexistent	-1.8	93.075	-47.1	1.405	5099.1	0
4	4	4	55	retired	married	basic.4y	no	yes	no	cellular	…	1	3	1	success	-2.9	92.201	-31.4	0.869	5076.2	1

	Unnamed: 0	Unnamed: 0.1	age	job	marital	education	default	loan	contact	…	duration	campaign	pdays	poutcome	emp_var_rate	cons_price_idx	cons_conf_idx	euribor3m	nr_employed
24697	24697	24697	49	1	1	2	1	0	1	…	222	9	999	1	1.4	94.465	-41.8	4.959	5228.1
25855	25855	25855	38	9	0	6	1	0	0	…	125	3	999	1	1.4	93.444	-36.1	4.963	5228.1
23236	23236	23236	42	0	0	6	0	0	1	…	26	4	999	1	1.4	94.465	-41.8	4.959	5228.1
13812	13812	13812	58	1	1	5	1	2	1	…	25	1	999	1	1.4	94.465	-41.8	4.866	5228.1

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński

Oversampling dla Logistic Regression

Procedura oversampling dla Logistic Regression

Tworzymy zestaw treningowy i zestaw testowy, budujemy model

Logistic Regression

Blok oceny jakości modelu Logistic Regression

Analiza poziomu zbilansowania zmiennej wynikowej