When building 0-1 (binary) classification models, we often run into the problem of imbalanced datasets: one class is far more frequent than the other.

Source: http://sigmaquality.pl/machine-learning/model-regresji-logistycznej-czesc-2-oversampling/

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder, Normalizer
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# its replacement is sklearn.impute.SimpleImputer
from sklearn.impute import SimpleImputer

plt.style.use('ggplot')

df = pd.read_csv('c:/1/bank.csv')
df.head()

MODEL WITHOUT BALANCING THE DATASETS

Label encoding, applied only to the discrete columns

I select the text (discrete) columns for further processing. It would have been better to handle the discrete and the continuous columns separately.

encoding_list = ['job', 'marital', 'education', 'default', 'housing', 'loan',
       'contact', 'month', 'day_of_week','poutcome']

df[encoding_list] = df[encoding_list].apply(LabelEncoder().fit_transform)

df[encoding_list].head()
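
To make the encoding step concrete, here is a minimal sketch on a made-up two-column frame (the `toy` data is hypothetical, not taken from bank.csv). `LabelEncoder` assigns integer codes in sorted order of the labels. Because `apply` refits the encoder on every column, no per-column mapping is stored for later use.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two text columns
toy = pd.DataFrame({
    "marital": ["married", "single", "married", "divorced"],
    "loan": ["no", "yes", "no", "no"],
})

# Same idiom as in the text: the encoder is refitted on each column in turn
encoded = toy.apply(LabelEncoder().fit_transform)

# Codes follow the sorted label order, e.g. divorced < married < single
print(encoded)
```

The integer codes carry an arbitrary alphabetical order. Tree-based models such as Random Forest usually tolerate this, but for linear models one-hot encoding is the more common choice.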

We create the training set and the test set, then build the model.

y = df['y']
X = df.drop('y', axis=1)

The "golden" split of the dataset into a training set and a test set

from sklearn.model_selection import train_test_split 
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)
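
The `stratify = y` argument is what keeps the class proportions the same in both parts of the split. A small sketch on synthetic labels (the arrays `X_toy` and `y_toy` are made up for illustration, with roughly the same 11% share of positives as in the text) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 110 positives (11%), 890 negatives
y_toy = np.array([1] * 110 + [0] * 890)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, stratify=y_toy, random_state=148
)

# With stratification both shares stay close to the overall 11%
train_share = y_tr.mean()
test_share = y_te.mean()
print("share of 1 in train:", round(train_share, 3))
print("share of 1 in test: ", round(test_share, 3))
```

Without `stratify`, a random split of a strongly imbalanced set can leave the test set with noticeably more or fewer positives than the training set.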

Sizes of the sets

print('X training set: ', Xtrain.shape)
print('X test set:     ', Xtest.shape)
print('y training set: ', ytrain.shape)
print('y test set:     ', ytest.shape)

X training set:  (27595, 22)
X test set:      (13593, 22)
y training set:  (27595,)
y test set:      (13593,)

The discrete columns are now encoded as integers.

Xtrain.head(4)

Random Forest Classifier

from sklearn import model_selection
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

forestVC = RandomForestClassifier (random_state = 1, 
                                  n_estimators = 750, 
                                  max_depth = 15, 
                                  min_samples_split = 5, min_samples_leaf = 1) 
modelF = forestVC.fit(Xtrain, ytrain)
y_predF = modelF.predict(Xtest)

Evaluating the quality of the Random Forest Classifier model

Generating predictions for the test set

ypred = modelF.predict(Xtest)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix

array([[11692,   370],
       [  797,   734]], dtype=int64)

print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.94      0.97      0.95     12062
           1       0.66      0.48      0.56      1531

   micro avg       0.91      0.91      0.91     13593
   macro avg       0.80      0.72      0.75     13593
weighted avg       0.91      0.91      0.91     13593

print("Accuracy:   ",np.round(metrics.accuracy_score(ytest, ypred), decimals=2))
print("Precision:  ",np.round(metrics.precision_score(ytest, ypred), decimals=2))
print("Recall:     ",np.round(metrics.recall_score(ytest, ypred), decimals=2))
print("F1 score:   ",np.round(metrics.f1_score(ytest, ypred), decimals=2))

Accuracy:    0.91
Precision:   0.66
Recall:      0.48
F1 score:    0.56
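
These four figures can be re-derived by hand from the confusion matrix shown earlier, which helps when reading the report. The sketch below only assumes scikit-learn's layout of the matrix (rows are actual classes, columns are predicted classes); the counts are copied from the output in the text.

```python
# Confusion matrix of the model trained without balancing
# rows = actual class (0, 1), columns = predicted class (0, 1)
tn, fp = 11692, 370
fn, tp = 797, 734

accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct / all cases
precision = tp / (tp + fp)                  # how many predicted 1 are really 1
recall = tp / (tp + fn)                     # how many real 1 were found
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy:  ", round(accuracy, 2))
print("Precision: ", round(precision, 2))
print("Recall:    ", round(recall, 2))
print("F1 score:  ", round(f1, 2))
```

The low recall comes from the 797 positive cases that the model labelled as 0.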

Analysis of how balanced the outcome variable is

df.y.value_counts(dropna = False, normalize=True)

0    0.887346
1    0.112654
Name: y, dtype: float64

print("ytrain = 0: ", sum(ytrain == 0))
print("ytrain = 1: ", sum(ytrain == 1))

ytrain = 0:  24486
ytrain = 1:  3109
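
A little arithmetic shows why this imbalance matters. The sketch below uses only counts printed in the text and compares them with a hypothetical "model" that always predicts 0; such a baseline is an illustration, not something the original notebook builds.

```python
# Class counts copied from the outputs in the text
train_neg, train_pos = 24486, 3109
test_neg, test_pos = 12062, 1531  # "support" column of the classification report

# Share of the positive class in the training set
train_pos_share = train_pos / (train_neg + train_pos)

# A hypothetical baseline that always answers 0 is right on every negative case
baseline_accuracy = test_neg / (test_neg + test_pos)
# ... and finds none of the positives
baseline_recall = 0 / test_pos

print("share of 1 in ytrain:", round(train_pos_share, 3))
print("baseline accuracy:   ", round(baseline_accuracy, 3))
print("baseline recall:     ", baseline_recall)
```

An accuracy of about 0.89 is therefore available without learning anything, which is why recall and F1 for class 1 are the more informative figures here.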

Proporcja = sum(ytrain == 0) / sum(ytrain == 1) 
Proporcja = np.round(Proporcja, decimals=0)
Proporcja = Proporcja.astype(int)
Proporcja

8

For every subscription taken out (y = 1) there are about 8 cases without a subscription (y = 0). We therefore enlarge the number of class 1 samples, starting with the outcome variable.

ytrain_pos_OVS = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0) 
ytrain_pos_OVS.count()

24872

The number of additional outcome values equal to 1 is 24872 (3109 × 8).
We now have the vector of additional outcome values y; next, the independent variables have to be enlarged in the same way.

Enlarging the number of samples of the independent variables X

Xtrain_pos_OVS = pd.concat([Xtrain.loc[ytrain==1, :]] * Proporcja, axis = 0)

Xtrain_pos_OVS.age.count()

24872

Now the additional outcome values and the additional independent variables have the same number of rows.

Next we append these additional class 1 rows to the training set.

ytrain_OVS = pd.concat([ytrain, ytrain_pos_OVS], axis = 0).reset_index(drop = True)
Xtrain_OVS = pd.concat([Xtrain, Xtrain_pos_OVS], axis = 0).reset_index(drop = True)
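
The steps above can be collected into one small function. This is a minimal sketch: the name `oversample_minority` and the toy data are hypothetical, and it assumes the minority class is clearly smaller than the majority class, so that the rounded ratio is at least 1.

```python
import pandas as pd

def oversample_minority(X, y, minority_label=1):
    """Append copies of the minority rows, as in the text (hypothetical helper)."""
    mask = y == minority_label
    # Rounded ratio of majority to minority rows (8 in the article)
    copies = int(round((~mask).sum() / mask.sum()))
    # Repeat the minority rows `copies` times
    X_extra = pd.concat([X.loc[mask]] * copies, axis=0)
    y_extra = pd.concat([y[mask]] * copies, axis=0)
    # Append them to the original training data and rebuild the index
    X_out = pd.concat([X, X_extra], axis=0).reset_index(drop=True)
    y_out = pd.concat([y, y_extra], axis=0).reset_index(drop=True)
    return X_out, y_out

# Toy data: 1 positive row and 8 negative rows
X_toy = pd.DataFrame({"age": [25, 40, 31, 58, 47, 36, 29, 52, 44]})
y_toy = pd.Series([1, 0, 0, 0, 0, 0, 0, 0, 0])

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(y_bal.value_counts())
```

As in the text, only the training data should be passed to such a function. The test set keeps its original class proportions so that the evaluation stays honest.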

We check the number of rows in the sets before and after oversampling.

print("rows in Xtrain:     ", Xtrain.age.count())
print("rows in Xtrain_OVS: ", Xtrain_OVS.age.count())
print("rows in ytrain:     ", ytrain.count())
print("rows in ytrain_OVS: ", ytrain_OVS.count())

rows in Xtrain:      27595
rows in Xtrain_OVS:  52467
rows in ytrain:      27595
rows in ytrain_OVS:  52467
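
The row counts above follow directly from the class counts. The short check below reproduces them from figures already given in the text, and also shows the class balance that results (the variable names are chosen here for illustration).

```python
# Class counts in the original training set
n_neg, n_pos = 24486, 3109

ratio = round(n_neg / n_pos)        # about 7.88, rounded to 8
extra_pos = n_pos * ratio           # copies of class 1 rows that are appended
train_size_ovs = n_neg + n_pos + extra_pos

# Share of class 1 after oversampling
pos_share_ovs = (n_pos + extra_pos) / train_size_ovs

print("ratio:             ", ratio)
print("added rows:        ", extra_pos)
print("rows after OVS:    ", train_size_ovs)
print("share of 1 after:  ", round(pos_share_ovs, 3))
```

Because the copies are added on top of the original rows, each class 1 row ends up in the training set nine times, and class 1 slightly outnumbers class 0 (about 53% of rows).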

Now we train the classifier on the new, oversampled training set, using the same settings as before.

forestVC = RandomForestClassifier (random_state = 1, 
                                  n_estimators = 750, 
                                  max_depth = 15, 
                                  min_samples_split = 5, min_samples_leaf = 1) 
modelF = forestVC.fit(Xtrain_OVS, ytrain_OVS)

Generating predictions for the test set

ypred = modelF.predict(Xtest)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(ytest, ypred)
co_matrix

array([[10875,  1187],
       [  268,  1263]], dtype=int64)

print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.98      0.90      0.94     12062
           1       0.52      0.82      0.63      1531

   micro avg       0.89      0.89      0.89     13593
   macro avg       0.75      0.86      0.79     13593
weighted avg       0.92      0.89      0.90     13593

print("Accuracy:    ",np.round(metrics.accuracy_score(ytest, ypred), decimals=2))
print("Precision:   ",np.round(metrics.precision_score(ytest, ypred), decimals=2))
print("Recall:      ",np.round(metrics.recall_score(ytest, ypred), decimals=2))
print("F1 score:    ",np.round(metrics.f1_score(ytest, ypred), decimals=2))

Accuracy:     0.89
Precision:    0.52
Recall:       0.82
F1 score:     0.63

Model results before oversampling

Accuracy: 0.91
Precision: 0.66
Recall: 0.48
F1 score: 0.56

Model results after oversampling

Accuracy: 0.89
Precision: 0.52
Recall: 0.82
F1 score: 0.63
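
The trade-off behind these two sets of figures is easier to see in absolute numbers. The sketch below compares the two confusion matrices from the text; the dictionary names are chosen here for illustration.

```python
# Confusion matrices from the text (rows = actual, columns = predicted)
before = {"tn": 11692, "fp": 370, "fn": 797, "tp": 734}
after = {"tn": 10875, "fp": 1187, "fn": 268, "tp": 1263}

# Both matrices describe the same, untouched test set
assert sum(before.values()) == sum(after.values())

# Positive cases found thanks to oversampling
extra_hits = after["tp"] - before["tp"]
# Negative cases that are now wrongly flagged as positive
extra_false_alarms = after["fp"] - before["fp"]

print("additional positives found:", extra_hits)
print("additional false alarms:   ", extra_false_alarms)
```

Oversampling buys recall at the cost of precision. Whether that is a good exchange depends on what a missed subscriber costs compared with an unnecessary contact.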

Output of df.head() (first five rows of the raw data):

	Unnamed: 0	Unnamed: 0.1	age	job	marital	education	default	housing	loan	contact	…	campaign	pdays	previous	poutcome	emp_var_rate	cons_price_idx	cons_conf_idx	euribor3m	nr_employed	y
0	0	0	44	blue-collar	married	basic.4y	unknown	yes	no	cellular	…	1	999	0	nonexistent	1.4	93.444	-36.1	4.963	5228.1	0
1	1	1	53	technician	married	unknown	no	no	no	cellular	…	1	999	0	nonexistent	-0.1	93.200	-42.0	4.021	5195.8	0
2	2	2	28	management	single	university.degree	no	yes	no	cellular	…	3	6	2	success	-1.7	94.055	-39.8	0.729	4991.6	1
3	3	3	39	services	married	high.school	no	no	no	cellular	…	2	999	0	nonexistent	-1.8	93.075	-47.1	1.405	5099.1	0
4	4	4	55	retired	married	basic.4y	no	yes	no	cellular	…	1	3	1	success	-2.9	92.201	-31.4	0.869	5076.2	1

Output of Xtrain.head(4) (discrete columns encoded as integers):

	Unnamed: 0	Unnamed: 0.1	age	job	marital	education	default	loan	contact	…	duration	campaign	pdays	poutcome	emp_var_rate	cons_price_idx	cons_conf_idx	euribor3m	nr_employed
24697	24697	24697	49	1	1	2	1	0	1	…	222	9	999	1	1.4	94.465	-41.8	4.959	5228.1
25855	25855	25855	38	9	0	6	1	0	0	…	125	3	999	1	1.4	93.444	-36.1	4.963	5228.1
23236	23236	23236	42	0	0	6	0	0	1	…	26	4	999	1	1.4	94.465	-41.8	4.959	5228.1
13812	13812	13812	58	1	1	5	1	2	1	…	25	1	999	1	1.4	94.465	-41.8	4.866	5228.1

Wojciech Moszczyński

Oversampling procedure for the Random Forest Classifier