Conglomerate of the models #4: Comparative analysis of classification models, bagging, cross-validation

040620201204

conglomerate of the models

Cognition comes by comparison! Friedrich Wilhelm Nietzsche

The best knowledge can be obtained by comparing many models from different perspectives. In this study, I used data about people studying whether they would have a stroke or not.Withh the help of predict_prob with a properly working model, it is possible to provide each test person with the probability that they will have a stroke. By controlling age, you can show him the likelihood of stroke in the next years of life. The database is easily available on the internet.

In my research I used this database but changed the formula of the analysis slightly. I imagined it was a test for the currently popular COVID-19. With tests on COVID-19 it is like this:

  1. It is important and even crucial that they have tests give high detectability because if we release a patient who will have COVID-19 and we will say that he is healthy. It can infect many people and put them at risk of losing their lives. Poor detection also means having to repeat tests, which is expensive and inefficient.
  2. It is also important not to scare healthy people that they have COVID-19.

obraz.png
Friedrich Wilhelm Nietzsche

My test is a great research laboratory for classification models:
At the same time, it examines 9 classification models from different families, and compiles the assessment indicators of these models into one table. In addition, it strengthens all models by loading – i.e. we already have 18 honeys. Then the program can calibrate models, which means 36 models. Another function is cross-validation which is an effective way to improve model properties even though it takes a terrible amount of time. After applying the calibration we already have 54 models.

In this exercise, I gave up calibrating models but this option still remains changing: calibration = True: calibration = True.
I also reduced the number of cross-validation variants so that the calculations were completed during my lifetime.
Illustrations come from my beloved city of Venice during the COVIT-19 epidemic.

Conglomerate of the models #3: CALIBRATING of the MODELS

obraz.png

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings 
from sklearn.ensemble import BaggingClassifier
from simple_colors import * 
from prettytable import PrettyTable

warnings.filterwarnings("ignore")

%matplotlib inline

df= pd.read_csv('/home/wojciech/Pulpit/1/Stroke_Prediction.csv')
print(df.shape)
print()
print(df.columns)
df.head(3)
(43400, 12)

Index(['ID', 'Gender', 'Age_In_Days', 'Hypertension', 'Heart_Disease',
       'Ever_Married', 'Type_Of_Work', 'Residence', 'Avg_Glucose', 'BMI',
       'Smoking_Status', 'Stroke'],
      dtype='object')
Out[1]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke
0 31153 Male 1104.0 0 0 No children Rural 95.12 18.0 NaN 0
1 30650 Male 21204.0 1 0 Yes Private Urban 87.96 39.2 never smoked 0
2 17412 Female 2928.0 0 0 No Private Urban 110.89 17.6 NaN 0

Sample reduction:

In [2]:
df = df.sample(frac = 0.5, random_state=10) 
df.shape
Out[2]:
(21700, 12)

Start pomiaru czasu

In [3]:
import time
start_time = time.time() ## pomiar czasu: start pomiaru czasu
print(time.ctime())
Wed Jun  3 22:38:49 2020

Tool for automatic coding of discrete variables

obraz.png

In [4]:
a,b = df.shape     #<- ile mamy kolumn
b

print('DISCRETE FUNCTIONS CODED')
print('------------------------')
for i in range(1,b):
    i = df.columns[i]
    f = df[i].dtypes
    if f == np.object:
        print(i,"---",f)   
    
        if f == np.object:
        
            df[i] = pd.Categorical(df[i]).codes
        
            continue
DISCRETE FUNCTIONS CODED
------------------------
Gender --- object
Ever_Married --- object
Type_Of_Work --- object
Residence --- object
Smoking_Status --- object
In [5]:
df.fillna(7777, inplace=True)
In [6]:
X = df.drop('Stroke', axis=1) 
y = df['Stroke']  

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

X_test = X_test.values
y_test = y_test.values
X_train = X_train.values
y_train = y_train.values

Oversampling !!

obraz.png

In [7]:
def oversampling(ytrain, Xtrain):
    import matplotlib.pyplot as plt
    
    global Xtrain_OV
    global ytrain_OV

    calss1 = np.round((sum(ytrain == 1)/(sum(ytrain == 0)+sum(ytrain == 1))),decimals=2)*100
    calss0 = np.round((sum(ytrain == 0)/(sum(ytrain == 0)+sum(ytrain == 1))),decimals=2)*100
    
    print("y = 0: ", sum(ytrain == 0),'-------',calss0,'%')
    print("y = 1: ", sum(ytrain == 1),'-------',calss1,'%')
    print('--------------------------------------------------------')
    
    ytrain.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
    plt.show
    print()
    
    Proporcja = sum(ytrain == 0) / sum(ytrain == 1)
    Proporcja = np.round(Proporcja, decimals=0)
    Proporcja = Proporcja.astype(int)
       
    ytrain_OV = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0) 
    Xtrain_OV = pd.concat([Xtrain.loc[ytrain==1, :]] * Proporcja, axis = 0)
    
    ytrain_OV = pd.concat([ytrain, ytrain_OV], axis = 0).reset_index(drop = True)
    Xtrain_OV = pd.concat([Xtrain, Xtrain_OV], axis = 0).reset_index(drop = True)
    
    Xtrain_OV = pd.DataFrame(Xtrain_OV)
    ytrain_OV = pd.DataFrame(ytrain_OV)
    

    
    print("Before oversampling Xtrain:     ", Xtrain.shape)
    print("Before oversampling ytrain:     ", ytrain.shape)
    print('--------------------------------------------------------')
    print("After oversampling Xtrain_OV:  ", Xtrain_OV.shape)
    print("After oversampling ytrain_OV:  ", ytrain_OV.shape)
    print('--------------------------------------------------------')
    
    
    ax = plt.subplot(1, 2, 1)
    ytrain.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
    plt.show
    
       
    kot = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0)
    kot = pd.concat([ytrain, kot], axis = 0).reset_index(drop = True)
    ax = plt.subplot(1, 2, 2)
    kot.value_counts(dropna = False, normalize=True).plot(kind='pie',title='After oversampling')
    plt.show
In [8]:
oversampling(y_train, X_train)
y = 0:  17040 ------- 98.0 %
y = 1:  320 ------- 2.0 %
--------------------------------------------------------

Before oversampling Xtrain:      (17360, 11)
Before oversampling ytrain:      (17360,)
--------------------------------------------------------
After oversampling Xtrain_OV:   (34320, 11)
After oversampling ytrain_OV:   (34320, 1)
--------------------------------------------------------

I used six models of GaussianNB, LogisticRegression, CatBoostClassifier in their basic version without oversamplin and with oversampling. We will see what differences in the minority set classification the oversampling method gives.!!

I get rid of one dimension from the ytrain_OV set so that the set is similar to y_test.

In [9]:
print(Xtrain_OV.shape)
print(ytrain_OV.shape)
ytrain_OV = ytrain_OV['Stroke']
print(ytrain_OV.shape)
(34320, 11)
(34320, 1)
(34320,)

W poprzednim wpisie uznaliśmy, że oversampling poprawiło jakość klasyfikacji. Kolejne działania będą opierały sie na danych zbilansowanych przez oversampling. Dlatego teraz podmieniamy zwykłą próbę na próbę po oversamoling.

In [10]:
X_train = Xtrain_OV
y_train = ytrain_OV
print(X_train.shape)
print(y_train.shape)
(34320, 11)
(34320,)

Oversampling for cross-validation

Cross-validation should be done for a full set. therefore oversampling should be performed for a balanced set first. In my exercise I took a shortcut and did oversampling on a training set. However, I left this option. There are a lot of working but disabled analytical functions in this file.
obraz.png

In [11]:
X = df.drop('Stroke', axis=1) 
y = df['Stroke']  
In [12]:
X.shape
Out[12]:
(21700, 11)
In [13]:
oversampling(y, X)
y = 0:  21300 ------- 98.0 %
y = 1:  400 ------- 2.0 %
--------------------------------------------------------

Before oversampling Xtrain:      (21700, 11)
Before oversampling ytrain:      (21700,)
--------------------------------------------------------
After oversampling Xtrain_OV:   (42900, 11)
After oversampling ytrain_OV:   (42900, 1)
--------------------------------------------------------
In [14]:
Data = Xtrain_OV
target = ytrain_OV
print("output:",Data.shape)
print("output:",target.shape)
print('----------')
print("input:", df.shape)
output: (42900, 11)
output: (42900, 1)
----------
input: (21700, 12)

I create 4 groups of classifiers:

  1. Normal classifiers after oversampling,
  2. Classifiers after bagging
  3. Standard calibrators
  4. Classifiers after bagging calibrated

Below are 2 basic groups: 1. Classifiers after oversampling, 2. Classifiers after bagging

In [15]:
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from sklearn.svm import SVC 
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC 
import time

NBC = GaussianNB()
LRE = LogisticRegression(solver='lbfgs')
GBC = GradientBoostingClassifier()
RFC = RandomForestClassifier()
LGBM = LGBMClassifier() 
CBC = CatBoostClassifier(verbose=0, n_estimators=100)
XGB = XGBClassifier()
LREN = LogisticRegression(solver='newton-cg')
KNN = KNeighborsClassifier(n_neighbors=1, p=2)
SVM = SVC(probability=True) 


classifiers_A = [SVM,CBC,XGB,LGBM,KNN,NBC,LRE,RFC,GBC]
nameA = ['SVM','CBC','XGB','LGBM','KNN','NBC','LRE','RFC','GBC']

for n,t in zip(nameA,classifiers_A):          ## Szkolenie modeli w pętli
    start_time = time.time()
    t.fit(X_train, y_train)
    p = np.round((time.time() - start_time),decimals=1)
    print(blue(n),p,"---",time.ctime())



### Wzmacnianie przez bagowanie!

NBC_b = BaggingClassifier(base_estimator=NBC, n_estimators=10, max_samples=0.8, max_features=0.8)
LRE_b = BaggingClassifier(base_estimator=LRE, n_estimators=10, max_samples=0.8, max_features=0.8)
GBC_b = BaggingClassifier(base_estimator=GBC, n_estimators=10, max_samples=0.8, max_features=0.8)
RFC_b = BaggingClassifier(base_estimator=RFC, n_estimators=10, max_samples=0.8, max_features=0.8)
LGBM_b = BaggingClassifier(base_estimator=LGBM, n_estimators=10, max_samples=0.8, max_features=0.8)
CBC_b = BaggingClassifier(base_estimator=CBC, n_estimators=10, max_samples=0.8, max_features=0.8)
XGB_b = BaggingClassifier(base_estimator=XGB, n_estimators=10, max_samples=0.8, max_features=0.8)
SVM_b = BaggingClassifier(base_estimator=SVM, n_estimators=10, max_samples=0.8, max_features=0.8)
KNN_b = BaggingClassifier(base_estimator=KNN, n_estimators=10, max_samples=0.8, max_features=0.8)



classifiers_B = [SVM_b,CBC_b,XGB_b,LGBM_b,KNN_b,NBC_b,LRE_b,RFC_b,GBC_b]
nameB = ['SVM_b','CBC_b','XGB_b','LGBM_b','KNN_b','NBC_b','LRE_b','RFC_b','GBC_b']
 
print('-------------------------------------')
    
for f,p in zip(nameB,classifiers_B):            ## Szkolenie zbagowanych modeli w pętli
    start_time = time.time()
    p.fit(X_train, y_train)         
    k = np.round((time.time() - start_time),decimals=1)
    print(blue(f),k,"---",time.ctime())
SVM 268.5 --- Wed Jun  3 22:43:19 2020
CBC 1.3 --- Wed Jun  3 22:43:20 2020
XGB 2.2 --- Wed Jun  3 22:43:23 2020
LGBM 1.2 --- Wed Jun  3 22:43:24 2020
KNN 0.1 --- Wed Jun  3 22:43:24 2020
NBC 0.0 --- Wed Jun  3 22:43:24 2020
LRE 0.5 --- Wed Jun  3 22:43:24 2020
RFC 3.7 --- Wed Jun  3 22:43:28 2020
GBC 5.3 --- Wed Jun  3 22:43:33 2020
-------------------------------------
SVM_b 857.7 --- Wed Jun  3 22:57:51 2020
CBC_b 6.6 --- Wed Jun  3 22:57:58 2020
XGB_b 11.5 --- Wed Jun  3 22:58:09 2020
LGBM_b 3.7 --- Wed Jun  3 22:58:13 2020
KNN_b 0.5 --- Wed Jun  3 22:58:13 2020
NBC_b 0.1 --- Wed Jun  3 22:58:13 2020
LRE_b 2.1 --- Wed Jun  3 22:58:15 2020
RFC_b 17.6 --- Wed Jun  3 22:58:33 2020
GBC_b 25.6 --- Wed Jun  3 22:58:59 2020
In [ ]:
 

Cross validation

Division into folds for all models:

In [16]:
from sklearn.model_selection import RepeatedStratifiedKFold

cv_method = RepeatedStratifiedKFold(n_splits=5,            # 5-krotna weryfikacja krzyżowa
                                    n_repeats=3,           # z 3-ema powtórzeniami
                                    random_state=999)

A set of hyperparameters for each model:

In [17]:
params_KNN = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7], 'p': [1, 2, 5]}
params_NBC = {'var_smoothing': np.logspace(0,-9, num=100)}
params_LRE = {'C': np.power(10.0, np.arange(-3, 3))}
params_RFC = {
 'max_depth': [5, 8],
 'max_features': ['auto'],
 'min_samples_leaf': [1, 2],
 'min_samples_split': [2, 5],
 'n_estimators':  [100,  200, ]}

#n_estimators = [100, 300, 500, 800, 1200]
#max_depth = [5, 8, 15, 25, 30]
#min_samples_split = [2, 5, 10, 15, 100]
#min_samples_leaf = [1, 2, 5, 10]

params_RFC2 = {
 'max_depth': [2, 3],
 'min_samples_leaf': [3, 4],
 'n_estimators':  [500,1000]}

params_GBC = {
    'min_samples_split':range(1000,2100,200),
    'min_samples_leaf':range(30,71,10)}

#{'max_depth':range(5,16,2), 'min_samples_split':range(200,1001,200)}
#{'n_estimators':range(20,81,10)}
# https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/

params_GBC2 = {
    'max_depth':range(5,16,2),
    'min_samples_split':range(200,1001),
    'n_estimators':range(20,81)}

params_CBC = {'learning_rate': [0.03, 0.1],
        'depth': [4, 6, 10],
        'l2_leaf_reg': [1, 3, 5, 7, 9]}


params_LGBM = {'max_depth': [3, 6, 9],'learning_rate': [0.001, 0.01, 0.05]}

params_XGB = {"learning_rate": [0.05, 0.15, 0.25 ], "max_depth": [ 3, 6, 9], "gamma":[ 0.0, 0.1, 0.4 ] }

params_SVM = {'C': [0.1,1, 10, 100], 'kernel': ['rbf']}
## {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}

params_SVM2 = {'C': [0.1,1, 10, 100],'kernel': ['poly']}
## {'C': [0.1,1, 10, 100], 'gamma': [1,0.1,0.01,0.001],'kernel': ['rbf', 'poly', 'sigmoid']}

from sklearn.model_selection import GridSearchCV

SVM2 = SVC(probability=True)
RFC2 = RandomForestClassifier()

classifiers = [SVM,SVM2,XGB,LGBM,KNN,NBC,LRE,RFC,RFC2]
params = [params_SVM,params_SVM2,params_XGB,params_LGBM,params_KNN,params_NBC,params_LRE,params_RFC,params_RFC2]
names = [‘gs_SVM’,’gs_SVM2′,’gs_XGB’,’gs_LGBM’,’gs_KNN’,’gs_NBC’,’gs_LRE’,’gs_RFC’,’gs_RFC2′]

for w,t,m in zip(classifiers, params, names):
GridSearchCV(estimator=w,
param_grid=t,
cv=cv_method,
verbose=1, # verbose: the higher, the more messages
scoring=’roc_auc’,
return_train_score=True)

Inserting each model into the grid seat:

In [18]:
from sklearn.model_selection import GridSearchCV


##==============================================================================

gs_KNN = GridSearchCV(estimator=KNeighborsClassifier(), 
                      param_grid=params_KNN, 
                      cv=cv_method,
                      verbose=1,  # verbose: the higher, the more messages
                      scoring='roc_auc', 
                      return_train_score=True)

##==============================================================================

gs_NBC = GridSearchCV(estimator=NBC, 
                     param_grid=params_NBC, 
                     cv=cv_method,
                     verbose=1, 
                     scoring='roc_auc')

##==============================================================================

gs_LRE = GridSearchCV(estimator=LRE,
                      param_grid = params_LRE,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_RFC = GridSearchCV(estimator=RFC,
                      param_grid = params_RFC,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_RFC2 = GridSearchCV(estimator=RFC,
                      param_grid = params_RFC2,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_GBC = GridSearchCV(estimator=GBC,
                      param_grid = params_GBC,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_GBC2 = GridSearchCV(estimator=GBC,
                      param_grid = params_GBC2,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_CBC = GridSearchCV(estimator=GBC,
                      param_grid = params_CBC,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_XGB = GridSearchCV(estimator=XGB,
                      param_grid = params_XGB,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_LGBM = GridSearchCV(estimator=GBC,
                      param_grid = params_LGBM,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 
##==============================================================================

gs_SVM = GridSearchCV(estimator=SVM,
                      param_grid = params_SVM,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

##==============================================================================

gs_SVM2 = GridSearchCV(estimator=SVM,
                      param_grid = params_SVM2,
                      cv=cv_method,
                      verbose=1,
                      scoring = 'roc_auc') 

Exercise model using the full range of balanced data (after oversample):

In [19]:
classifiers_F = [gs_SVM,gs_SVM2,gs_XGB,gs_LGBM,gs_KNN,gs_NBC,gs_LRE,gs_RFC,gs_RFC2]
nameF = ['gs_SVM','gs_SVM2','gs_XGB','gs_LGBM','gs_KNN','gs_NBC','gs_LRE','gs_RFC','gs_RFC2']
In [20]:
for n,t in zip(nameF,classifiers_F):          ## Szkolenie modeli w pętli
    start_time = time.time()
    t.fit(X_train, y_train)
    p = np.round((time.time() - start_time),decimals=1)
    print(blue(n),p,"---",time.ctime())
Fitting 15 folds for each of 4 candidates, totalling 60 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 183.0min finished
gs_SVM 11241.6 --- Thu Jun  4 02:06:20 2020
Fitting 15 folds for each of 4 candidates, totalling 60 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  60 out of  60 | elapsed: 216.1min finished
gs_SVM2 13208.1 --- Thu Jun  4 05:46:28 2020
Fitting 15 folds for each of 27 candidates, totalling 405 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 405 out of 405 | elapsed:  7.4min finished
gs_XGB 447.6 --- Thu Jun  4 05:53:56 2020
Fitting 15 folds for each of 9 candidates, totalling 135 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 135 out of 135 | elapsed: 17.5min finished
gs_LGBM 1064.2 --- Thu Jun  4 06:11:40 2020
Fitting 15 folds for each of 21 candidates, totalling 315 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 315 out of 315 | elapsed:  7.3min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
gs_KNN 437.5 --- Thu Jun  4 06:18:58 2020
Fitting 15 folds for each of 100 candidates, totalling 1500 fits
[Parallel(n_jobs=1)]: Done 1500 out of 1500 | elapsed:   29.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
gs_NBC 29.1 --- Thu Jun  4 06:19:27 2020
Fitting 15 folds for each of 6 candidates, totalling 90 fits
[Parallel(n_jobs=1)]: Done  90 out of  90 | elapsed:   17.5s finished
gs_LRE 18.1 --- Thu Jun  4 06:19:45 2020
Fitting 15 folds for each of 16 candidates, totalling 240 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 240 out of 240 | elapsed: 11.4min finished
gs_RFC 686.6 --- Thu Jun  4 06:31:12 2020
Fitting 15 folds for each of 8 candidates, totalling 120 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed: 16.0min finished
gs_RFC2 973.2 --- Thu Jun  4 06:47:25 2020

Checking the best set of hyperparameters:

In [21]:
print('Best params gs_SVM:', gs_SVM.best_params_)
print('Best params gs_SVM2:', gs_SVM2.best_params_)
print('Best params gs_XGB:', gs_XGB.best_params_)
print('Best params gs_LGBM:', gs_LGBM.best_params_)
print('Best params gs_KNN:', gs_KNN.best_params_)
print('Best params gs_NBC:', gs_NBC.best_params_)
print('Best params gs_LRE:', gs_LRE.best_params_)
print('Best params gs_RFC:', gs_RFC.best_params_)
print('Best params gs_RFC2:', gs_RFC2.best_params_)
Best params gs_SVM: {'C': 1, 'kernel': 'rbf'}
Best params gs_SVM2: {'C': 10, 'kernel': 'poly'}
Best params gs_XGB: {'gamma': 0.4, 'learning_rate': 0.25, 'max_depth': 9}
Best params gs_LGBM: {'learning_rate': 0.05, 'max_depth': 9}
Best params gs_KNN: {'n_neighbors': 1, 'p': 5}
Best params gs_NBC: {'var_smoothing': 0.01873817422860384}
Best params gs_LRE: {'C': 0.01}
Best params gs_RFC: {'max_depth': 8, 'max_features': 'auto', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 200}
Best params gs_RFC2: {'max_depth': 3, 'min_samples_leaf': 3, 'n_estimators': 1000}

Time is money obraz.png

## Pomiar czasu – który model mi opóźnia obliczenia!?
def time_is_money(six_classifiers, name):

from sklearn.calibration import CalibratedClassifierCV, calibration_curve

print(blue(‘Time measurement for models in seconds’,’bold’))
import time
timer_P = [‘Time for model: ‘]
timer_C = [‘Time for model calibration: ‘]

def compute(model):

start_time = time.time()
model.fit(X_train, y_train)
g =((time.time() – start_time))
g = np.round(g,decimals=1)

start_time = time.time()
calibrated = CalibratedClassifierCV(model, method=’sigmoid’, cv=5)
calibrated.fit(X_train, y_train)
c =((time.time() – start_time))
c = np.round(c,decimals=1)

return g,c

for t,cls in zip(name,six_classifiers):
results = compute(cls)
timer_P.append(results[0])
timer_C.append(results[1])

t = PrettyTable([‘Name’, name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
t.add_row(timer_P)
t.add_row(timer_C)
print(t)

The most important type of Type_error set for me

False_Positive_Rate

- percentage share of healthy people recognized by the model as sick in the population of healthy people

True_Positive_Rate RECALL

- this indicator shows how detectable the disease is by the model.
In [22]:
def Type_error(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):

    
    from sklearn.datasets import make_classification
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from sklearn.metrics import confusion_matrix
    
    from sklearn import metrics
    import simple_colors
    import time   
    
    start_time = time.time()
    
    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train)    
    
    FPR = ['False_Positive_Rate:']
    TPR = ['True_Positive_Rate: ']
    FNR = ['False_Negative_Rate: ']
    SPEC = ['Specifity']
    
    CAL_FPR = ['CAL_False_Positive_Rate:']
    CAL_TPR = ['CAL_True_Positive_Rate: ']
    CAL_FNR = ['CAL_False_Negative_Rate: ']
    CAL_SPEC = ['CAL_Specifity']

    def compute_metric(model):
        
        
        #model = model.fit(X_train,y_train)   #<-- model już przećwiczył się na pełnych danych
        cm = confusion_matrix(y_test, model.predict(X_test))
        tn, fp, fn, tp = cm.ravel()
        # print('tn: ',tn)
        # print('fp: ',fp)
        # print('fn: ',fn)
        # print('tp: ',tp)
        # print('------------------')
        # print(cm) 
        

        FPR = np.round(fp/(fp + tn),decimals=3)
        TPR = np.round(tp/(tp + fn),decimals=3)
        FNR = np.round(fn/(tp + fn),decimals=3)
        SPEC = np.round(tn/(tn + fp),decimals=3)

        return FPR,TPR,FNR,SPEC

    for cls in six_classifiers:      
        
        results = compute_metric(cls)
        FPR.append(red(results[0],'bold'))
        TPR.append(red(results[1],'bold'))
        FNR.append(results[2])
        SPEC.append(results[3])

    t = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
    t.add_row(FPR)
    t.add_row(TPR)
    t.add_row(FNR)
    t.add_row(SPEC)

    print(blue('Models before calibration','bold'))
    g = (time.time() - start_time)
    g = np.round(g)
    print('time: %s seconds' % g)
    print(t)
  ## --------------------------------------------------  
    
    if calibration != True:
        print()
    else:    
        print(blue('Models after calibration','bold'))
  
        start_time = time.time()
    
        def calibration(model):
        
            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)
        
            ck = confusion_matrix(y_test, calibrated.predict(X_test))
            tn_c, fp_c, fn_c, tp_c = ck.ravel()
            # print('tn: ',tn)
            # print('fp: ',fp)
            # print('fn: ',fn)
            # print('tp: ',tp)
            # print('------------------')
            # print(cm) 
        

            CAL_FPR = np.round(fp_c/(fp_c + tn_c),decimals=3)
            CAL_TPR = np.round(tp_c/(tp_c + fn_c),decimals=3)
            CAL_FNR = np.round(fn_c/(tp_c + fn_c),decimals=3)
            CAL_SPEC = np.round(tn_c/(tn_c + fp_c),decimals=3)

            return CAL_FPR, CAL_TPR, CAL_FNR, CAL_SPEC

        for cls in six_classifiers:      
        
            results = calibration(cls)
            CAL_FPR.append(red(results[0],'bold'))
            CAL_TPR.append(red(results[1],'bold'))
            CAL_FNR.append(results[2])
            CAL_SPEC.append(results[3])

        k = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
        k.add_row(CAL_FPR)
        k.add_row(CAL_TPR)
        k.add_row(CAL_FNR)
        k.add_row(CAL_SPEC)
    
    
        n = (time.time() - start_time)
        n = np.round(n)
        print('time: %s seconds' % n)
        print(k)
    
    
    print(red('False_Positive_Rate','bold'),red('procentowy udział ludzi zdrowych uznanych przez model za chorych w populacji ludzi zdrowych','italic'))
    print(red('True_Positive_Rate RECALL','bold'), red('procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi chorych ogółem','italic'))
    print(black('False_Negative_Rate','bold'), black('procentowy udział niewykrytych chorych w populacji ludzi chorych ogółem','italic'))
    print(black('Specifity','bold'), black('procentowy udział ludzi zdrowych uznanych za zdrowych w populacji ludzi zdrowych','italic'))
In [23]:
Type_error(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 4.0 seconds
+-----------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|          Name         |  SVM  |  CBC  |  XGB  |  LGBM |  KNN  |  NBC  |  LRE  |  RFC  |  GBC  |
+-----------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|  False_Positive_Rate: | 0.315 | 0.074 | 0.018 | 0.065 |  0.02 | 0.216 | 0.285 |  0.0  | 0.201 |
|  True_Positive_Rate:  | 0.838 |  0.3  | 0.038 | 0.312 | 0.075 | 0.775 |  0.8  | 0.012 |  0.7  |
| False_Negative_Rate:  | 0.162 |  0.7  | 0.962 | 0.688 | 0.925 | 0.225 |  0.2  | 0.988 |  0.3  |
|       Specifity       | 0.685 | 0.926 | 0.982 | 0.935 |  0.98 | 0.784 | 0.715 |  1.0  | 0.799 |
+-----------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

False_Positive_Rate procentowy udział ludzi zdrowych uznanych przez model za chorych w populacji ludzi zdrowych
True_Positive_Rate RECALL procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi chorych ogółem
False_Negative_Rate procentowy udział niewykrytych chorych w populacji ludzi chorych ogółem
Specifity procentowy udział ludzi zdrowych uznanych za zdrowych w populacji ludzi zdrowych
In [24]:
Type_error(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 7.0 seconds
+-----------------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
|          Name         | gs_SVM | gs_SVM2 | gs_XGB | gs_LGBM | gs_KNN | gs_NBC | gs_LRE | gs_RFC | gs_RFC2 |
+-----------------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
|  False_Positive_Rate: | 0.315  |  0.216  | 0.007  |  0.052  | 0.019  | 0.315  | 0.287  | 0.192  |  0.343  |
|  True_Positive_Rate:  | 0.838  |   0.75  | 0.025  |  0.288  | 0.075  | 0.838  |  0.8   |  0.65  |  0.825  |
| False_Negative_Rate:  | 0.162  |   0.25  | 0.975  |  0.712  | 0.925  | 0.162  |  0.2   |  0.35  |  0.175  |
|       Specifity       | 0.685  |  0.784  | 0.993  |  0.948  | 0.981  | 0.685  | 0.713  | 0.808  |  0.657  |
+-----------------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+

False_Positive_Rate procentowy udział ludzi zdrowych uznanych przez model za chorych w populacji ludzi zdrowych
True_Positive_Rate RECALL procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi chorych ogółem
False_Negative_Rate procentowy udział niewykrytych chorych w populacji ludzi chorych ogółem
Specifity procentowy udział ludzi zdrowych uznanych za zdrowych w populacji ludzi zdrowych
In [25]:
Type_error(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 27.0 seconds
+-----------------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
|          Name         | SVM_b | CBC_b | XGB_b | LGBM_b | KNN_b | NBC_b | LRE_b | RFC_b | GBC_b |
+-----------------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
|  False_Positive_Rate: | 0.296 |  0.06 | 0.017 | 0.041  | 0.003 |  0.17 | 0.418 |  0.0  | 0.212 |
|  True_Positive_Rate:  | 0.838 | 0.325 |  0.05 | 0.275  | 0.012 | 0.712 | 0.875 | 0.012 | 0.675 |
| False_Negative_Rate:  | 0.162 | 0.675 |  0.95 | 0.725  | 0.988 | 0.288 | 0.125 | 0.988 | 0.325 |
|       Specifity       | 0.704 |  0.94 | 0.983 | 0.959  | 0.997 |  0.83 | 0.582 |  1.0  | 0.788 |
+-----------------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+

False_Positive_Rate procentowy udział ludzi zdrowych uznanych przez model za chorych w populacji ludzi zdrowych
True_Positive_Rate RECALL procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi chorych ogółem
False_Negative_Rate procentowy udział niewykrytych chorych w populacji ludzi chorych ogółem
Specifity procentowy udział ludzi zdrowych uznanych za zdrowych w populacji ludzi zdrowych

The SVM model, after being strengthened by baging, has a patient detection rate (RECALL) of 84%. Unfortunately, 30% of healthy people are indicated as sick. A similar result was achieved by the NBC (Gaussian NB) model after cross-vaidation and a slightly worse result for RFC2 (Random Forest Classifier) also after cross-validation.
In terms of computation time combined with the quality of the classification (i.e. the detection of recall and false judging of healthy people, the LRE model is unrivaled (recall 0.8 and False_Positive_Rate: 0.28. This model calculates the result very quickly and very well).

As Nietzsche recalls, the comparison leads to cognition. The combination of many models gives a lot of knowledge.

Confusion matrix

In [26]:
def confusion_matrix(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):
    
    from matplotlib import rcParams
    rcParams['axes.titlepad'] = 20 
    
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve 
    from sklearn.metrics import plot_confusion_matrix
    
    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train) 
    
    fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(14,10))
    target_names = ['0','1']


    for t,cls, ax in zip(name, six_classifiers, axes.flatten()):
        plot_confusion_matrix(cls, 
                              X_test, 
                              y_test, 
                              ax=ax, 
                              cmap='Blues',
                             display_labels=target_names,values_format='')
        ax.title.set_text(type(cls).__name__)
        ax.title.set_color('blue')
        ax.text(-0.5, -0.56, t,fontsize=12)
        ax.text(-0.5, 1.40, 'before calibration',color='black', fontsize=10) 
        
    plt.tight_layout()  
    plt.show()
    
### ---------------------------------------------------
    if calibration != True:
        print()
    else:    
        print(blue('Models after calibration','bold'))

    ### ---------------------------------------------------
    
    
        for cls in six_classifiers:
            calibrated = CalibratedClassifierCV(cls, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)
    
        fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(14,10))
        target_names = ['0','1']


        for t,cls, ax in zip(name, six_classifiers, axes.flatten()):
            plot_confusion_matrix(cls, 
                                  X_test, 
                                  y_test, 
                                  ax=ax, 
                                  cmap='Blues',
                                 display_labels=target_names,values_format='')
            ax.title.set_text(type(cls).__name__)
            ax.title.set_color('blue')
            ax.text(-0.5, -0.56, t,fontsize=12)
            ax.text(-0.5, 1.40, 'after calibration',color='red', fontsize=10)    ## podtytuł     
In [27]:
confusion_matrix(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)

In [28]:
confusion_matrix(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)

In [29]:
confusion_matrix(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)

Recall – Precision!obraz.png

In [30]:
def Recall_Precision(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):

    from sklearn.datasets import make_classification
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve  
    
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
    from sklearn.metrics import accuracy_score
    from mlxtend.plotting import plot_learning_curves
    from prettytable import PrettyTable
    import time   
    
    start_time = time.time()
    
    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train)
      
    
    Recall_Training = ['Recall Training: ']
    Precision_Training = ['Precision Training: ']
    Recall_Test= ['Recall Test: ']
    Precision_Test = ['Precision Test: ']
    
    CAL_Recall_Training = ['CAL_Recall Training: ']
    CAL_Precision_Training = ['CAL_Precision Training: ']
    CAL_Recall_Test= ['CAL_Recall Test: ']
    CAL_Precision_Test = ['CAL_Precision Test: ']   

    def compute_metric2(model):

        Recall_Training = np.round(recall_score(y_train, model.predict(X_train)), decimals=3)
        Precision_Training = np.round(precision_score(y_train, model.predict(X_train)), decimals=3)
        Recall_Test = np.round(recall_score(y_test, model.predict(X_test)), decimals=3) 
        Precision_Test = np.round(precision_score(y_test, model.predict(X_test)), decimals=3)
        
        return Recall_Training, Precision_Training, Recall_Test, Precision_Test
    
    for cls in six_classifiers:

        results = compute_metric2(cls)
        Recall_Training.append(results[0])
        Precision_Training.append(results[1])
        Recall_Test.append(blue(results[2],'bold'))
        Precision_Test.append((blue(results[3],'bold')))
   
    
    t = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
    t.add_row(Recall_Training)
    t.add_row(Precision_Training)
    t.add_row(Recall_Test)
    t.add_row(Precision_Test)

    
    print(blue('Models before calibration','bold'))
    g = (time.time() - start_time)
    g = np.round(g)
    print('time: %s seconds' % g)
    print(t)
    
  ### ---------------------------------------------------------  
    
    
    if calibration != True:
        print()
    else:    
        print(blue('Models after calibration','bold'))

            
        def calibration(model):
        
            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)
       
            
        
            CAL_Recall_Training = np.round(recall_score(y_train, calibrated.predict(X_train)), decimals=3)
            CAL_Precision_Training = np.round(precision_score(y_train, calibrated.predict(X_train)), decimals=3)
            CAL_Recall_Test = np.round(recall_score(y_test, calibrated.predict(X_test)), decimals=3) 
            CAL_Precision_Test = np.round(precision_score(y_test, calibrated.predict(X_test)), decimals=3)
        
            return CAL_Recall_Training, CAL_Precision_Training, CAL_Recall_Test, CAL_Precision_Test 
    
        start_time = time.time()
    
        for cls in six_classifiers:

            results = calibration(cls)
            CAL_Recall_Training.append(results[0])
            CAL_Precision_Training.append(results[1])
            CAL_Recall_Test.append(blue(results[2],'bold'))
            CAL_Precision_Test.append((blue(results[3],'bold')))
   
        k = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
        k.add_row(CAL_Recall_Training)
        k.add_row(CAL_Precision_Training)
        k.add_row(CAL_Recall_Test)
        k.add_row(CAL_Precision_Test)   
    
    
        n = (time.time() - start_time)
        n = np.round(n)
        print('time: %s seconds' % n)
        print(k)
    print(blue('Wskaźniki pokazują RECALL i PRECISION dla klasy 1','bold'))
    print(blue('RECALL', 'bold'), blue('procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych','italic'))
    print(blue('PRECISION', 'bold'), blue('procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)','italic'))
In [31]:
Recall_Precision(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 77.0 seconds
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|         Name         |  SVM  |  CBC  |  XGB  |  LGBM |  KNN  |  NBC  |  LRE  |  RFC  |  GBC  |
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|  Recall Training:    | 0.869 |  1.0  |  1.0  |  1.0  |  1.0  | 0.766 | 0.794 |  1.0  | 0.919 |
| Precision Training:  | 0.734 | 0.944 | 0.993 | 0.948 |  1.0  | 0.776 | 0.737 |  1.0  | 0.815 |
|    Recall Test:      | 0.838 |  0.3  | 0.038 | 0.312 | 0.075 | 0.775 |  0.8  | 0.012 |  0.7  |
|   Precision Test:    | 0.048 | 0.071 | 0.038 | 0.082 | 0.067 | 0.063 |  0.05 |  0.5  | 0.061 |
+----------------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

Wskaźniki pokazują RECALL i PRECISION dla klasy 1
RECALL procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych
PRECISION procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)
In [32]:
Recall_Precision(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 471.0 seconds
+----------------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
|         Name         | SVM_b | CBC_b | XGB_b | LGBM_b | KNN_b | NBC_b | LRE_b | RFC_b | GBC_b |
+----------------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
|  Recall Training:    | 0.831 |  1.0  |  1.0  |  1.0   |  1.0  | 0.638 | 0.862 |  1.0  | 0.922 |
| Precision Training:  | 0.737 | 0.955 | 0.994 | 0.967  |  1.0  | 0.783 | 0.672 |  1.0  | 0.811 |
|    Recall Test:      | 0.838 | 0.325 |  0.05 | 0.275  | 0.012 | 0.712 | 0.875 | 0.012 | 0.675 |
|   Precision Test:    |  0.05 | 0.093 | 0.053 | 0.111  | 0.083 | 0.073 | 0.038 |  1.0  | 0.056 |
+----------------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+

Wskaźniki pokazują RECALL i PRECISION dla klasy 1
RECALL procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych
PRECISION procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)
In [33]:
Recall_Precision(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 116.0 seconds
+----------------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
|         Name         | gs_SVM | gs_SVM2 | gs_XGB | gs_LGBM | gs_KNN | gs_NBC | gs_LRE | gs_RFC | gs_RFC2 |
+----------------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
|  Recall Training:    | 0.869  |  0.753  |  1.0   |   1.0   |  1.0   |  0.85  | 0.797  | 0.941  |  0.894  |
| Precision Training:  | 0.734  |  0.778  |  1.0   |  0.962  |  1.0   | 0.728  | 0.736  | 0.836  |  0.723  |
|    Recall Test:      | 0.838  |   0.75  | 0.025  |  0.288  | 0.075  | 0.838  |  0.8   |  0.65  |  0.825  |
|   Precision Test:    | 0.048  |  0.061  | 0.065  |  0.093  | 0.067  | 0.048  |  0.05  |  0.06  |  0.043  |
+----------------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+

Wskaźniki pokazują RECALL i PRECISION dla klasy 1
RECALL procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych
PRECISION procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)

Classification score

In [34]:
def classification_score(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):

    from sklearn.datasets import make_classification
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve   
    from sklearn.metrics import precision_recall_fscore_support as score
    import time   
    
    start_time = time.time()

    Precision_0 = ['Precision_0: ']
    Precision_1 = ['Precision_1: ']
    Recall_0 = ['Recall_0: ']
    Recall_1 = ['Recall_1: ']
    f1_score_0 = ['f1-score_0: ']
    f1_score_1 = ['f1-score_1: ']
    Support_0 = ['Support_0: ']
    Support_1 = ['Support_1: ']
    
    
    CAL_Precision_0 = ['CAL_Precision_0: ']
    CAL_Precision_1 = ['CAL_Precision_1: ']
    CAL_Recall_0 = ['CAL_Recall_0: ']
    CAL_Recall_1 = ['CAL_Recall_1: ']
    CAL_f1_score_0 = ['CAL_f1-score_0: ']
    CAL_f1_score_1 = ['CAL_f1-score_1: ']
    CAL_Support_0 = ['CAL_Support_0: ']
    CAL_Support_1 = ['CAL_Support_1: ']

    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train)
        
    
    def compute_metric4(model):

        precision, recall, fscore, support = score(y_test, model.predict(X_test))
    
        Precision_0 = np.round(precision[:1],decimals=3).item()
        Precision_1 = np.round(precision[1:],decimals=3).item()
        Recall_0 = np.round(recall[:1],decimals=3).item()
        Recall_1 = np.round(recall[1:],decimals=3).item()
        f1_score_0 = np.round(fscore[:1],decimals=3).item()
        f1_score_1 = np.round(fscore[1:],decimals=3).item()
        Support_0 = np.round(support[:1],decimals=3).item()
        Support_1 = np.round(support[1:],decimals=3).item()
        
        return Precision_0, Precision_1, Recall_0, Recall_1, f1_score_0, f1_score_1, Support_0, Support_1

    for cls in six_classifiers:

        results = compute_metric4(cls)
        Precision_0.append(results[0])
        Precision_1.append(blue(results[1],'bold'))
        Recall_0.append(results[2])
        Recall_1.append(blue(results[3],'bold'))
        f1_score_0.append(results[4])
        f1_score_1.append(blue(results[5],'bold'))
        Support_0.append(results[6])
        Support_1.append(blue(results[7],'bold'))
         

    t = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
    t.add_row(Precision_0)
    t.add_row(Precision_1)
    t.add_row(Recall_0)
    t.add_row(Recall_1)
    t.add_row(f1_score_0)
    t.add_row(f1_score_1)
    t.add_row(Support_0)
    t.add_row(Support_1)


    print(blue('Models before calibration','bold'))
    g = (time.time() - start_time)
    g = np.round(g)
    print('time: %s seconds' % g)
    print(t)
    
   ## ------------------------------------------

    if calibration != True:
        print()
    else:    
        print(blue('Models after calibration','bold'))
  
        start_time = time.time()
    
        def calibration(model):
        
            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)
            precision, recall, fscore, support = score(y_test, calibrated.predict(X_test))
                
            CAL_Precision_0 = np.round(precision[:1],decimals=3).item()
            CAL_Precision_1 = np.round(precision[1:],decimals=3).item()
            CAL_Recall_0 = np.round(recall[:1],decimals=3).item()
            CAL_Recall_1 = np.round(recall[1:],decimals=3).item()
            CAL_f1_score_0 = np.round(fscore[:1],decimals=3).item()
            CAL_f1_score_1 = np.round(fscore[1:],decimals=3).item()
            CAL_Support_0 = np.round(support[:1],decimals=3).item()
            CAL_Support_1 = np.round(support[1:],decimals=3).item()
        
            return CAL_Precision_0, CAL_Precision_1, CAL_Recall_0, CAL_Recall_1, CAL_f1_score_0, CAL_f1_score_1, CAL_Support_0, CAL_Support_1
    
        for cls in six_classifiers:

            results = calibration(cls)
            CAL_Precision_0.append(results[0])
            CAL_Precision_1.append(blue(results[1],'bold'))
            CAL_Recall_0.append(results[2])
            CAL_Recall_1.append(blue(results[3],'bold'))
            CAL_f1_score_0.append(results[4])
            CAL_f1_score_1.append(blue(results[5],'bold'))
            CAL_Support_0.append(results[6])
            CAL_Support_1.append(blue(results[7],'bold'))
   
        k = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
        k.add_row(CAL_Precision_0)
        k.add_row(CAL_Precision_1)
        k.add_row(CAL_Recall_0)
        k.add_row(CAL_Recall_1)
        k.add_row(CAL_f1_score_0)
        k.add_row(CAL_f1_score_1)
        k.add_row(CAL_Support_0)
        k.add_row(CAL_Support_1)
    
        n = (time.time() - start_time)
        n = np.round(n)
        print('time: %s seconds' % n)
        print(k)
    print(blue('RECALL', 'bold'), blue('procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych','italic'))
    print(blue('PRECISION', 'bold'), blue('procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)','italic')) 
    
In [35]:
classification_score(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 4.0 seconds
+---------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|      Name     |  SVM  |  CBC  |  XGB  |  LGBM |  KNN  |  NBC  |  LRE  |  RFC  |  GBC  |
+---------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| Precision_0:  | 0.996 | 0.986 | 0.982 | 0.986 | 0.983 | 0.995 | 0.995 | 0.982 | 0.993 |
| Precision_1:  | 0.048 | 0.071 | 0.038 | 0.082 | 0.067 | 0.063 |  0.05 |  0.5  | 0.061 |
|   Recall_0:   | 0.685 | 0.926 | 0.982 | 0.935 |  0.98 | 0.784 | 0.715 |  1.0  | 0.799 |
|   Recall_1:   | 0.838 |  0.3  | 0.038 | 0.312 | 0.075 | 0.775 |  0.8  | 0.012 |  0.7  |
|  f1-score_0:  | 0.811 | 0.955 | 0.982 |  0.96 | 0.981 | 0.877 | 0.832 | 0.991 | 0.886 |
|  f1-score_1:  |  0.09 | 0.115 | 0.038 |  0.13 | 0.071 | 0.117 | 0.094 | 0.024 | 0.113 |
|  Support_0:   |  4260 |  4260 |  4260 |  4260 |  4260 |  4260 |  4260 |  4260 |  4260 |
|  Support_1:   |   80  |   80  |   80  |   80  |   80  |   80  |   80  |   80  |   80  |
+---------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

RECALL procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych
PRECISION procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)
In [36]:
classification_score(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 27.0 seconds
+---------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
|      Name     | SVM_b | CBC_b | XGB_b | LGBM_b | KNN_b | NBC_b | LRE_b | RFC_b | GBC_b |
+---------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
| Precision_0:  | 0.996 | 0.987 | 0.982 | 0.986  | 0.982 | 0.994 | 0.996 | 0.982 | 0.992 |
| Precision_1:  |  0.05 | 0.093 | 0.053 | 0.111  | 0.083 | 0.073 | 0.038 |  1.0  | 0.056 |
|   Recall_0:   | 0.704 |  0.94 | 0.983 | 0.959  | 0.997 |  0.83 | 0.582 |  1.0  | 0.788 |
|   Recall_1:   | 0.838 | 0.325 |  0.05 | 0.275  | 0.012 | 0.712 | 0.875 | 0.012 | 0.675 |
|  f1-score_0:  | 0.825 | 0.963 | 0.983 | 0.972  |  0.99 | 0.904 | 0.735 | 0.991 | 0.879 |
|  f1-score_1:  | 0.095 | 0.144 | 0.052 | 0.158  | 0.022 | 0.132 | 0.073 | 0.025 | 0.104 |
|  Support_0:   |  4260 |  4260 |  4260 |  4260  |  4260 |  4260 |  4260 |  4260 |  4260 |
|  Support_1:   |   80  |   80  |   80  |   80   |   80  |   80  |   80  |   80  |   80  |
+---------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+

RECALL procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych
PRECISION procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)
In [37]:
classification_score(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 7.0 seconds
+---------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
|      Name     | gs_SVM | gs_SVM2 | gs_XGB | gs_LGBM | gs_KNN | gs_NBC | gs_LRE | gs_RFC | gs_RFC2 |
+---------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
| Precision_0:  | 0.996  |  0.994  | 0.982  |  0.986  | 0.983  | 0.996  | 0.995  | 0.992  |  0.995  |
| Precision_1:  | 0.048  |  0.061  | 0.065  |  0.093  | 0.067  | 0.048  |  0.05  |  0.06  |  0.043  |
|   Recall_0:   | 0.685  |  0.784  | 0.993  |  0.948  | 0.981  | 0.685  | 0.713  | 0.808  |  0.657  |
|   Recall_1:   | 0.838  |   0.75  | 0.025  |  0.288  | 0.075  | 0.838  |  0.8   |  0.65  |  0.825  |
|  f1-score_0:  | 0.811  |  0.877  | 0.988  |  0.966  | 0.982  | 0.812  | 0.831  | 0.891  |  0.791  |
|  f1-score_1:  |  0.09  |  0.113  | 0.036  |  0.141  | 0.071  |  0.09  | 0.094  |  0.11  |  0.082  |
|  Support_0:   |  4260  |   4260  |  4260  |   4260  |  4260  |  4260  |  4260  |  4260  |   4260  |
|  Support_1:   |   80   |    80   |   80   |    80   |   80   |   80   |   80   |   80   |    80   |
+---------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+

RECALL procentowy udział chorych dobrze zdiagnozowanych wśród wszystkich ludzi chorych
PRECISION procentowy udział chorych dobrze zdiagnozowanych w populacji ludzi zdiagnozowanych fałszywie (zdrowych uznanych przez model za chorych) i dobrze zdiagnozowanych (chorych uznanych przez model za chorych)

AUC score

In [38]:
def AUC_score(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):
    
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from sklearn import metrics
    import time   
    
    start_time = time.time()

    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train)    
    
    AUC_train = ['AUC_train: ']
    AUC_test = ['AUC_test: ']
    CAL_AUC_train = ['AUC_train: ']
    CAL_AUC_test = ['AUC_test: ']
    
    
    def compute_metric(model):

        auc_train = np.round(metrics.roc_auc_score(y_train,model.predict_proba(X_train)[:,1]),decimals=3)
        auc_test = np.round(metrics.roc_auc_score(y_test,model.predict_proba(X_test)[:,1]),decimals=3)

        return auc_train, auc_test

    for cls in six_classifiers:

        results = compute_metric(cls)
        AUC_train.append(results[0])
        AUC_test.append(blue(results[1],'bold'))


    t = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
    t.add_row(AUC_train)
    t.add_row(AUC_test)
    
    
    print(blue('Models before calibration','bold'))
    g = (time.time() - start_time)
    g = np.round(g)
    print('time: %s secondS' % g)
    print(t)
    
    if calibration != True:
        print()
    else:    
        print(blue('Models after calibration','bold'))
    
        start_time = time.time()
    
        def calibration(model):
        
            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)
         
            CAL_AUC_train = np.round(metrics.roc_auc_score(y_train,calibrated.predict_proba(X_train)[:,1]),decimals=3)
            CAL_AUC_test = np.round(metrics.roc_auc_score(y_test,calibrated.predict_proba(X_test)[:,1]),decimals=3)

            return CAL_AUC_train, CAL_AUC_test

    
        for cls in six_classifiers:

            results = calibration(cls)
            CAL_AUC_train.append(results[0])
            CAL_AUC_test.append(blue(results[1],'bold'))
       
   
        k = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
        k.add_row(CAL_AUC_train)
        k.add_row(CAL_AUC_test)
    
        n = (time.time() - start_time)
        n = np.round(n)
        print('time: %s seconds' % n)    
        print(k)
    
In [39]:
AUC_score(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 37.0 secondS
+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
|     Name    |  SVM  |  CBC  |  XGB  |  LGBM |  KNN  |  NBC  |  LRE  |  RFC  |  GBC  |
+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| AUC_train:  |  0.85 | 0.996 |  1.0  | 0.999 |  1.0  |  0.84 | 0.827 |  1.0  |  0.93 |
|  AUC_test:  | 0.848 | 0.784 | 0.755 | 0.794 | 0.528 | 0.847 | 0.844 | 0.802 | 0.823 |
+-------------+-------+-------+-------+-------+-------+-------+-------+-------+-------+

In [40]:
AUC_score(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 236.0 secondS
+-------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
|     Name    | SVM_b | CBC_b | XGB_b | LGBM_b | KNN_b | NBC_b | LRE_b | RFC_b | GBC_b |
+-------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
| AUC_train:  | 0.844 | 0.999 |  1.0  |  1.0   |  1.0  | 0.836 | 0.793 |  1.0  | 0.937 |
|  AUC_test:  |  0.85 | 0.827 | 0.787 | 0.819  | 0.611 | 0.844 | 0.782 | 0.827 | 0.817 |
+-------------+-------+-------+-------+--------+-------+-------+-------+-------+-------+

In [41]:
AUC_score(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 56.0 secondS
+-------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
|     Name    | gs_SVM | gs_SVM2 | gs_XGB | gs_LGBM | gs_KNN | gs_NBC | gs_LRE | gs_RFC | gs_RFC2 |
+-------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
| AUC_train:  |  0.85  |  0.849  |  1.0   |   1.0   |  1.0   | 0.845  | 0.827  | 0.958  |  0.859  |
|  AUC_test:  | 0.848  |  0.847  | 0.772  |  0.779  | 0.528  | 0.843  | 0.843  | 0.806  |  0.831  |
+-------------+--------+---------+--------+---------+--------+--------+--------+--------+---------+

Binary Classficators Plots obraz.png

In [42]:
def BinaryClassPlot(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):
    
    import time
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from matplotlib import rcParams      ## Robie odstęp na podtytuł
    rcParams['axes.titlepad'] = 20 
    
    start_time = time.time()
    
    from plot_metric.functions import BinaryClassification

    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train) 
       
    plt.figure(figsize=(15,10))
    grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)
        
    for i in range(9):
        col, row = i%3,i//3
        ax = plt.subplot(grid[row,col]) 
        ax.title.set_color('blue')
            
        model = six_classifiers[i]
        bc = BinaryClassification(y_test, model.predict_proba(X_test)[:,1], labels=["Class 1", "Class 2"])
        bc.plot_roc_curve(title=type(six_classifiers[i]).__name__)
        ax.text(0.0, 1.09, 'before calibration',color='black', fontsize=10) 
        ax.text(0.5, 1.09, name[i],fontsize=10)    ## podtytuł

 ### ------------------------------------------------------------------------------       
    if calibration != True:
        print()
    else:    
           
        #for cls in six_classifiers:
        #    cls.fit(X_train, y_train)
                            
        plt.figure(figsize=(15,10))
        grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

        for i in range(9):
            col, row = i%3,i//3
            ax = plt.subplot(grid[row,col]) 
            ax.title.set_color('blue')
            
            model = six_classifiers[i]
            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)        
                
            bc = BinaryClassification(y_test, calibrated.predict_proba(X_test)[:,1], labels=["Class 1", "Class 2"])
            bc.plot_roc_curve(title=type(six_classifiers[i]).__name__)
            ax.text(0.0, 1.09, 'after calibration',color='red', fontsize=10)    ## podtytuł     
            ax.text(0.5, 1.09, name[i],fontsize=10)    ## podtytuł   
        
        
        n = (time.time() - start_time)
        n = np.round(n)
        print('time: %s seconds' % n)    
In [43]:
BinaryClassPlot(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)

In [44]:
BinaryClassPlot(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)

In [45]:
BinaryClassPlot(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)

The analysis shows that the best models in the basic version and after bagging are LRE and SVC.
In the version after cross-validation, SVM, NBC, LRE, RFC2, SVM2 deserve attention.

ROC AUC plots

In [46]:
def plot_roc(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):
    
    import time
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from matplotlib import rcParams      ## Robie odstęp na podtytuł
    rcParams['axes.titlepad'] = 20 
    
    import scikitplot as skplt
    
    start_time = time.time()
    
    plt.figure(figsize=(15,10))
    grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train)

    for i in range(9):

        col, row = i%3,i//3
        ax = plt.subplot(grid[row,col]) 
        ax.title.set_color('blue')

        model = classifiers_A[i]
        skplt.metrics.plot_roc(y_test, model.predict_proba(X_test), ax=ax, title=type(six_classifiers[i]).__name__)
        ax.text(0.5, 1.09, name[i],fontsize=10)    ## podtytuł
        ax.text(0.0, 1.09, 'before calibration',color='black', fontsize=10)
## ---------------------------------------------------------------------------------------------------
    
    if calibration != True:
        print()
    else:    
    
    
        plt.figure(figsize=(15,10))
        grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)
    
    
        #for cls in six_classifiers:
        #    cls.fit(X_train, y_train)

        for i in range(9):

            col, row = i%3,i//3
            ax = plt.subplot(grid[row,col]) 
            ax.title.set_color('blue')

            model = classifiers_A[i]
            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)        
        
            skplt.metrics.plot_roc(y_test, calibrated.predict_proba(X_test), ax=ax, title=type(six_classifiers[i]).__name__)
            ax.text(0.5, 1.09, name[i],fontsize=10)    ## podtytuł
            ax.text(0.0, 1.09, 'after calibration',color='red', fontsize=10)    ## podtytuł  
    
    n = (time.time() - start_time)
    n = np.round(n)
    print('time: %s seconds' % n)    
In [47]:
plot_roc(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)
time: 5.0 seconds
In [48]:
plot_roc(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)
time: 5.0 seconds
In [49]:
plot_roc(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)
time: 5.0 seconds

W tym teście szczególnie waża jest różnica pomiedzy krzywą micro-average ROC pokazaną na różowo oraz krzywą macro-average ROC pokazana na granatowo.
Idealnie gdy obie krzywe się pokrywają. Zbilansowanie klas prze oversampling poprawiło w wielu medelach spójność obu krzywych, w niektórych jednak pozostały duże różnice.

Jeżeli:

macro average ROC > micro average ROC
wtedy mówimy, że: “1 (minority) is better classified than 0 (majority) – macro > micro”

Jeżeli:

macro average ROC micro average ROC
wtedy mówimy, że: ‘0 (majority) is better classified than 1 (minority)- micro < macro’

Idealnie gdy krzywe micro i macro pokrywają się ze sobą. Taka sytuacja ma miejsce po oversampling w GaussianNB oraz GradientBoostingClassifier.

def calibration_curve2(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=False):

from matplotlib import rcParams ## Robie odstęp na podtytuł
rcParams[‘axes.titlepad’] = 20

import scikitplot as skplt
from sklearn.calibration import CalibratedClassifierCV, calibration_curve

plt.figure(figsize=(15,10))
grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(9):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = classifiers_A[i]
A_probas = model.fit(X_train, y_train).predict_proba(X_test)
probas_list = [A_probas]

clf_names = [name[i]]

skplt.metrics.plot_calibration_curve(y_test,probas_list,clf_names,title=type(six_classifiers[i]).__name__,ax=ax)
ax.text(0.5, 1.09, name[i],fontsize=10) ## podtytuł
ax.text(0.0, 1.09, ‘before calibration’,color=’black’, fontsize=10)
### ———————————————————————————–

if calibration != True:
print()
else:
plt.figure(figsize=(15,10))
grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(9):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = classifiers_A[i]
calibrated = CalibratedClassifierCV(model, method=’sigmoid’, cv=5)
calibrated.fit(X_train, y_train)

A_probas = calibrated.fit(X_train, y_train).predict_proba(X_test)
probas_list = [A_probas]

clf_names = [name[i]]

skplt.metrics.plot_calibration_curve(y_test,probas_list,clf_names,title=type(six_classifiers[i]).__name__,ax=ax)
ax.text(0.5, 1.09, name[i],fontsize=10) ## podtytuł
ax.text(0.0, 1.09, ‘after calibration’,color=’red’, fontsize=10) ## podtytuł calibration_curve2(classifiers_A,nameA,X_train, y_train,X_test,y_test,calibration=False)calibration_curve2(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)calibration_curve2(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)

Cohen Kappa Metric


$ bbox[20px,border:1px solid red]
{
κ = displaystylefrac{(p_o – p_e)}{(1 – p_e)}=1-frac{1 – p_e}{1 – p_e}
qquad
} $

where:

$ p_0 = displaystylefrac{(tn+??)}{(tn+fp+fn+??)}$

$ p_{empire} = displaystylefrac{(tn+fp)}{(tn+fp+fn+??)}timesfrac{(tn+fn)}{(tn+fp+fn+??)}$

$ p_{theory} = displaystylefrac{(fn+??)}{(tn+fp+fn+??)}timesfrac{(fp+??)}{(tn+fp+fn+??)}$

$ p_e = p_{empire}+p_{theory}$

obraz.png

In [50]:
def Cohen_Kappa(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=False):
    
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    from sklearn import metrics
    import simple_colors
    import time   
    
    start_time = time.time()
    
    κ = ['κ:']
    p0 = ['p0:']
    pe = ['pe:']
    
    κc = ['κ:']
    p0c = ['p0:']
    pec = ['pe:']    
    
    
    
    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train)    
    
    def compute_metric(model):
        
        from sklearn.metrics import confusion_matrix

        #model.fit(X_train,y_train)
        cm = confusion_matrix(y_test, model.predict(X_test))
        tn, fp, fn, tp = cm.ravel()     
        
        p0 = (tn+??)/(tn+fp+fn+??)
        P_empire = ((tn+fp)/(tn+fp+fn+??))*((tn+fn)/(tn+fp+fn+??))
        P_theory = ((fn+??)/(tn+fp+fn+??))*((fp+??)/(tn+fp+fn+??))
        pe = P_empire + P_theory
        κ = (p0-pe)/(1-pe)
        
        κ = np.round(κ,decimals=3)
        p0 = np.round(p0,decimals=3)
        pe = np.round(pe,decimals=3)
        
        return κ,p0, pe

    for cls in six_classifiers:
        
        results = compute_metric(cls)
        κ.append(blue(results[0],'bold'))
        p0.append(results[1])
        pe.append(results[2])
      

    t = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
    t.add_row(p0)
    t.add_row(pe)
    t.add_row(κ)

    print(blue('Models before calibration','bold'))
    g = (time.time() - start_time)
    g = np.round(g)
    print('time: %s second' % g)
    print(t)   
    print()
  ###------------------------------------------------------------  
    
    if calibration != True:
        print()
    else:   
        print(blue('Models after calibration','bold'))
        
        plt.figure(figsize=(15,10))
        grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)
        
        start_time = time.time()
    
        def compute_metric2(model):
            
            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)     
            
            calibrated = calibrated.fit(X_train,y_train)
            cm = confusion_matrix(y_test, calibrated.predict(X_test))
            tn, fp, fn, tp = cm.ravel()
   
        
            p0c = (tn+??)/(tn+fp+fn+??)
            P_empire = ((tn+fp)/(tn+fp+fn+??))*((tn+fn)/(tn+fp+fn+??))
            P_theory = ((fn+??)/(tn+fp+fn+??))*((fp+??)/(tn+fp+fn+??))
            pec = P_empire + P_theory
            κc = (p0c-pec)/(1-pec)
        
            κc = np.round(κc,decimals=3)
            p0c = np.round(p0c,decimals=3)
            pec = np.round(pec,decimals=3)
        
            return κc,p0c, pec

        for cls in six_classifiers:
        
            results = compute_metric2(cls)
            κc.append(blue(results[0],'bold'))
            p0c.append(results[1])
            pec.append(results[2])
      

        k = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
        k.add_row(p0c)
        k.add_row(pec)
        k.add_row(κc)

        n = (time.time() - start_time)
        n = np.round(n)
        print('time: %s second' % n)         
        print(k)
    
    print(blue('Obserwowana zgodność p0', 'underlined'))
    print(black('Jest to prawdopodobieństwo dobrego wyboru, to procent przypadków, które zostały sklasyfikowane prawidłowo w całej matrycy zamieszania, czyli prawdziwi chorzy zostali sklasyfikowani jako chorzy a prawdziwie zdrowi sklasyfikowani jako prawdziwie zdrowi','italic'))
    print(blue('Oczekiwana zgodność pe', 'underlined'))
    print(black('Jest to prawdopodobieństwo wyboru bezpośrednio związana z liczbą wystąpień każdej klasy. Jeżeli wystąpień klas było po równo (np. 1: 20 wystąpień i 0: 20 wystapień), czyli zbiór był zbilansowany, to prawdopodobieństwo wynosi 50%. ','italic'))
    print(blue('Cohen Kappa mówi, o ile lepszy jest model klasyfikacji (p0) od losowego klasyfikatora(pe), który przewiduje na podstawie częstotliwości klas.','italic'))
    print(black(''))
    print(black('Statystyka może być ujemna, co oznacza, że nie ma skutecznej zgodności między dwoma wskaźnikami lub zgodność jest gorsza niż losowa.'))
In [51]:
Cohen_Kappa(classifiers_A,nameA, X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 4.0 second
+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| Name |  SVM  |  CBC  |  XGB  |  LGBM |  KNN  |  NBC  |  LRE  |  RFC  |  GBC  |
+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+
| p0:  | 0.688 | 0.915 | 0.965 | 0.923 | 0.964 | 0.784 | 0.717 | 0.982 | 0.797 |
| pe:  | 0.669 | 0.907 | 0.964 | 0.914 | 0.962 | 0.763 | 0.698 | 0.981 | 0.779 |
|  κ:  | 0.057 | 0.088 |  0.02 | 0.104 | 0.052 | 0.085 | 0.062 | 0.024 | 0.082 |
+------+-------+-------+-------+-------+-------+-------+-------+-------+-------+


Obserwowana zgodność p0
Jest to prawdopodobieństwo dobrego wyboru, to procent przypadków, które zostały sklasyfikowane prawidłowo w całej matrycy zamieszania, czyli prawdziwi chorzy zostali sklasyfikowani jako chorzy a prawdziwie zdrowi sklasyfikowani jako prawdziwie zdrowi
Oczekiwana zgodność pe
Jest to prawdopodobieństwo wyboru bezpośrednio związana z liczbą wystąpień każdej klasy. Jeżeli wystąpień klas było po równo (np. 1: 20 wystąpień i 0: 20 wystapień), czyli zbiór był zbilansowany, to prawdopodobieństwo wynosi 50%. 
Cohen Kappa mówi, o ile lepszy jest model klasyfikacji (p0) od losowego klasyfikatora(pe), który przewiduje na podstawie częstotliwości klas.

Statystyka może być ujemna, co oznacza, że nie ma skutecznej zgodności między dwoma wskaźnikami lub zgodność jest gorsza niż losowa.
In [52]:
Cohen_Kappa(classifiers_B,nameB, X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 27.0 second
+------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
| Name | SVM_b | CBC_b | XGB_b | LGBM_b | KNN_b | NBC_b | LRE_b | RFC_b | GBC_b |
+------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
| p0:  | 0.706 | 0.929 | 0.966 | 0.946  | 0.979 | 0.827 | 0.587 | 0.982 | 0.786 |
| pe:  | 0.687 | 0.919 | 0.965 | 0.938  | 0.979 | 0.808 | 0.571 | 0.981 | 0.769 |
|  κ:  | 0.063 | 0.119 | 0.034 | 0.136  | 0.017 | 0.102 | 0.039 | 0.024 | 0.073 |
+------+-------+-------+-------+--------+-------+-------+-------+-------+-------+


Obserwowana zgodność p0
Jest to prawdopodobieństwo dobrego wyboru, to procent przypadków, które zostały sklasyfikowane prawidłowo w całej matrycy zamieszania, czyli prawdziwi chorzy zostali sklasyfikowani jako chorzy a prawdziwie zdrowi sklasyfikowani jako prawdziwie zdrowi
Oczekiwana zgodność pe
Jest to prawdopodobieństwo wyboru bezpośrednio związana z liczbą wystąpień każdej klasy. Jeżeli wystąpień klas było po równo (np. 1: 20 wystąpień i 0: 20 wystapień), czyli zbiór był zbilansowany, to prawdopodobieństwo wynosi 50%. 
Cohen Kappa mówi, o ile lepszy jest model klasyfikacji (p0) od losowego klasyfikatora(pe), który przewiduje na podstawie częstotliwości klas.

Statystyka może być ujemna, co oznacza, że nie ma skutecznej zgodności między dwoma wskaźnikami lub zgodność jest gorsza niż losowa.
In [53]:
Cohen_Kappa(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)
Models before calibration
time: 7.0 second
+------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
| Name | gs_SVM | gs_SVM2 | gs_XGB | gs_LGBM | gs_KNN | gs_NBC | gs_LRE | gs_RFC | gs_RFC2 |
+------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
| p0:  | 0.688  |  0.783  | 0.975  |  0.935  | 0.964  | 0.688  | 0.715  | 0.805  |   0.66  |
| pe:  | 0.669  |  0.764  | 0.975  |  0.927  | 0.962  | 0.669  | 0.696  | 0.789  |  0.643  |
|  κ:  | 0.057  |  0.082  | 0.026  |  0.117  | 0.053  | 0.057  | 0.061  | 0.078  |  0.049  |
+------+--------+---------+--------+---------+--------+--------+--------+--------+---------+


Obserwowana zgodność p0
Jest to prawdopodobieństwo dobrego wyboru, to procent przypadków, które zostały sklasyfikowane prawidłowo w całej matrycy zamieszania, czyli prawdziwi chorzy zostali sklasyfikowani jako chorzy a prawdziwie zdrowi sklasyfikowani jako prawdziwie zdrowi
Oczekiwana zgodność pe
Jest to prawdopodobieństwo wyboru bezpośrednio związana z liczbą wystąpień każdej klasy. Jeżeli wystąpień klas było po równo (np. 1: 20 wystąpień i 0: 20 wystapień), czyli zbiór był zbilansowany, to prawdopodobieństwo wynosi 50%. 
Cohen Kappa mówi, o ile lepszy jest model klasyfikacji (p0) od losowego klasyfikatora(pe), który przewiduje na podstawie częstotliwości klas.

Statystyka może być ujemna, co oznacza, że nie ma skutecznej zgodności między dwoma wskaźnikami lub zgodność jest gorsza niż losowa.

Matthews Correlation Coefficient MCC obraz.png

The Matthews Correlation Coefficient (MCC) has a range of -1 to 1, where -1 is a completely bad binary classifier and 1 is a completely valid binary classifier.


$ bbox[23px,border:1px solid red]
{
MCC = displaystylefrac{{(tp times tn)}-{(fp times fn)}}{(tp+fp)(tp+fn)(tn+fp)(tn+fn)}
qquad
} $

obraz.png

In [54]:
def MCC(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):
    
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    import time   
    
    start_time = time.time()
    
    from sklearn import metrics
    import simple_colors
    
    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train)    
    
    
    MCC = ['MCC:']
    
    def compute_metric(model):
        
        from sklearn.metrics import confusion_matrix

        #model.fit(X_train,y_train)
        cm = confusion_matrix(y_test, model.predict(X_test))
        tn, fp, fn, tp = cm.ravel()     
        
        MCC = ((tp*tn)-(fp*fn))/(((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))** .5)
        MCC = np.round(MCC,decimals=3)
        MCC
        
        return MCC

    for cls in six_classifiers:
        
        results = compute_metric(cls)
        MCC.append(results)
             

    t = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
    t.add_row(MCC)
    
    print('Matthews Correlation Coefficient MCC')
        
    ### ---------------------------------------------------
    
    print(blue('Models before calibration','bold'))
    g = (time.time() - start_time)
    g = np.round(g)
    print('time: %s seconds' % g)
    print(t)
    
    ### ---------------------------------------------------
        
    if calibration != True:
        print()
    else:   
        print(blue('Models after calibration','bold'))

   

        start_time = time.time()
    
        from sklearn import metrics
        import simple_colors
    
        #for cls in six_classifiers:
        #    cls.fit(X_train, y_train)    
    
    
        MCC = ['MCC:']
    
        def compute_metric(model):
        
            from sklearn.metrics import confusion_matrix

            calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
            calibrated.fit(X_train, y_train)    
            cm = confusion_matrix(y_test, calibrated.predict(X_test))
            tn, fp, fn, tp = cm.ravel()     
        
            MCC = ((tp*tn)-(fp*fn))/(((tp+fp)*(tp+fn)*(tn+fp)*(tn+fn))** .5)
            MCC = np.round(MCC,decimals=3)
            MCC
        
            return MCC

        for cls in six_classifiers:
        
            results = compute_metric(cls)
            MCC.append(results)
             

        k = PrettyTable(['Name', name[0],name[1],name[2],name[3],name[4],name[5],name[6],name[7],name[8]])
        k.add_row(MCC)
    
        n = (time.time() - start_time)
        n = np.round(n)
        print('time: %s seconds' % n)         
        print(k)
    
       
    print(black('Współczynnik korelacji Matthewsa (MCC) ma zakres od -1 do 1, gdzie -1 oznacza całkowicie zły klasyfikator binarny, a 1 oznacza całkowicie poprawny klasyfikator binarny','italic'))
    
In [55]:
MCC(classifiers_A,nameA, X_train, y_train,X_test,y_test,calibration=False)
Matthews Correlation Coefficient MCC
Models before calibration
time: 4.0 seconds
+------+------+-------+------+------+-------+------+-------+-------+-------+
| Name | SVM  |  CBC  | XGB  | LGBM |  KNN  | NBC  |  LRE  |  RFC  |  GBC  |
+------+------+-------+------+------+-------+------+-------+-------+-------+
| MCC: | 0.15 | 0.114 | 0.02 | 0.13 | 0.052 | 0.18 | 0.152 | 0.077 | 0.165 |
+------+------+-------+------+------+-------+------+-------+-------+-------+

Współczynnik korelacji Matthewsa (MCC) ma zakres od -1 do 1, gdzie -1 oznacza całkowicie zły klasyfikator binarny, a 1 oznacza całkowicie poprawny klasyfikator binarny
In [56]:
MCC(classifiers_B,nameB, X_train, y_train,X_test,y_test,calibration=False)
Matthews Correlation Coefficient MCC
Models before calibration
time: 27.0 seconds
+------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
| Name | SVM_b | CBC_b | XGB_b | LGBM_b | KNN_b | NBC_b | LRE_b | RFC_b | GBC_b |
+------+-------+-------+-------+--------+-------+-------+-------+-------+-------+
| MCC: | 0.158 | 0.145 | 0.034 | 0.151  | 0.025 |  0.19 | 0.124 | 0.111 |  0.15 |
+------+-------+-------+-------+--------+-------+-------+-------+-------+-------+

Współczynnik korelacji Matthewsa (MCC) ma zakres od -1 do 1, gdzie -1 oznacza całkowicie zły klasyfikator binarny, a 1 oznacza całkowicie poprawny klasyfikator binarny
In [57]:
MCC(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)
Matthews Correlation Coefficient MCC
Models before calibration
time: 7.0 seconds
+------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
| Name | gs_SVM | gs_SVM2 | gs_XGB | gs_LGBM | gs_KNN | gs_NBC | gs_LRE | gs_RFC | gs_RFC2 |
+------+--------+---------+--------+---------+--------+--------+--------+--------+---------+
| MCC: |  0.15  |  0.172  | 0.029  |  0.137  | 0.053  |  0.15  | 0.151  | 0.154  |  0.136  |
+------+--------+---------+--------+---------+--------+--------+--------+--------+---------+

Współczynnik korelacji Matthewsa (MCC) ma zakres od -1 do 1, gdzie -1 oznacza całkowicie zły klasyfikator binarny, a 1 oznacza całkowicie poprawny klasyfikator binarny

Trainsize

In [58]:
def Trainsize(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):
    
    import time
    from mlxtend.plotting import plot_learning_curves
    from sklearn.calibration import CalibratedClassifierCV, calibration_curve
    
    start_time = time.time()
    
    #for cls in six_classifiers:
    #    cls.fit(X_train, y_train) 
        
    plt.figure(figsize=(15,7))

    grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.4)

    for i in range(9):
        col, row = i%3,i//3
        ax = plt.subplot(grid[row,col]) 
        ax.title.set_text(type(six_classifiers[i]).__name__)
        ax.title.set_color('blue')
    
        model = six_classifiers[i]
        plot_learning_curves(X_train, y_train, 
                             X_test, y_test, 
                             model, print_model=False, style='ggplot')
        
        ### ---------------------------------------------------
        
    if calibration != True:
        print()
    else:   
        print('IN PENDING')
        #for cls in six_classifiers:
        #    cls.fit(X_train, y_train) 
        
        #plt.figure(figsize=(15,7))

        #grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.4)

        #for i in range(9):
        #    col, row = i%3,i//3
        #    ax = plt.subplot(grid[row,col]) 
        #    ax.title.set_text(type(six_classifiers[i]).__name__)
        #    ax.title.set_color('blue')
    
            
        #    model = six_classifiers[i]
        #    calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5)
        #    calibrated.fit(X_train, y_train) 
            
        #    plot_learning_curves(X_train, y_train, 
        #                        X_test, y_test, 
        #                         calibrated, print_model=False, style='ggplot')
                
        
        
    n = (time.time() - start_time)
    n = np.round(n)
    print('time: %s seconds' % n)           
    print('Jeżeli zbiór testowy i treningowy bardzo odstają od siebie oznacza to przeuczenie modelu')
    print('Znajduje się tu miejsce gdzie oba wykresy testowy i treningowy są najbliżej siebie.')
    print('Dla takiej wielkości próby model działa najlepiej w kontekście przeuczenia na wykresie należy brać pod uwagę wielkość błędu klasyfikacji (oś y)')
In [59]:
Trainsize(classifiers_A,nameA, X_train, y_train,X_test,y_test,calibration=False)
time: 1031.0 seconds
Jeżeli zbiór testowy i treningowy bardzo odstają od siebie oznacza to przeuczenie modelu
Znajduje się tu miejsce gdzie oba wykresy testowy i treningowy są najbliżej siebie.
Dla takiej wielkości próby model działa najlepiej w kontekście przeuczenia na wykresie należy brać pod uwagę wielkość błędu klasyfikacji (oś y)

Trainsize(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)def ks_statistic(six_classifiers,name, X_train, y_train,X_test,y_test,calibration=True):

from matplotlib import rcParams ## Robie odstęp na podtytuł
rcParams[‘axes.titlepad’] = 20
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
import scikitplot as skplt
import time
start_time = time.time()

plt.figure(figsize=(15,10))
grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(9):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = six_classifiers[i]
# skplt.metrics.plot_roc(y_test, model.predict_proba(X_test), ax=ax, title=type(six_classifiers[i]).__name__)
skplt.metrics.plot_ks_statistic(y_test, model.predict_proba(X_test), ax=ax,title=type(six_classifiers[i]).__name__)
ax.text(0.5, 1.04, name[i],fontsize=10) ## podtytuł
ax.text(0.0, 1.04, ‘before calibration’,color=’black’, fontsize=10)
### —————————————————

if calibration != True:
print()
else:

plt.figure(figsize=(15,10))
grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(9):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = six_classifiers[i]
calibrated = CalibratedClassifierCV(model, method=’sigmoid’, cv=5)
calibrated.fit(X_train, y_train)

skplt.metrics.plot_ks_statistic(y_test, calibrated.predict_proba(X_test), ax=ax,title=type(six_classifiers[i]).__name__)
ax.text(0.5, 1.04, name[i],fontsize=10) ## podtytuł
ax.text(0.0, 1.04, ‘after calibration’,color=’red’, fontsize=10) ## podtytuł

n = (time.time() – start_time)
n = np.round(n)
print(‘time: %s seconds’ % n) ks_statistic(classifiers_A,nameA, X_train, y_train,X_test,y_test,calibration=False)ks_statistic(classifiers_B,nameB,X_train, y_train,X_test,y_test,calibration=False)ks_statistic(classifiers_F,nameF,X_train, y_train,X_test,y_test,calibration=False)def precision_recall2(six_classifiers,name, X_train, y_train,X_test,y_test):

from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from matplotlib import rcParams ## Robie odstęp na podtytuł
rcParams[‘axes.titlepad’] = 20
import time
start_time = time.time()

import scikitplot as skplt

plt.figure(figsize=(15,10))
grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(9):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = six_classifiers[i]

skplt.metrics.plot_precision_recall(y_test, model.predict_proba(X_test), ax=ax,title=type(six_classifiers[i]).__name__)
ax.text(0.5, 1.09, name[i],fontsize=10) ## podtytuł
ax.text(0.0, 1.04, ‘before calibration’,color=’black’, fontsize=10)
### ———————————————————————

plt.figure(figsize=(15,10))
grid = plt.GridSpec(3, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(9):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = six_classifiers[i]
calibrated = CalibratedClassifierCV(model, method=’sigmoid’, cv=5)
calibrated.fit(X_train, y_train)

skplt.metrics.plot_precision_recall(y_test, calibrated.predict_proba(X_test), ax=ax,title=type(six_classifiers[i]).__name__)
ax.text(0.5, 1.09, name[i],fontsize=10) ## podtytuł
ax.text(0.0, 1.04, ‘after calibration’,color=’red’, fontsize=10) ## podtytuł

n = (time.time() – start_time)
n = np.round(n)
print(‘time: %s seconds’ % n)
print(blue(‘Jest to krzywa łącząca precyzję (PPV) i Recall (TPR) na jednym wykresie. Im wyższa krzywa na osi y, tym lepsza wydajność modelu. Informuje, przy którym recall, precision zaczyna spadać, może pomóc wybrać próg’))precision_recall2(classifiers_A,nameA, X_train, y_train,X_test,y_test)precision_recall2(classifiers_B,nameB, X_train, y_train, X_test, y_test)precision_recall2(classifiers_G,nameG,X_train, y_train,X_test,y_test)

Jak widac na wykresach problemem jest precyzjia klasy 1. Nie pomogła w tym zbilansowanie zbiorów przez oversampling.

def cumulative_gain(six_classifiers,name, X_train, y_train,X_test,y_test):

from matplotlib import rcParams ## Robie odstęp na podtytuł
rcParams[‘axes.titlepad’] = 20

import scikitplot as skplt

plt.figure(figsize=(15,7))
grid = plt.GridSpec(2, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(6):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = six_classifiers[i]

skplt.metrics.plot_cumulative_gain(y_test, model.predict_proba(X_test), ax=ax,title=type(six_classifiers[i]).__name__)
ax.text(0.5, 1.04, name[i],fontsize=10) ## podtytuł

plt.show()cumulative_gain(classifiers_A,nameA, X_train, y_train,X_test,y_test)cumulative_gain(classifiers_B,nameB, X_train, y_train, X_test, y_test)cumulative_gain(classifiers_G,nameG,X_train, y_train,X_test,y_test)def lift_curve(six_classifiers,name, X_train, y_train,X_test,y_test):

import scikitplot as skplt

plt.figure(figsize=(15,7))
grid = plt.GridSpec(2, 3, wspace=0.3, hspace=0.5)

#for cls in six_classifiers:
# cls.fit(X_train, y_train)

for i in range(6):

col, row = i%3,i//3
ax = plt.subplot(grid[row,col])
ax.title.set_color(‘blue’)

model = six_classifiers[i]

skplt.metrics.plot_lift_curve(y_test, model.predict_proba(X_test), ax=ax,title=type(six_classifiers[i]).__name__)
ax.text(0.5, 8.04, name[i],fontsize=12) ## podtytuł

plt.show()lift_curve(classifiers_A,nameA, X_train, y_train,X_test,y_test)lift_curve(classifiers_B,nameB, X_train, y_train, X_test, y_test)

Koniec pomiaru czasu

In [60]:
print('Time to complete the task')
print('minutes: ',
(time.time() - start_time)/60) ## koniec pomiaru czasu
t = (time.time() - start_time)/60
a,b = df.shape

print('Czas ile minut na jedne rekord: ',t/a)
Time to complete the task
minutes:  54.42599579890569
Czas ile minut na jedne rekord:  0.002508110512419963

obraz.png