Uncategorized - THE DATA SCIENCE LIBRARY
http://sigmaquality.pl/category/uncategorized/
A Recommender System in the Mill (Part 1): Building on Azure Synapse
https://sigmaquality.pl/moje-publikacje/a-recommender-system-in-the-mill-part-1-building-on-azure-synapse/ (Sat, 09 Aug 2025)
We are building a recommendation system for an online food wholesaler offering organic flours made from exotic and rare grains. The store serves approximately 20,000 customers across Europe, and the system’s goal is to suggest products that customers are likely to enjoy and ultimately purchase.

A key assumption is that recommendations do not need to operate in real time—the model will be refreshed periodically (e.g., once a week) based on accumulated data.


Data Sources

We draw on three primary data sources:

  1. Customer profile database – containing static client information: gender, age, country and place of residence, type of business activity (e.g., whether the customer is a dietitian, baker, restaurateur, or an instructor running bread-baking workshops).

  2. Transaction database (sales register) – containing purchase history: which products were bought, when transactions occurred, etc. Each transaction is linked to a specific customer, allowing tracking of preferences and shopping habits.

  3. Application log (behavioral) database – processed data describing customer behavior in the store (e.g., site interactions, reactions to promotions). This includes emotional and behavioral attributes such as: tendency to respond to promotions, speed of purchase decisions, likelihood of abandoning the cart, susceptibility to recommendations, or propensity to click banners. These features are extracted from logs and stored in a structured form for each customer.

Such data enables the recommender system to tailor offers more effectively. For example, a “promotion-oriented” customer might respond better to discounts, while a dietitian might be more interested in gluten-free products. Based on the sales register, we can identify customer preferences for specific products.
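For instance, a minimal pandas sketch of turning the sales register into a per-customer preference table (the column names below are assumptions for illustration, not the store's actual schema):

import pandas as pd

# Toy sales-register extract
sales = pd.DataFrame({
    "customer_id": [101, 101, 102, 102, 102],
    "product_id":  ["einkorn", "teff", "teff", "amaranth", "teff"],
    "quantity":    [2, 1, 3, 1, 2],
})

# Per-customer purchase volume per product: a simple preference signal
preferences = (sales.groupby(["customer_id", "product_id"])["quantity"]
                    .sum()
                    .unstack(fill_value=0))
print(preferences)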


Azure-Based Recommendation System

(Synapse + Azure ML with Microsoft Recommenders)

Azure does not provide a single, fully managed “out-of-the-box” recommendation service. Instead, we assemble a solution using several components:

  • Azure Synapse Analytics (or Azure Databricks) for data processing and model training.

  • Azure Machine Learning for experiment/model management, and optionally databases or services for serving results (e.g., Azure Cosmos DB, Azure SQL, or Azure Kubernetes Service + API).

  • Microsoft Solution Accelerators, such as the Moyo Azure Synapse Retail Recommender Solution, which provides an end-to-end retail product recommendation pipeline using Synapse (Spark), model training, and deployment as a service.

Concept: transactional data is processed in Synapse (Spark), the model is trained in Azure ML, and the recommendations are stored in a database or made available via an API. The entire pipeline can run in batch mode (e.g., weekly).
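As a toy illustration of that weekly batch flow, the sketch below uses simplified placeholders (a popularity ranking stands in for the real model; none of these functions are Azure APIs):

from collections import Counter, defaultdict

def load_interactions():
    # placeholder for reading the joined tables from Synapse / the data lake
    return [("u1", "teff", 3), ("u1", "einkorn", 1), ("u2", "teff", 2)]

def train_and_score(interactions, n=2):
    # placeholder "model": recommend globally popular items the user has not bought yet
    popularity = Counter(p for _, p, _ in interactions)
    seen = defaultdict(set)
    for u, p, _ in interactions:
        seen[u].add(p)
    return {u: [p for p, _ in popularity.most_common() if p not in seen[u]][:n]
            for u in seen}

def weekly_refresh():
    recs = train_and_score(load_interactions())
    print(recs)  # in production: write to Cosmos DB / SQL or expose via an API

weekly_refresh()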


Data Preparation

We collect customer, product, and interaction data.

  • Customer database: gender, age, country, etc.

  • Application logs: behavioral metrics.

  • Sales transactions: the key link between users and purchased products, forming the core training data.

These datasets are loaded into Azure Data Lake and processed/joined in Synapse (Apache Spark) or Azure Data Factory to create model-ready tables. Product attributes (e.g., flour categories, grain type, region of origin) can also be included as item features for more advanced approaches.
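A hedged PySpark sketch of that join step (the paths, file formats and column names are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

customers    = spark.read.parquet("abfss://lake/customers")
transactions = spark.read.parquet("abfss://lake/transactions")
behavior     = spark.read.parquet("abfss://lake/behavior_features")

# One row per (customer, product) with an implicit-feedback strength,
# enriched with profile and behavioral attributes
interactions = (transactions
    .groupBy("customer_id", "product_id")
    .agg(F.count("*").alias("purchase_count"))
    .join(customers, "customer_id", "left")
    .join(behavior, "customer_id", "left"))

interactions.write.mode("overwrite").parquet("abfss://lake/model_ready/interactions")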


Algorithm Selection and Model Training

Azure allows complete flexibility in recommendation methodology—collaborative filtering or content-based/hybrid approaches. Microsoft’s open-source Microsoft Recommenders library provides many algorithms and examples (e.g., ALS, Bayesian Personalized Ranking, SAR co-occurrence, sequential models, neural networks, LightGBM).

Collaborative Filtering (ALS)
The Moyo accelerator uses a matrix factorization ALS model trained on user–product transactional data (implicit feedback). After data cleaning and creating the interaction matrix, ALS in Spark MLlib is trained. The result is a model that predicts preference scores for each user–product pair based on latent vectors.

From this, we generate top-N recommendation lists per user (excluding already purchased products). These can be stored in Azure Cosmos DB and refreshed weekly by rerunning the pipeline. The same ALS model can also be used for item-to-item recommendations.
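A minimal Spark MLlib sketch of this step, assuming an interactions table like the one sketched earlier with integer-encoded ids (the hyperparameters are illustrative, not tuned values):

from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="customer_idx",          # ALS requires integer user/item ids
    itemCol="product_idx",
    ratingCol="purchase_count",
    implicitPrefs=True,              # transactions are implicit feedback
    rank=32, regParam=0.1, alpha=10.0,
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# Top-10 candidates per user; already-purchased items can then be
# removed with an anti-join against the interactions table
top10 = model.recommendForAllUsers(10)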

Content-Based / Hybrid
Alternatively, we can use user and product features to predict purchase likelihood. Microsoft proposes a LightGBM ranking model using:

  • user features (demographics, behavioral log indicators),

  • product features (category, grain type, etc.),

  • aggregated interaction stats (e.g., purchase counts per category).

Positive examples are historical purchases; negative samples can be generated. This allows recommending new products based on profile similarity without relying solely on co-purchase patterns.
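A hedged sketch of such a model on synthetic stand-in data (a plain classifier for brevity; LGBMRanker is LightGBM's ranking variant):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))        # stand-in for joined user/product/interaction features
y = rng.integers(0, 2, size=1000)     # 1 = historical purchase, 0 = sampled negative

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # predicted purchase likelihood per (user, product) pair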

Behavioral attributes from logs (e.g., “susceptibility to recommendations”) can also drive segmentation (e.g., k-means clustering) and influence how recommendations are ranked for different segments. In ALS, these features are not used during training but can be applied later for filtering or re-ranking.
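For example, a small scikit-learn sketch of that segmentation step (the behavioral columns are stand-ins):

import numpy as np
from sklearn.cluster import KMeans

behavior = np.random.default_rng(1).random((500, 4))  # e.g. promo affinity, decision speed, cart abandonment, banner clicks
segments = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(behavior)
# segments[i] can then decide, e.g., whether discounted items are boosted
# in customer i's recommendation list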

In practice, one could train ALS on purchases and LightGBM on features, then combine results (ensemble) or select the better approach.


Serving Recommendations

In a weekly batch mode, it’s often enough to store per-user top-10 lists in a database or warehouse (e.g., Azure Cosmos DB JSON, SQL tables) for the website to display in “Recommended for You” sections. Synapse integrates with Cosmos DB (Synapse Link) for fast loading. Orchestration can be done via Azure ML Pipelines or Azure Data Factory/Synapse Pipelines.
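For illustration, a per-user document could look like this (the shape and field names are assumptions, not a fixed schema):

import json

doc = {
    "id": "customer-101",
    "recommendations": [
        {"product_id": "teff", "score": 0.92},
        {"product_id": "einkorn", "score": 0.87},
    ],
    "refreshed_at": "2025-08-04",
}
print(json.dumps(doc, indent=2))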

For item-to-item recommendations (“Customers also bought…”), ALS similarity scores can be precomputed offline or exposed in real time via a REST API deployed on Azure Kubernetes Service (AKS) through Azure ML, integrated with Azure API Management.
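A small sketch of precomputing those item-to-item similarities from the ALS latent vectors (random placeholders stand in for the collected model.itemFactors):

import numpy as np

item_factors = np.random.default_rng(2).normal(size=(50, 32))  # placeholder item factors
unit = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
sim = unit @ unit.T                                            # cosine similarity matrix

def also_bought(item_idx, k=5):
    order = np.argsort(-sim[item_idx])          # most similar first
    return [i for i in order if i != item_idx][:k]

print(also_bought(0))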


Summary

Azure’s approach is highly flexible—algorithms can be tailored, and non-standard data (e.g., emotional attributes) can be incorporated in any way. It does, however, require more engineering work: preparing data, choosing/deploying algorithms, and setting up training/serving infrastructure (Synapse Spark or Azure ML).

Microsoft helps by providing sample code (Solution Accelerators, the Recommenders repo) and tight service integration (Synapse↔Cosmos DB, Azure ML→AKS).



Wojciech Moszczyński
Graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń. Specialist in econometrics, finance, data science, and management accounting. Focused on optimizing production and logistics processes. Conducts research in AI development and applications. Actively promotes machine learning and data science in business environments.

panda
https://sigmaquality.pl/uncategorized/tabela/ (Wed, 28 May 2025)
Statistical description:

df.describe()

Statistical description of selected column types only (e.g. only 'object' or 'float64'):

df.describe(include='float64')
df.describe(include='object')

Statistical description of numeric columns only:

df.describe(include=[np.number])

Show only the columns of type 'object':

df.describe(include=["object"]).columns

Filter discrete (categorical) variables into a separate dataframe:

cat_sdf = df.select_dtypes(include=['object']).copy()

Rounding:

print(f'{round(celsius, 2)}')

Printing with rounding (without hundredths), e.g.:

print(f'Kendall correlation coefficient: {tau:.2f}')

Check what data type each column has:

df.dtypes

Select columns by data type:

df.select_dtypes(include=[np.number])
df.select_dtypes('object')
df.select_dtypes('float')

How many cells are empty (NaN):

df.isnull().sum()

Show all missing cells graphically in Seaborn (violet heatmap; the gaps are the missing values):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Rows with missing data in column AAA (with loc):

df.loc[df.AAA.isnull(), :]
df[df['Shape Reported'].isnull()]

How many NaN values there are in column AAA:

df.AAA.isnull().sum()

Show outliers; on the box plots, the dots are the outliers:

data.plot(kind="box", subplots=True, figsize=(15,5), title="Data with Outliers")

Function removing outliers (replaces them with NaN):

def outlier_removal(X, factor):      # factor e.g. 1.5
    X = pd.DataFrame(X).copy()
    for i in range(X.shape[1]):
        x = pd.Series(X.iloc[:, i]).copy()
        q1 = x.quantile(0.25)
        q3 = x.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - (factor * iqr)
        upper_bound = q3 + (factor * iqr)
        X.iloc[((X.iloc[:, i] < lower_bound) | (X.iloc[:, i] > upper_bound)), i] = np.nan
    return X
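A quick usage example of the function above (the 100 in column x falls outside the IQR fence and becomes NaN):

import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1, 2, 3, 2, 100], "y": [5, 6, 5, 7, 6]})
clean = outlier_removal(data, factor=1.5)   # outliers become NaN
print(clean)
print(clean.dropna())                       # ...and can then be dropped or imputed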

Stacking in Trident 1.1 [Stroke_Prediction.csv]
https://sigmaquality.pl/uncategorized/stacking-in-trident-1-1-stroke_prediction-csv-090520201852/ (Sat, 09 May 2020)
090520201852

Someone recently told me that I do not write enough, so I will write:
It is very nice when we have an AUC of 85%. Unfortunately, this model is fit for the trash and needs to be improved a bit. This is the model that was supposed to find who had a stroke, and it said that no one had a stroke, because positive cases make up only 1-2% of the data.

I got so hooked that I started to contribute to stackoverflow!

https://stackoverflow.com/questions/31417487/sklearn-logisticregression-and-changing-the-default-threshold-for-classification/61644649#61644649

In [1]:
# https://github.com/dawidkopczyk/blog/blob/master/stacking.py

# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest):
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
    from sklearn.metrics import accuracy_score
    
    import scikitplot as skplt
    from plot_metric.functions import BinaryClassification
    from sklearn.metrics import precision_recall_curve

       
    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Confusion Matrix Test data")
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print('Valuation for test data only:')
    print(classification_report(ytest, model.predict(Xtest)))
    
    ## ----------AUC-----------------------------------------
     
    print('---------------------') 
    AUC_train_1 = metrics.roc_auc_score(ytrain, model.predict_proba(Xtrain)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(ytest, model.predict_proba(Xtest)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print('---------------------')    
    
    print("Accuracy Training data:     ", np.round(accuracy_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Accuracy Test data:         ", np.round(accuracy_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("----------------------------------------------------------------------")
    print('Valuation for test data only:')

    y_probas1 = model.predict_proba(Xtest)[:,1]
    y_probas2 = model.predict_proba(Xtest)

### ---plot_roc_curve--------------------------------------------------------
    plt.figure(figsize=(13,4))

    plt.subplot(1, 2, 1)
    bc = BinaryClassification(ytest, y_probas1, labels=["Class 1", "Class 2"])
    bc.plot_roc_curve() 


### --------precision_recall_curve------------------------------------------

    plt.subplot(1, 2, 2)
    precision, recall, thresholds = precision_recall_curve(ytest, y_probas1)

    plt.plot(recall, precision, marker='.', label=model)
    plt.title('Precision recall curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend(loc=(-0.30, -0.6))
    plt.show()

## ----------plot_roc-----------------------------------------

    skplt.metrics.plot_roc(ytest, y_probas2)
In [2]:
# General
import numpy as np

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB 

# Utilities
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score
from copy import copy as make_copy

from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from plot_metric.functions import BinaryClassification
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix


import warnings   
SEED = 2018
warnings.filterwarnings("ignore")
In [3]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/Stroke_Prediction.csv')
print(df.shape)
df.head(2)
(43400, 12)
Out[3]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke
0 31153 Male 1104.0 0 0 No children Rural 95.12 18.0 NaN 0
1 30650 Male 21204.0 1 0 Yes Private Urban 87.96 39.2 never smoked 0
In [4]:
import numpy as np

a, b = df.shape     # <- number of rows and columns


print('DISCRETE FUNCTIONS CODED')
print('------------------------')
for col in df.columns[1:]:            # skip column 0 ('ID')
    if df[col].dtypes == np.object:   # every object column...
        print(col, "---", df[col].dtypes)
        df[col] = pd.Categorical(df[col]).codes   # ...is encoded as integer codes
DISCRETE FUNCTIONS CODED
------------------------
Gender --- object
Ever_Married --- object
Type_Of_Work --- object
Residence --- object
Smoking_Status --- object
In [5]:
del df['ID']
df = df.dropna(how='any')
df.isnull().sum()
Out[5]:
Gender            0
Age_In_Days       0
Hypertension      0
Heart_Disease     0
Ever_Married      0
Type_Of_Work      0
Residence         0
Avg_Glucose       0
BMI               0
Smoking_Status    0
Stroke            0
dtype: int64
In [6]:
df.shape
Out[6]:
(41938, 11)
In [7]:
y = df['Stroke']
X = df.drop('Stroke', axis=1)
In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)
# If this raises an error, remove stratify=y.
In [9]:
y_train.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7f7a30b850>
In [10]:
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

Define Base (level 0) and Stacking (level 1) estimators

In [11]:
base_clf = [LogisticRegression(), RandomForestClassifier(),  ### the base models to be trained
            AdaBoostClassifier(), GaussianNB()]

stck_clf = LogisticRegression()  ### stacking (level 1) is done with LogisticRegression
#stck_clf = RandomForestClassifier()

Evaluate Base estimators separately

In [12]:
## Preliminary evaluation of the base estimators (models)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

for t in base_clf:
    
    # Set seed
    if 'random_state' in t.get_params().keys():  # check whether the model exposes this hyperparameter
        t.set_params(random_state=SEED)          # keep the model's default (factory) parameters,
                                                 # fixing only the random seed for reproducibility
    
    # Fit model
    t.fit(X_train, y_train)      # fit the next model from the loop
    
    # Predict
    y_pred = t.predict(X_test)   # predictions of the current model
    
    # Valuation
    acc = accuracy_score(y_test, y_pred)
    #pre = precision_score(y_test, y_pred,average = 'macro')
    #auc = roc_auc_score(y_test, y_pred)
    
    print('{} accuracy: {:.2f}'.format(t.__class__.__name__, acc*100))
    
    plt.figure(figsize=(7,3))
    y_probas1 = t.predict_proba(X_test)[:,1]
    bc = BinaryClassification(y_test, y_probas1, labels=[t.__class__.__name__]).plot_roc_curve()
    plt.show()
    AUC_train_1 = metrics.roc_auc_score(y_train, t.predict_proba(X_train)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(y_test, t.predict_proba(X_test)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print(classification_report(y_test, t.predict(X_test)))
    print('===============================================================')
LogisticRegression accuracy: 98.43
AUC_train: 0.690
AUC_test:  0.688
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
RandomForestClassifier accuracy: 98.46
AUC_train: 1.000
AUC_test:  0.803
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
AdaBoostClassifier accuracy: 98.46
AUC_train: 0.871
AUC_test:  0.856
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
GaussianNB accuracy: 94.93
AUC_train: 0.840
AUC_test:  0.847
              precision    recall  f1-score   support

           0       0.99      0.96      0.97      8259
           1       0.07      0.19      0.10       129

    accuracy                           0.95      8388
   macro avg       0.53      0.57      0.54      8388
weighted avg       0.97      0.95      0.96      8388

===============================================================

Create Hold Out predictions (meta-features)

In [13]:
def hold_out_predict(clf, X, y, cv):
        
    """Performing cross validation hold out predictions for stacking"""
    
    # DETERMINE THE DIMENSIONS
    n_classes = len(np.unique(y))                      # check the classes: len(np.unique(y)) = 2
    meta_features = np.zeros((X.shape[0], n_classes))  # BUILD THE SKELETON OF THE META-FEATURE MATRIX:
                                                       # one row per observation and 2 COLUMNS,
                                                       # filled entirely with zeros
    n_splits = cv.get_n_splits(X, y)                   # returns the number of splitting iterations of the cross-validator
    
    # Loop over folds
    print("Starting hold out prediction with {} splits for {}.".format(n_splits, clf.__class__.__name__))
    for train_idx, hold_out_idx in cv.split(X, y): 
        
        # Split data
        X_train = X[train_idx]                         # overwrite X_train inside the loop
        y_train = y[train_idx]                         # overwrite y_train inside the loop
        X_hold_out = X[hold_out_idx]
        
        # Fit estimator to K-1 parts and predict on hold out part
        est = make_copy(clf)
        est.fit(X_train, y_train)
        y_hold_out_pred = est.predict_proba(X_hold_out)
        
        # Fill in meta features
        meta_features[hold_out_idx] = y_hold_out_pred

    return meta_features                               # out-of-fold predicted probabilities, one column per class

Create meta-features for training data

In [14]:
# Define 6-fold CV     ## any number of folds can be used
cv = KFold(n_splits=6, random_state=SEED)    ## sets the number of splits in cross-validation

# Loop over classifier to produce meta features
meta_train = []
for clf in base_clf:
    
    # Create hold out predictions for a classifier
    meta_train_clf = hold_out_predict(clf, X_train, y_train, cv)
    
    # Remove redundant column
    meta_train_clf = np.delete(meta_train_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_train.append(meta_train_clf)
    
meta_train = np.array(meta_train).T 
Starting hold out prediction with 6 splits for LogisticRegression.
Starting hold out prediction with 6 splits for RandomForestClassifier.
Starting hold out prediction with 6 splits for AdaBoostClassifier.
Starting hold out prediction with 6 splits for GaussianNB.

Create meta-features for testing data

In [15]:
meta_test = []
for i in base_clf:
    
    # Create hold out predictions for a classifier
    i.fit(X_train, y_train)
    meta_test_clf = i.predict_proba(X_test)
    
    # Remove redundant column
    meta_test_clf = np.delete(meta_test_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_test.append(meta_test_clf)
    
meta_test = np.array(meta_test).T 

Predict on Stacking Classifier

In [16]:
# Set seed
if 'random_state' in stck_clf.get_params().keys():
    stck_clf.set_params(random_state=SEED)

# Optional (Add original features to meta)
original_flag = False
if original_flag:
    meta_train = np.concatenate((meta_train, X_train), axis=1)
    meta_test = np.concatenate((meta_test, X_test), axis=1)

# Fit model
stck_clf.fit(meta_train, y_train)

# Predict
y_pred = stck_clf.predict(meta_test)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred,average = 'macro')
auc = roc_auc_score(y_test, y_pred)

print('Stacking {} AUC: {:.4f}'.format(stck_clf.__class__.__name__, acc*100))   # note: acc, not auc, is printed here
Stacking LogisticRegression AUC: 98.4621
In [17]:
Classification_Assessment(stck_clf ,meta_train, y_train, meta_test, y_test)
Recall Training data:      0.0
Precision Training data:   0.0
----------------------------------------------------------------------
Recall Test data:          0.0
Precision Test data:       0.0
----------------------------------------------------------------------
Confusion Matrix Test data
[[8259    0]
 [ 129    0]]
----------------------------------------------------------------------
Valuation for test data only:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

---------------------
AUC_train: 0.840
AUC_test:  0.846
---------------------
Accuracy Training data:      0.9847
Accuracy Test data:          0.9846
----------------------------------------------------------------------
Valuation for test data only:

OVERSAMPLING

First, a hefty homemade definition:
In [18]:
def oversampling(ytrain, Xtrain):
    import matplotlib.pyplot as plt
    
    global Xtrain_OV
    global ytrain_OV

    class1 = np.round((sum(ytrain == 1)/(sum(ytrain == 0)+sum(ytrain == 1))),decimals=2)*100
    class0 = np.round((sum(ytrain == 0)/(sum(ytrain == 0)+sum(ytrain == 1))),decimals=2)*100
    
    print("y = 0: ", sum(ytrain == 0), '-------', class0, '%')
    print("y = 1: ", sum(ytrain == 1), '-------', class1, '%')
    print('--------------------------------------------------------')
    
    ytrain.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
    plt.show()
    print()
    
    # how many copies of the minority class are needed to balance the classes
    Proporcja = sum(ytrain == 0) / sum(ytrain == 1)
    Proporcja = np.round(Proporcja, decimals=0)
    Proporcja = Proporcja.astype(int)
       
    # replicate the minority-class rows Proporcja times and append them
    ytrain_OV = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0) 
    Xtrain_OV = pd.concat([Xtrain.loc[ytrain==1, :]] * Proporcja, axis = 0)
    
    ytrain_OV = pd.concat([ytrain, ytrain_OV], axis = 0).reset_index(drop = True)
    Xtrain_OV = pd.concat([Xtrain, Xtrain_OV], axis = 0).reset_index(drop = True)
    
    Xtrain_OV = pd.DataFrame(Xtrain_OV)
    ytrain_OV = pd.DataFrame(ytrain_OV)
    
    print("Before oversampling Xtrain:     ", Xtrain.shape)
    print("Before oversampling ytrain:     ", ytrain.shape)
    print('--------------------------------------------------------')
    print("After oversampling Xtrain_OV:  ", Xtrain_OV.shape)
    print("After oversampling ytrain_OV:  ", ytrain_OV.shape)
    print('--------------------------------------------------------')
    
    ax = plt.subplot(1, 2, 1)
    ytrain.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
    plt.show()
    
    # rebuild the oversampled target just for the second pie chart
    y_after = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0)
    y_after = pd.concat([ytrain, y_after], axis = 0).reset_index(drop = True)
    ax = plt.subplot(1, 2, 2)
    y_after.value_counts(dropna = False, normalize=True).plot(kind='pie',title='After oversampling')
    plt.show()

Read the data in again.

In [19]:
y = df['Stroke']
X = df.drop('Stroke', axis=1)
In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)
# If this raises an error, remove stratify=y.

Run the oversampling function on the training data.

In [21]:
oversampling(y_train, X_train)
y = 0:  33036 ------- 98.0 %
y = 1:  514 ------- 2.0 %
--------------------------------------------------------

Before oversampling Xtrain:      (33550, 10)
Before oversampling ytrain:      (33550,)
--------------------------------------------------------
After oversampling Xtrain_OV:   (66446, 10)
After oversampling ytrain_OV:   (66446, 1)
--------------------------------------------------------
In [22]:
X_train = Xtrain_OV.values
X_test = X_test.values
y_train = ytrain_OV.values
y_test = y_test.values

Define Base (level 0) and Stacking (level 1) estimators

In [23]:
base_clf = [LogisticRegression(), RandomForestClassifier(),  ### the base models to be trained
            AdaBoostClassifier(), GaussianNB()]

stck_OV = LogisticRegression()  ### stacking (level 1) is done with LogisticRegression
#stck_clf = RandomForestClassifier()
#stck_clf = RandomForestClassifier()

Evaluate Base estimators separately

In [24]:
## Preliminary evaluation of the base estimators (models)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

for t in base_clf:
    
    # Set seed
    if 'random_state' in t.get_params().keys():  # check whether the model exposes this hyperparameter
        t.set_params(random_state=SEED)          # keep the model's default (factory) parameters,
                                                 # fixing only the random seed for reproducibility
    
    # Fit model
    t.fit(X_train, y_train)      # fit the next model from the loop
    
    # Predict
    y_pred = t.predict(X_test)   # predictions of the current model
    
    # Valuation
    acc = accuracy_score(y_test, y_pred)
    #pre = precision_score(y_test, y_pred,average = 'macro')
    #auc = roc_auc_score(y_test, y_pred)
    
    print('{} accuracy: {:.2f}'.format(t.__class__.__name__, acc*100))
    
    plt.figure(figsize=(7,3))
    y_probas1 = t.predict_proba(X_test)[:,1]
    bc = BinaryClassification(y_test, y_probas1, labels=[t.__class__.__name__]).plot_roc_curve()
    plt.show()
    AUC_train_1 = metrics.roc_auc_score(y_train, t.predict_proba(X_train)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(y_test, t.predict_proba(X_test)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print(classification_report(y_test, t.predict(X_test)))
    print('===============================================================')
LogisticRegression accuracy: 67.26
AUC_train: 0.817
AUC_test:  0.819
              precision    recall  f1-score   support

           0       1.00      0.67      0.80      8259
           1       0.04      0.83      0.07       129

    accuracy                           0.67      8388
   macro avg       0.52      0.75      0.44      8388
weighted avg       0.98      0.67      0.79      8388

===============================================================
RandomForestClassifier accuracy: 98.39
AUC_train: 1.000
AUC_test:  0.775
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
AdaBoostClassifier accuracy: 70.68
AUC_train: 0.871
AUC_test:  0.849
              precision    recall  f1-score   support

           0       1.00      0.70      0.83      8259
           1       0.04      0.84      0.08       129

    accuracy                           0.71      8388
   macro avg       0.52      0.77      0.45      8388
weighted avg       0.98      0.71      0.81      8388

===============================================================
GaussianNB accuracy: 71.15
AUC_train: 0.840
AUC_test:  0.847
              precision    recall  f1-score   support

           0       1.00      0.71      0.83      8259
           1       0.04      0.84      0.08       129

    accuracy                           0.71      8388
   macro avg       0.52      0.77      0.46      8388
weighted avg       0.98      0.71      0.82      8388

===============================================================

Create Hold Out predictions (meta-features)

In [25]:
def hold_out_predict(clf, X, y, cv):
        
    """Performing cross validation hold out predictions for stacking"""
    
    # DETERMINE THE DIMENSIONS
    n_classes = len(np.unique(y))                      # check the classes: len(np.unique(y)) = 2
    meta_features = np.zeros((X.shape[0], n_classes))  # BUILD THE SKELETON OF THE META-FEATURE MATRIX:
                                                       # one row per observation and 2 COLUMNS,
                                                       # filled entirely with zeros
    n_splits = cv.get_n_splits(X, y)                   # returns the number of splitting iterations of the cross-validator
    
    # Loop over folds
    print("Starting hold out prediction with {} splits for {}.".format(n_splits, clf.__class__.__name__))
    for train_idx, hold_out_idx in cv.split(X, y): 
        
        # Split data
        X_train = X[train_idx]                         # overwrite X_train inside the loop
        y_train = y[train_idx]                         # overwrite y_train inside the loop
        X_hold_out = X[hold_out_idx]
        
        # Fit estimator to K-1 parts and predict on hold out part
        est = make_copy(clf)
        est.fit(X_train, y_train)
        y_hold_out_pred = est.predict_proba(X_hold_out)
        
        # Fill in meta features
        meta_features[hold_out_idx] = y_hold_out_pred

    return meta_features                               # out-of-fold predicted probabilities, one column per class

Create meta-features for training data

In [26]:
# Define 6-fold CV     ## any number of folds can be used
cv = KFold(n_splits=6, random_state=SEED)    ## sets the number of splits in cross-validation

# Loop over classifier to produce meta features
meta_train = []
for clf in base_clf:
    
    # Create hold out predictions for a classifier
    meta_train_clf = hold_out_predict(clf, X_train, y_train, cv)
    
    # Remove redundant column
    meta_train_clf = np.delete(meta_train_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_train.append(meta_train_clf)
    
meta_train = np.array(meta_train).T 
Starting hold out prediction with 6 splits for LogisticRegression.
Starting hold out prediction with 6 splits for RandomForestClassifier.
Starting hold out prediction with 6 splits for AdaBoostClassifier.
Starting hold out prediction with 6 splits for GaussianNB.

Create meta-features for testing data

In [27]:
meta_test = []
for i in base_clf:
    
    # Create hold out predictions for a classifier
    i.fit(X_train, y_train)
    meta_test_clf = i.predict_proba(X_test)
    
    # Remove redundant column
    meta_test_clf = np.delete(meta_test_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_test.append(meta_test_clf)
    
meta_test = np.array(meta_test).T 

Predict on Stacking Classifier

In [28]:
# Set seed
if 'random_state' in stck_OV.get_params().keys():
    stck_OV.set_params(random_state=SEED)

# Optional (Add original features to meta)
original_flag = False
if original_flag:
    meta_train = np.concatenate((meta_train, X_train), axis=1)
    meta_test = np.concatenate((meta_test, X_test), axis=1)

# Fit model
stck_OV.fit(meta_train, y_train)

# Predict
y_pred = stck_OV.predict(meta_test)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred,average = 'macro')
auc = roc_auc_score(y_test, y_pred)

print('Stacking {} AUC: {:.4f}'.format(stck_OV.__class__.__name__, acc*100))   # note: acc, not auc, is printed here
Stacking LogisticRegression AUC: 98.4621
In [29]:
Classification_Assessment(stck_OV ,meta_train, ytrain_OV, meta_test, y_test)
Recall Training data:      1.0
Precision Training data:   0.9994
----------------------------------------------------------------------
Recall Test data:          0.0
Precision Test data:       0.0
----------------------------------------------------------------------
Confusion Matrix Test data
[[8259    0]
 [ 129    0]]
----------------------------------------------------------------------
Valuation for test data only:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

---------------------
AUC_train: 1.000
AUC_test:  0.406
---------------------
Accuracy Training data:      0.9997
Accuracy Test data:          0.9846
----------------------------------------------------------------------
Valuation for test data only:

If we were processing Titanic data, I would expect less of a catastrophe like this. All in all, I cannot explain what is wrong, because it should work normally: in the base models the threshold point (the red point) sat at the top of the ROC curve. Unfortunately, somewhere in building the second-level stacking classifier something got lost. It is not a one-off slip, because I repeated this analysis several times. Something is wrong, I do not know what, and I have no idea why.
Now we will start playing with threshold sensitivity control so that the model finally begins classifying observations as 1.

Based on the previous code, I wrote a program that adjusts the threshold. A thicket of numbers and names begins here, so I introduced colors into the printout.
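Before the full function, a minimal sketch of the idea, using the stacked model and meta-features defined above: lowering the decision threshold below the default 0.5 makes more observations fall into class 1, trading precision for recall.

import numpy as np
from sklearn.metrics import precision_score, recall_score

proba = stck_clf.predict_proba(meta_test)[:, 1]   # P(class = 1) from the stacker
for thr in (0.5, 0.3, 0.1):
    pred = np.where(proba >= thr, 1, 0)           # reclassify at threshold thr
    print(thr,
          'recall:', np.round(recall_score(y_test, pred), 3),
          'precision:', np.round(precision_score(y_test, pred), 3))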

Threshold

In [30]:
def Classification_Assessment_by_Threshold(model ,Xtrain, ytrain, Xtest, ytest, threshold):
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
    from sklearn.metrics import accuracy_score
    import scikitplot as skplt
    from plot_metric.functions import BinaryClassification
    from sklearn.metrics import precision_recall_curve
    
    ### --------color------------------
    import colorama
    from colorama import Fore, Style

    ### ---------------New Threshold----------------------------------------   
    
    PRED_Threshold = np.where((model.predict_proba(Xtest)[:, 1])>= threshold,1,0)
       
    
    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))    
    print("----------------------------------------------------------------------")    
    
    print(Fore.BLUE + "Recall Test data (new_threshold):         ", np.round(recall_score(ytest, PRED_Threshold), decimals=4)) 
    print("Precision Test data (new_threshold):      ", np.round(precision_score(ytest,PRED_Threshold), decimals=4))
    print("----------------------------------------------------------------------")
    print(Style.RESET_ALL)
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print(Fore.BLUE +"Confusion Matrix Test data - new_threshold: ",threshold)
    print(confusion_matrix(ytest, PRED_Threshold))
    print(Style.RESET_ALL)
    print("----------------------------------------------------------------------")
    
 # https://stackoverflow.com/questions/39473297/how-do-i-print-colored-output-with-python-3   
    print('Valuation for test data only:')
    print(classification_report(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print(Fore.BLUE +'Valuation for test data only (new_threshold):',threshold)
    print(classification_report(ytest, PRED_Threshold))
    print(Style.RESET_ALL)
    ## ----------AUC-----------------------------------------
     
    print('---------------------') 
    AUC_train_1 = metrics.roc_auc_score(ytrain, model.predict_proba(Xtrain)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(ytest, model.predict_proba(Xtest)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print(Fore.BLUE + 'AUC_test with new_threshold:', threshold)
    AUC_test_3 = metrics.roc_auc_score(ytest, PRED_Threshold) 
    print('AUC_test:  %.3f' % AUC_test_3)
    print('---------------------')    
    print(Style.RESET_ALL)
    
    print("Accuracy Training data:              ", np.round(accuracy_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Accuracy Test data                 : ", np.round(accuracy_score(ytest, model.predict(Xtest)), decimals=4))
    print(Fore.BLUE +"Accuracy Test data (new_threshold) : ", np.round(accuracy_score(ytest, PRED_Threshold), decimals=4)) 
    print("----------------------------------------------------------------------")
    print(Style.RESET_ALL)
    print('Valuation for test data only:')

    y_probas1 = PRED_Threshold
    y_probas3 = model.predict_proba(Xtest)[:,1]
    y_probas2 = model.predict_proba(Xtest)

### ---plot_roc_curve--------------------------------------------------------
    plt.figure(figsize=(13,4))

    plt.subplot(1, 2, 1)
    bc = BinaryClassification(ytest, y_probas1, labels=["Class 1", "Class 2"])
    bc2 = BinaryClassification(ytest, y_probas3, labels=["Class 1", "Class 2"])
    bc.plot_roc_curve()
    bc2.plot_roc_curve() 
    #plt.axvline(threshold, color = 'blue', linestyle = '--', label = 'new threshold')
    # plt.axvline(0.5, color = '#00C251', linestyle = '--', label = 'threshold = 0.5')

### --------precision_recall_curve------------------------------------------

    plt.subplot(1, 2, 2)
    precision, recall, thresholds = precision_recall_curve(ytest, y_probas1)
    plt.plot(recall, precision, marker='.', label=model)
    plt.title('Precision recall curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    #plt.legend(loc=(-0.30, -0.7))
    plt.show()

## ----------plot_roc-----------------------------------------

    skplt.metrics.plot_roc(ytest, y_probas2)
In [31]:
threshold = 0.3
Classification_Assessment_by_Threshold(stck_clf ,meta_train, y_train, meta_test, y_test, threshold)
Recall Training data:      0.8636
Precision Training data:   0.8517
----------------------------------------------------------------------
Recall Test data:          0.5504
Precision Test data:       0.0744
----------------------------------------------------------------------
Recall Test data (new_threshold):          0.7752
Precision Test data (new_threshold):       0.0522
----------------------------------------------------------------------

[[7376  883]
 [  58   71]]
Confusion Matrix Test data - new_threshold:  0.3
[[6442 1817]
 [  29  100]]

----------------------------------------------------------------------
Valuation for test data only:
              precision    recall  f1-score   support

           0       0.99      0.89      0.94      8259
           1       0.07      0.55      0.13       129

    accuracy                           0.89      8388
   macro avg       0.53      0.72      0.54      8388
weighted avg       0.98      0.89      0.93      8388

----------------------------------------------------------------------
Valuation for test data only (new_threshold): 0.3
              precision    recall  f1-score   support

           0       1.00      0.78      0.87      8259
           1       0.05      0.78      0.10       129

    accuracy                           0.78      8388
   macro avg       0.52      0.78      0.49      8388
weighted avg       0.98      0.78      0.86      8388


---------------------
AUC_train: 0.949
AUC_test:  0.854
AUC_test with new_threshold: 0.3
AUC_test:  0.778
---------------------

Accuracy Training data:               0.8558
Accuracy Test data                 :  0.8878
Accuracy Test data (new_threshold) :  0.7799
----------------------------------------------------------------------

Valuation for test data only:

Perfect model: Random forest classifier (1)
https://sigmaquality.pl/uncategorized/perfect-model-random-forest-classifier-1-230320201052/ (Mon, 23 Mar 2020)

Part 1: Determining the depth of trees using visualization

230320201052

 

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
In [2]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/kaggletrain.csv')
df = df.dropna(how='any')
print(df.columns)
print(df.shape)
df.dtypes
Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
(183, 13)
Out[2]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [3]:
del df['Unnamed: 0']
df.columns
Out[3]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [4]:
df.head(3)
Out[4]:
  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
 

Encoding categorical (object) data as numeric codes

In [5]:
df['Sex'] = pd.Categorical(df.Sex).codes
df['Ticket'] = pd.Categorical(df.Ticket).codes
df['Cabin'] = pd.Categorical(df.Cabin).codes
df['Embarked'] = pd.Categorical(df.Embarked).codes

df.dtypes
Out[5]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex               int8
Age            float64
SibSp            int64
Parch            int64
Ticket           int16
Fare           float64
Cabin            int16
Embarked          int8
dtype: object
In [6]:
df['Sex']=df['Sex'].astype('int64')
df['Age']=df['Age'].astype('int64')
df.dtypes
Out[6]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex              int64
Age              int64
SibSp            int64
Parch            int64
Ticket           int16
Fare           float64
Cabin            int16
Embarked          int8
dtype: object
 

Selection of variables divided into test and training set

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split  

df2 = df[['Sex','Age','Pclass','Survived']]
X = df2[['Sex','Age']]
y = df2['Survived']

print('X :',X.shape)
print('y :',y.shape)
#Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
X : (183, 2)
y : (183,)
 

Replacing dataframe with array

In [8]:
import numpy as np

y = np.asarray(y)
X = np.asarray(X)
In [9]:
print('X:',X.shape)
print('y:',y.shape)
X: (183, 2)
y: (183,)
 

Data normalization (standardization)

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)

 

How Random Forest classifies according to the depth of the tree

In [10]:
from helpers_05_08 import visualize_tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs

        
fig, ax = plt.subplots(1, 4, figsize=(16, 3))
fig.subplots_adjust(left=0.02, right=0.98, wspace=0.1)

#X, y = make_blobs(n_samples=300, centers=4,
#                  random_state=0, cluster_std=1.0)

for axi, depth in zip(ax, range(1,5)):
    model = DecisionTreeClassifier(max_depth=depth)
    visualize_tree(model, X, y, ax=axi)
    axi.set_title('depth = {0}'.format(depth))

    
     
/home/wojciech/ATOS/helpers_05_08.py:34: UserWarning: The following kwargs were not used by contour: 'clim'
  zorder=1)
(the same warning is emitted once per subplot)
 

Random Forest model, depth 4

In [11]:
## MODEL    
from sklearn.ensemble import RandomForestClassifier

RF4 = RandomForestClassifier(max_depth=4, random_state=0)
RF4.fit(X, y)

# Predicting the Test set results
y_pred4 = RF4.predict(X)    
    

    
    
from matplotlib.colors import ListedColormap 
  
X_set, y_set = X, y 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, 
             
             RF4.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'grey'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max())

  
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j) 
  
plt.title('Random Forest (Training set)') 
plt.xlabel('Sex') # for Xlabel 
plt.ylabel('Age') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
 

First of all, women were saved from the Titanic disaster, as well as babies, young boys up to 20 years old and young men from 20 to 30 years old. This is how the model classifies passengers based on two variables: sex and age.

Visualization of the Random Forest classification using trees of depth 6

In [12]:
def visualize_classifier(model, X, y, ax=None, cmap='Reds'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)

    ax.set(xlim=xlim, ylim=ylim)
    

## MODEL    
from sklearn.ensemble import RandomForestClassifier

RF6 = RandomForestClassifier(max_depth=6, random_state=0)
RF6.fit(X, y)

# Predicting the Test set results
y_pred6 = RF6.predict(X)    
    
    
visualize_classifier(RF6, X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:23: UserWarning: The following kwargs were not used by contour: 'clim'
In [13]:
visualize_classifier(DecisionTreeClassifier(), X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:23: UserWarning: The following kwargs were not used by contour: 'clim'
 

We run a forest of 240 trees, each of depth 6

In [14]:
## MODEL    
from sklearn.ensemble import RandomForestClassifier

RF6 = RandomForestClassifier(n_estimators=240, max_depth=6, random_state=0)
RF6.fit(X, y)

# Predicting the Test set results
y_pred6 = RF6.predict(X)    
    





from matplotlib.colors import ListedColormap 
  
X_set, y_set = X, y 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, 
             
             RF6.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'grey'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max())

  
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j) 
  
plt.title('Random Forest (Training set)') 
plt.xlabel('Sex') # for Xlabel 
plt.ylabel('Age') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
 

For the variables 'Sex' and 'Age', increasing the number of trees beyond 100 has no effect.

In [16]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

## source: https://www.dezyre.com/recipes/plot-validation-curve-in-python

## Convert the data frame to a matrix

import numpy as np
X = np.asarray(X)
Y = np.asarray(y)

digits = load_digits()
# Create feature matrix and target vector
# (note: this overwrites the Titanic X, y with the digits dataset)
X, y = digits.data, digits.target
# Plot Validation Curve
    
# Create range of values for parameter
param_range = np.arange(1, 275, 2)

# Calculate accuracy on training and test set using range of parameter values
train_scores, test_scores = validation_curve(RandomForestClassifier(max_depth=6),
                               X, y, param_name="n_estimators", param_range=param_range,
                               cv=4, scoring="accuracy", n_jobs=-1)

  # Calculate mean and standard deviation for training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

    # Calculate mean and standard deviation for test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

    # Plot mean accuracy scores for training and test sets
plt.subplots(1, figsize=(17,5))
plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")

    # Plot accurancy bands for training and test sets
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color="gray")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color="gainsboro")

    # Create plot    
plt.title("Validation Curve With Random Forest")
plt.xlabel("Number Of Trees")
plt.ylabel("Accuracy Score")
plt.tight_layout()
plt.legend(loc="best")
plt.show()

How to use PCA in logistic regression? https://sigmaquality.pl/uncategorized/how-to-use-pca-in-logistic-regression-230320200907/ Mon, 23 Mar 2020 08:10:05 +0000

230320200907

Principal component analysis (PCA)

Sources:
https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html
https://www.geeksforgeeks.org/principal-component-analysis-with-python/
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


df= pd.read_csv('/home/wojciech/Pulpit/1/Stroke_Prediction_NUM.csv')
print(df.shape)

df.head(5)
(29062, 20)
Out[1]:
Unnamed: 0 ID Gender Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years Age_years_10 Gender_C Ever_Married_C Type_Of_Work_C Residence_C Smoking_Status_C Age_years_10_C
0 0 30650 Male 1 0 Yes Private Urban 87.96 39.2 never smoked 0 58.093151 (53.126, 59.076] 1 1 2 1 1 5
1 1 57008 Female 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0 70.076712 (65.121, 74.11] 0 1 2 0 0 7
2 2 53725 Female 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0 52.041096 (48.082, 53.126] 0 1 2 1 0 4
3 3 41553 Female 0 1 Yes Self-employed Rural 243.53 27.0 never smoked 0 75.104110 (74.11, 82.137] 0 1 3 0 1 8
4 4 16167 Female 0 0 Yes Private Rural 77.67 32.3 smokes 0 32.024658 (29.055, 36.058] 0 1 2 0 2 1

Analysis of the class balance of the outcome variable

In [2]:
del df['Unnamed: 0']
df.Stroke.value_counts(dropna = False, normalize=True)
Out[2]:
0    0.981144
1    0.018856
Name: Stroke, dtype: float64
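
The classes are heavily imbalanced (about 1.9% positives). Besides the manual oversampling used below, a hedged alternative sketch is to let the loss function reweight the classes; this assumes scikit-learn's class_weight parameter is acceptable here, which the post itself does not use:

from sklearn.linear_model import LogisticRegression

# Reweight classes inversely to their frequency instead of duplicating rows
LR_w = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000)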
In [3]:
df.columns
Out[3]:
Index(['ID', 'Gender', 'Hypertension', 'Heart_Disease', 'Ever_Married',
       'Type_Of_Work', 'Residence', 'Avg_Glucose', 'BMI', 'Smoking_Status',
       'Stroke', 'Age_years', 'Age_years_10', 'Gender_C', 'Ever_Married_C',
       'Type_Of_Work_C', 'Residence_C', 'Smoking_Status_C', 'Age_years_10_C'],
      dtype='object')

Split into training and test sets

In [4]:
df2 = df[['Hypertension','Heart_Disease','Avg_Glucose','BMI','Stroke','Age_years','Gender_C','Ever_Married_C','Type_Of_Work_C','Residence_C','Smoking_Status_C','Age_years_10_C']]
In [5]:
y = df2['Stroke']
X = df2.drop('Stroke', axis=1)
In [6]:
from sklearn.model_selection import train_test_split 
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)

print ('X training set: ', Xtrain.shape)
print ('X test set:     ', Xtest.shape)
print ('y training set: ', ytrain.shape)
print ('y test set:     ', ytest.shape)
X training set:  (19471, 11)
X test set:      (9591, 11)
y training set:  (19471,)
y test set:      (9591,)
In [7]:
print("ytrain = 0: ", sum(ytrain == 0))
print("ytrain = 1: ", sum(ytrain == 1))
ytrain = 0:  19104
ytrain = 1:  367
In [8]:
Proporcja = sum(ytrain == 0) / sum(ytrain == 1)   # ratio of majority to minority class
Proporcja = np.round(Proporcja, decimals=0)
Proporcja = Proporcja.astype(int)
print('Number of 0 Stroke per 1 Stroke: ', Proporcja)
Number of 0 Stroke per 1 Stroke:  52
In [9]:
ytrain_OVSA = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0) 
ytrain_OVSA.count()
Out[9]:
19084

We have oversampled the minority class by replicating the Stroke = 1 rows 52 times, so the training set now contains roughly as many positive as negative examples. Next we add the corresponding rows of independent variables to the training set.

In [10]:
Xtrain_OVSA = pd.concat([Xtrain.loc[ytrain==1, :]] * Proporcja, axis = 0)
ytrain_OVSA.count()
Out[10]:
19084
In [11]:
ytrain_OVSA = pd.concat([ytrain, ytrain_OVSA], axis = 0).reset_index(drop = True)
Xtrain_OVSA = pd.concat([Xtrain, Xtrain_OVSA], axis = 0).reset_index(drop = True)

print("ilość elementów w zbiorze Xtrain:     ", Xtrain.BMI.count())
print("ilość elementów w zbiorze Xtrain_OVSA: ", Xtrain_OVSA.BMI.count())
print("ilość elementów w zbiorze ytrain:     ", ytrain.count())
print("ilość elementów w zbiorze ytrain_OVSA: ", ytrain_OVSA.count())
ilość elementów w zbiorze Xtrain:      19471
ilość elementów w zbiorze Xtrain_OVSA:  38555
ilość elementów w zbiorze ytrain:      19471
ilość elementów w zbiorze ytrain_OVSA:  38555
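
For comparison, a minimal sketch of the same upsampling done with sklearn.utils.resample (my own alternative, not from the post, which uses plain pd.concat):

from sklearn.utils import resample
import pandas as pd

# Upsample the minority class so both classes are roughly balanced;
# resample keeps X and y row-aligned. Xtrain/ytrain come from the split above.
X_min, y_min = Xtrain[ytrain == 1], ytrain[ytrain == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int(sum(ytrain == 0)), random_state=148)
Xtrain_bal = pd.concat([Xtrain[ytrain == 0], X_up]).reset_index(drop=True)
ytrain_bal = pd.concat([ytrain[ytrain == 0], y_up]).reset_index(drop=True)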

Class balance of the outcome variable after oversampling:

In [12]:
ytrain_OVSA.value_counts(dropna = False, normalize=True).plot(kind='pie')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff758b04950>

Logistic regression model

In [13]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain_OVSA, ytrain_OVSA) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Model assessment:

In [14]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain_OVSA, LR_Grid.predict(Xtrain_OVSA)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVSA, LR_Grid.predict(Xtrain_OVSA)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.7956
Precision Training data:   0.7522
----------------------------------------------------------------------
Recall Test data:          0.7735
Precision Test data:       0.0517
----------------------------------------------------------------------
Confusion Matrix Test data
[[6840 2570]
 [  41  140]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.73      0.84      9410
           1       0.05      0.77      0.10       181

    accuracy                           0.73      9591
   macro avg       0.52      0.75      0.47      9591
weighted avg       0.98      0.73      0.83      9591

auc 0.7501834770815108

Principal component analysis (PCA)

Standardization of Xtrain_OVSA and Xtest variables

In [15]:
from sklearn.preprocessing import StandardScaler 
sc = StandardScaler() 
  
X_train_PCA = sc.fit_transform(Xtrain_OVSA) 
X_test_PCA = sc.transform(Xtest)
In [16]:
print(X_train_PCA.shape)
print(X_test_PCA.shape)
(38555, 11)
(9591, 11)

PCA transformation of two variables

In [17]:
from sklearn.decomposition import PCA 
  
pca = PCA(n_components = 2) 
  
X_train_PCA2 = pca.fit_transform(X_train_PCA) 
X_test_PCA2 = pca.transform(X_test_PCA) 
In [18]:
pca.fit(X_train_PCA2)
Out[18]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [19]:
explained_variance = pca.explained_variance_ratio_ 
explained_variance
Out[19]:
array([0.63231701, 0.36768299])
In [20]:
pca.components_
Out[20]:
array([[1., 0.],
       [0., 1.]])

Note: components_ is the identity matrix here only because pca was refit on its own two-dimensional output in the previous cell; the loadings with respect to the original 11 features were computed by the first fit_transform.
In [21]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=3,
                    color='red',
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X_test_PCA2[:, 0], X_test_PCA2[:, 1], alpha=0.3)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

We feed the two PCA components into the logistic regression model again

In [22]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid2 = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid2.fit(X_train_PCA2, ytrain_OVSA) 
y_pred_LRC2 = LR_Grid2.predict(X_test_PCA2)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [23]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain_OVSA, LR_Grid2.predict(X_train_PCA2)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVSA, LR_Grid2.predict(X_train_PCA2)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid2.predict(X_test_PCA2)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid2.predict(X_test_PCA2)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid2.predict(X_test_PCA2)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid2.predict(X_test_PCA2)))
y_pred_proba = LR_Grid2.predict_proba(X_test_PCA2)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC2)
auc = metrics.roc_auc_score(ytest, y_pred_LRC2)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.782
Precision Training data:   0.7434
----------------------------------------------------------------------
Recall Test data:          0.7901
Precision Test data:       0.051
----------------------------------------------------------------------
Confusion Matrix Test data
[[6751 2659]
 [  38  143]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.72      0.83      9410
           1       0.05      0.79      0.10       181

    accuracy                           0.72      9591
   macro avg       0.52      0.75      0.46      9591
weighted avg       0.98      0.72      0.82      9591

auc 0.7537417582094985
In [24]:
print(X_train_PCA2.shape)
print(ytrain.shape)
(38555, 2)
(19471,)

PCA slightly improved the AUC, from 0.750 to 0.754.

Cluster visualisation

Such a graphical presentation is possible only because there are now two variables (after the PCA transformation; before it there were 11). The AUC before and after the PCA transformation is similar, about 0.75. Now we can see what the class separation looks like.

For the training set

In [25]:
# Predicting the training set 
# result through scatter plot  
from matplotlib.colors import ListedColormap 
  
X_set, y_set = X_train_PCA2, ytrain_OVSA 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, LR_Grid2.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'lightgreen'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max()) 
  
# wrapping the colour in a list avoids the matplotlib single-RGB warning
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = [ListedColormap(('red', 'green', 'blue'))(i)], label = j) 
  
plt.title('Logistic Regression (Training set)') 
plt.xlabel('PC1') # for Xlabel 
plt.ylabel('PC2') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()

For the test set

In [26]:
# Predicting the test set 
# result through scatter plot  
from matplotlib.colors import ListedColormap 
  
X_set, y_set = X_test_PCA2, ytest 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, LR_Grid2.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'lightgreen'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max()) 
  
# wrapping the colour in a list avoids the matplotlib single-RGB warning
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = [ListedColormap(('red', 'green', 'blue'))(i)], label = j) 
  
plt.title('Logistic Regression (Test set)') 
plt.xlabel('PC1') # for Xlabel 
plt.ylabel('PC2') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()

PCA transformation to three components

In [27]:
from sklearn.decomposition import PCA 
  
pca3 = PCA(n_components = 3) 
  
X_train_PCA3 = pca3.fit_transform(X_train_PCA) 
X_test_PCA3 = pca3.transform(X_test_PCA) 
In [28]:
pca3.fit(X_train_PCA)
Out[28]:
PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Explained variance of the top 3 components

The more variance the components explain, the more of the original information they retain.

In [29]:
explained_variance = pca3.explained_variance_ratio_ 
explained_variance
Out[29]:
array([0.20660751, 0.12013921, 0.10116037])
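
As a quick check, a short sketch (assuming the pca3 object fitted above is still in scope) of how much total variance the three components retain together:

import numpy as np

# Cumulative share of variance captured by the first 1, 2, 3 components
print(np.cumsum(pca3.explained_variance_ratio_))
# -> roughly [0.207, 0.327, 0.428]: three components keep about 43% of the variance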
In [30]:
pca.components_
Out[30]:
array([[1., 0.],
       [0., 1.]])

Note: this cell still prints the earlier two-component pca object, not pca3; pca3.components_ would be a 3 x 11 matrix of loadings on the original features.

Again, we feed the PCA components into the logistic regression model

In [31]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid3 = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid3.fit(X_train_PCA3, ytrain_OVSA) 
y_pred_LRC3 = LR_Grid3.predict(X_test_PCA3)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [32]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain_OVSA, LR_Grid3.predict(X_train_PCA3)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVSA, LR_Grid3.predict(X_train_PCA3)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid3.predict(X_test_PCA3)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid3.predict(X_test_PCA3)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid3.predict(X_test_PCA3)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid3.predict(X_test_PCA3)))
y_pred_proba = LR_Grid3.predict_proba(X_test_PCA3)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC3)
auc = metrics.roc_auc_score(ytest, y_pred_LRC3)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.7875
Precision Training data:   0.7413
----------------------------------------------------------------------
Recall Test data:          0.7901
Precision Test data:       0.0499
----------------------------------------------------------------------
Confusion Matrix Test data
[[6690 2720]
 [  38  143]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.71      0.83      9410
           1       0.05      0.79      0.09       181

    accuracy                           0.71      9591
   macro avg       0.52      0.75      0.46      9591
weighted avg       0.98      0.71      0.82      9591

auc 0.7505005254783613
In [34]:
print(X_train_PCA3.shape)
print(ytrain.shape)
(38555, 3)
(19471,)

Part. 2 How to improve the classification model? Principal component analysis (PCA) https://sigmaquality.pl/uncategorized/part-2-how-to-improve-the-classification-model_-principal-component-analysis-pca-200320200904/ Fri, 20 Mar 2020 08:08:11 +0000

200320200904

In this case, the method did not improve the model. However, there are models for which PCA is an important means of improving performance.
We load the data from the Titanic dataset.

In [1]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/kaggletrain.csv')
df = df.dropna(how='any')
df.dtypes
Out[1]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [2]:
df.columns
Out[2]:
Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [3]:
df.head(3)
Out[3]:
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S

Encoding text columns as numeric codes

In [4]:
df['Sex'] = pd.Categorical(df.Sex).codes
df['Ticket'] = pd.Categorical(df.Ticket).codes
df['Cabin'] = pd.Categorical(df.Ticket).codes   # note: encodes Ticket, not Cabin; likely a copy-paste slip, kept as in the original run
df['Embarked'] = pd.Categorical(df.Embarked).codes

Selection of variables and split into training and test sets

In [5]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)

Data normalization (standardization)

PCA works best with a standardized feature set, so we normalize the features with StandardScaler.

In [6]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)

Principal component analysis (PCA)

In [7]:
from sklearn.decomposition import PCA

pca = PCA()
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)

We did not specify the number of components in the constructor, so all 9 components are returned for both the training and the test set.

The PCA class exposes explained_variance_ratio_, which returns the fraction of variance explained by each component.

In [8]:
explained_variance = pca.explained_variance_ratio_
In [9]:
SOK = np.round(explained_variance, decimals=2)
SOK
Out[9]:
array([0.25, 0.18, 0.18, 0.11, 0.1 , 0.07, 0.06, 0.04, 0.  ])
In [10]:
KOT = dict(zip(X, SOK))

KOT_sorted_keys = sorted(KOT, key=KOT.get, reverse=True)

for r in KOT_sorted_keys:
    print (r, KOT[r])
Pclass 0.25
Sex 0.18
Age 0.18
SibSp 0.11
Parch 0.1
Ticket 0.07
Fare 0.06
Cabin 0.04
Embarked 0.0
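
A caveat: explained_variance_ratio_ describes the principal components (ordered by variance), not the original columns, so pairing it with column names as above is only a rough shorthand. A sketch (under the assumption that the pca object and X from the cells above are still in scope) of how to inspect which original features actually load on each component:

import pandas as pd

# Rows are components, columns are the original features; large absolute
# values show which features dominate a given component
loadings = pd.DataFrame(pca.components_, columns=X.columns)
print(loadings.round(2))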

We look for the single best principal component for the model

In [11]:
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)
In [12]:
from sklearn.ensemble import RandomForestClassifier

RF4 = RandomForestClassifier(max_depth=2, random_state=0)
RF4.fit(Xtrain, ytrain)

# Predicting the Test set results
y_pred1 = RF4.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [13]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, RF4.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, RF4.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, RF4.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, RF4.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, RF4.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, RF4.predict(Xtest)))
y_pred_proba = RF4.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred1)
auc = metrics.roc_auc_score(ytest, y_pred1)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.9877
Precision Training data:   0.6504
----------------------------------------------------------------------
Recall Test data:          0.9524
Precision Test data:       0.7692
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 1 12]
 [ 2 40]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.33      0.08      0.12        13
           1       0.77      0.95      0.85        42

    accuracy                           0.75        55
   macro avg       0.55      0.51      0.49        55
weighted avg       0.67      0.75      0.68        55

auc 0.5146520146520146

We look for the two best principal components for the model

In [14]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
In [15]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)

PCA algorithm

In [16]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)
In [17]:
from sklearn.ensemble import RandomForestClassifier

RF2 = RandomForestClassifier(max_depth=2, random_state=0)
RF2.fit(Xtrain, ytrain)

# Predicting the Test set results
y_pred2 = RF2.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [18]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, RF2.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, RF2.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, RF2.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, RF2.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, RF2.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, RF2.predict(Xtest)))
y_pred_proba = RF2.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred2)
auc = metrics.roc_auc_score(ytest, y_pred2)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.9383
Precision Training data:   0.6972
----------------------------------------------------------------------
Recall Test data:          0.9524
Precision Test data:       0.8
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 3 10]
 [ 2 40]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.60      0.23      0.33        13
           1       0.80      0.95      0.87        42

    accuracy                           0.78        55
   macro avg       0.70      0.59      0.60        55
weighted avg       0.75      0.78      0.74        55

auc 0.5915750915750915

We look for the three best principal components for the model

In [19]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
In [20]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)
In [21]:
#### PCA algorithm
In [22]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)
In [23]:
from sklearn.ensemble import RandomForestClassifier

RF3 = RandomForestClassifier(max_depth=2, random_state=0)
RF3.fit(Xtrain, ytrain)

# Predicting the Test set results
y_pred = RF3.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [24]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, RF3.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, RF3.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, RF3.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, RF3.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, RF3.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, RF3.predict(Xtest)))
y_pred_proba = RF3.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
auc = metrics.roc_auc_score(ytest, y_pred)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.9136
Precision Training data:   0.7115
----------------------------------------------------------------------
Recall Test data:          0.9048
Precision Test data:       0.8444
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 6  7]
 [ 4 38]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.60      0.46      0.52        13
           1       0.84      0.90      0.87        42

    accuracy                           0.80        55
   macro avg       0.72      0.68      0.70        55
weighted avg       0.79      0.80      0.79        55

auc 0.6831501831501832
In [25]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(Xtrain)
In [26]:
X.columns
Out[26]:
Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

Feature Selection Techniques – Random Forest Classifier https://sigmaquality.pl/uncategorized/part-1-how-to-improve-the-classification-model-rfc-feature_importances_200320200724/ Fri, 20 Mar 2020 06:29:27 +0000

200320200724

In [1]:

import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/kaggletrain.csv')
df = df.dropna(how='any')
df.dtypes
Out[1]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [2]:
df.columns
Out[2]:
Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [3]:
df.head(3)
Out[3]:
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
In [4]:
df.dtypes
Out[4]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [5]:
df['Sex'] = pd.Categorical(df.Sex).codes
df['Ticket'] = pd.Categorical(df.Ticket).codes
df['Cabin'] = pd.Categorical(df.Ticket).codes   # note: encodes Ticket, not Cabin; likely a copy-paste slip, kept as in the original run
df['Embarked'] = pd.Categorical(df.Embarked).codes
In [6]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)

Simple classification model: logistic regression

In [7]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain, ytrain) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [8]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))

y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')

print('auc',auc)
plt.show()
Recall Training data:      0.7901
Precision Training data:   0.8205
----------------------------------------------------------------------
Recall Test data:          0.8333
Precision Test data:       0.875
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 8  5]
 [ 7 35]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.53      0.62      0.57        13
           1       0.88      0.83      0.85        42

    accuracy                           0.78        55
   macro avg       0.70      0.72      0.71        55
weighted avg       0.79      0.78      0.79        55

auc 0.7243589743589745

The model that uses all independent variables achieves AUC = 0.72.

We check which independent variables are the best

In [9]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[9]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [10]:
importance = rfc.feature_importances_
In [11]:
importance = np.round(importance, decimals=3)
importance
Out[11]:
array([0.009, 0.195, 0.254, 0.026, 0.029, 0.114, 0.212, 0.139, 0.022])

We sort variables by importance

This is a very useful function when there are many variables.

In [12]:
KOT = dict(zip(X, importance))
KOT_sorted_keys = sorted(KOT, key=KOT.get, reverse=True)

for r in KOT_sorted_keys:
    print (r, KOT[r])
Age 0.254
Fare 0.212
Sex 0.195
Cabin 0.139
Ticket 0.114
Parch 0.029
SibSp 0.026
Embarked 0.022
Pclass 0.009

Note: the variables with the highest scores are the most important to the classifier.
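
As a hedged aside: impurity-based feature_importances_ can overstate high-cardinality features such as Ticket. A sketch of permutation importance as a cross-check (this assumes a newer scikit-learn, 0.22+, which provides sklearn.inspection.permutation_importance; the version used in this post predates it):

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in score;
# rfc, X, y are the fitted forest and data from the cells above
result = permutation_importance(rfc, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(X.columns, result.importances_mean),
                        key=lambda t: t[1], reverse=True):
    print(name, round(imp, 3))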

We now use only the most relevant variables: 'Sex', 'Age', 'Fare', 'Ticket'

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Age', 'Sex', 'Fare','Ticket']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)

Simple classification model: logistic regression

In [14]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain, ytrain) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [15]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.7901
Precision Training data:   0.8649
----------------------------------------------------------------------
Recall Test data:          0.7857
Precision Test data:       0.9167
----------------------------------------------------------------------
Confusion Matrix Test data
[[10  3]
 [ 9 33]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.53      0.77      0.62        13
           1       0.92      0.79      0.85        42

    accuracy                           0.78        55
   macro avg       0.72      0.78      0.74        55
weighted avg       0.82      0.78      0.79        55

auc 0.7774725274725274

Including the 'Pclass' variable degraded the model!

It is widely known that first-class passengers had a better chance of being saved.
The feature_importances_ analysis, however, showed that 'Pclass' carries little weight in the classification process.
Adding this variable to the classification model lowered the AUC.

In [16]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Age', 'Sex', 'Fare','Ticket','Pclass']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
In [17]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain, ytrain) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [18]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.8025
Precision Training data:   0.8553
----------------------------------------------------------------------
Recall Test data:          0.8571
Precision Test data:       0.9
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 9  4]
 [ 6 36]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.60      0.69      0.64        13
           1       0.90      0.86      0.88        42

    accuracy                           0.82        55
   macro avg       0.75      0.77      0.76        55
weighted avg       0.83      0.82      0.82        55

auc 0.7747252747252747

Part_7 Stroke_Prediction – Model Sieci neuronowych PyTorch Technika Osadzania (a PyTorch neural network model with the embedding technique) https://sigmaquality.pl/uncategorized/part_7-stroke_prediction-model-sieci-neuronowych-pytorch-technika-osadzania/ Mon, 09 Mar 2020 09:13:47 +0000
In [1]:

import time
start_time = time.time() ## timing: start of the time measurement
print(time.ctime())
Mon Mar  9 09:36:05 2020
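
The timer started above is presumably read out at the end of the notebook; a minimal sketch of such a closing cell (hypothetical, not shown in this excerpt):

## timing: report how long the whole notebook took
print('Elapsed: %.1f s' % (time.time() - start_time))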
In [2]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Importing the data

In [3]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/Stroke_Prediction_CLEAR.csv')
df.head(3)
Out[3]:
Unnamed: 0 ID Gender Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years Age_years_10
0 1 30650 Male 1 0 Yes Private Urban 87.96 39.2 never smoked 0 58.093151 (53.126, 59.076]
1 3 57008 Female 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0 70.076712 (65.121, 74.11]
2 6 53725 Female 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0 52.041096 (48.082, 53.126]
In [4]:
df.shape
Out[4]:
(29062, 14)
In [5]:
df.Stroke.value_counts().plot(kind='pie', autopct='%1.1f%%')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07b92ee390>

Organizing the columns into categorical and continuous data

In [6]:
df.columns
Out[6]:
Index(['Unnamed: 0', 'ID', 'Gender', 'Hypertension', 'Heart_Disease',
       'Ever_Married', 'Type_Of_Work', 'Residence', 'Avg_Glucose', 'BMI',
       'Smoking_Status', 'Stroke', 'Age_years', 'Age_years_10'],
      dtype='object')
In [7]:
df.Type_Of_Work
Out[7]:
0              Private
1              Private
2              Private
3        Self-employed
4              Private
             ...      
29057         children
29058         Govt_job
29059          Private
29060          Private
29061          Private
Name: Type_Of_Work, Length: 29062, dtype: object
In [8]:
categorical_columns = ['Gender','Hypertension', 'Heart_Disease','Ever_Married','Type_Of_Work','Residence','Smoking_Status','Age_years_10']
numerical_columns = ['Avg_Glucose', 'BMI', 'Age_years']

We designate the 'Stroke' column as the outcome variable

In [9]:
outputs = ['Stroke']

Encoding the text variables

In [10]:
df.dtypes
Out[10]:
Unnamed: 0          int64
ID                  int64
Gender             object
Hypertension        int64
Heart_Disease       int64
Ever_Married       object
Type_Of_Work       object
Residence          object
Avg_Glucose       float64
BMI               float64
Smoking_Status     object
Stroke              int64
Age_years         float64
Age_years_10       object
dtype: object

We need to convert the types of the categorical columns to category. We can do this with the astype() function, as shown below:

Introducing the new data type: 'category'

In [11]:
for category in categorical_columns:
    df[category] = df[category].astype('category')
In [12]:
df.dtypes
Out[12]:
Unnamed: 0           int64
ID                   int64
Gender            category
Hypertension      category
Heart_Disease     category
Ever_Married      category
Type_Of_Work      category
Residence         category
Avg_Glucose        float64
BMI                float64
Smoking_Status    category
Stroke               int64
Age_years          float64
Age_years_10      category
dtype: object
In [13]:
df['Residence'].cat.categories
Out[13]:
Index(['Rural', 'Urban'], dtype='object')
In [14]:
df['Ever_Married'].cat.categories
Out[14]:
Index(['No', 'Yes'], dtype='object')
In [15]:
df['Age_years_10'].cat.categories
Out[15]:
Index(['(22.041, 29.055]', '(29.055, 36.058]', '(36.058, 42.132]',
       '(42.132, 48.082]', '(48.082, 53.126]', '(53.126, 59.076]',
       '(59.076, 65.121]', '(65.121, 74.11]', '(74.11, 82.137]',
       '(9.999, 22.041]'],
      dtype='object')

Encoding the data

In [16]:
df.dtypes
Out[16]:
Unnamed: 0           int64
ID                   int64
Gender            category
Hypertension      category
Heart_Disease     category
Ever_Married      category
Type_Of_Work      category
Residence         category
Avg_Glucose        float64
BMI                float64
Smoking_Status    category
Stroke               int64
Age_years          float64
Age_years_10      category
dtype: object

Why did we encode the data in this form?

The main reason for separating the categorical columns from the numerical columns is that values in a numerical column can be fed directly into a neural network, whereas values in categorical columns must first be converted to numeric types.

In [17]:
categorical_columns
Out[17]:
['Gender',
 'Hypertension',
 'Heart_Disease',
 'Ever_Married',
 'Type_Of_Work',
 'Residence',
 'Smoking_Status',
 'Age_years_10']

Converting the categorical variables to a NumPy matrix

In [18]:
p1 = df['Gender'].cat.codes.values
p2 = df['Hypertension'].cat.codes.values
p3 = df['Heart_Disease'].cat.codes.values
p4 = df['Ever_Married'].cat.codes.values
p5 = df['Type_Of_Work'].cat.codes.values
p6 = df['Residence'].cat.codes.values
p7 = df['Smoking_Status'].cat.codes.values
p8 = df['Age_years_10'].cat.codes.values

NumP_matrix = np.stack([p1, p2, p3, p4, p5, p6, p7, p8], 1)

NumP_matrix[:10]
Out[18]:
array([[1, 1, 0, 1, 2, 1, 1, 5],
       [0, 0, 0, 1, 2, 0, 0, 7],
       [0, 0, 0, 1, 2, 1, 0, 4],
       [0, 0, 1, 1, 3, 0, 1, 8],
       [0, 0, 0, 1, 2, 0, 2, 1],
       [0, 1, 0, 1, 3, 1, 1, 7],
       [1, 0, 1, 1, 2, 1, 0, 8],
       [0, 0, 0, 1, 2, 0, 1, 2],
       [0, 0, 0, 1, 2, 0, 0, 2],
       [0, 0, 0, 1, 2, 0, 1, 2]], dtype=int8)

Creating a PyTorch tensor from the NumPy matrix

In [19]:
categorical_data = torch.tensor(NumP_matrix, dtype=torch.int64)
categorical_data[:10]
Out[19]:
tensor([[1, 1, 0, 1, 2, 1, 1, 5],
        [0, 0, 0, 1, 2, 0, 0, 7],
        [0, 0, 0, 1, 2, 1, 0, 4],
        [0, 0, 1, 1, 3, 0, 1, 8],
        [0, 0, 0, 1, 2, 0, 2, 1],
        [0, 1, 0, 1, 3, 1, 1, 7],
        [1, 0, 1, 1, 2, 1, 0, 8],
        [0, 0, 0, 1, 2, 0, 1, 2],
        [0, 0, 0, 1, 2, 0, 0, 2],
        [0, 0, 0, 1, 2, 0, 1, 2]])

Converting the DataFrame's numerical columns to a PyTorch tensor

In [20]:
numerical_data = np.stack([df[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)
numerical_data[:5]
Out[20]:
tensor([[ 87.9600,  39.2000,  58.0932],
        [ 69.0400,  35.9000,  70.0767],
        [ 77.5900,  17.7000,  52.0411],
        [243.5300,  27.0000,  75.1041],
        [ 77.6700,  32.3000,  32.0247]])

Converting the outcome variable to a PyTorch tensor

In [21]:
outputs = torch.tensor(df[outputs].values).flatten()
outputs[:5]
Out[21]:
tensor([0, 0, 0, 0, 0])

Let's summarize the tensors

In [22]:
print('categorical_data: ',categorical_data.shape)
print('numerical_data:   ',numerical_data.shape)
print('outputs:          ',outputs.shape)
categorical_data:  torch.Size([29062, 8])
numerical_data:    torch.Size([29062, 3])
outputs:           torch.Size([29062])

EMBEDDING

We have converted our categorical columns into numeric ones, where each unique value is represented by a single integer (digitization — e.g., a smoker becomes 1). A model can be trained on such a column, but there is a better way…

The better way is to represent each value of a categorical column as an N-dimensional vector instead of a single integer. This process is called embedding. A vector can capture more information and can express relationships between different categorical values in a more suitable way. We will therefore represent the values of the categorical columns as N-dimensional vectors.

We need to define an embedding size (the vector dimension) for every categorical column. There is no hard-and-fast rule for the number of dimensions. A good rule of thumb is to divide the number of unique values in the column by 2 (but use no more than 50). For example, the 'Smoking_Status' column has 3 unique values, so a suitable embedding size is 3/2 = 1.5, rounded up to 2.
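
To make the idea concrete, here is a tiny sketch (illustrative values only; in the real model the vectors are learned during training) of how nn.Embedding maps integer codes to vectors:

import torch
import torch.nn as nn

# Embed the 3 levels of 'Smoking_Status' into 2-dimensional vectors.
emb = nn.Embedding(num_embeddings=3, embedding_dim=2)
codes = torch.tensor([0, 1, 2])   # integer codes of the three categories
print(emb(codes))                 # each code becomes a 2-dimensional vector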

The script below creates a list of tuples containing the number of unique values and the embedding dimension for each categorical column.

The rule is simple: the embedding matrix must always have more rows than the range of the integer codes; that is why I added col_size + 2, which is a generous safety margin.

In [23]:
categorical_column_sizes = [len(df[column].cat.categories) for column in categorical_columns]
categorical_embedding_sizes = [(col_size+2, min(50, (col_size+5)//2)) for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)
[(4, 3), (4, 3), (4, 3), (4, 3), (7, 5), (4, 3), (5, 4), (12, 7)]

Splitting the dataset into training and test sets

In [24]:
total_records = df['ID'].count()
test_records = int(total_records * .2)

categorical_train_data = categorical_data[:total_records-test_records]
categorical_test_data = categorical_data[total_records-test_records:total_records]
numerical_train_data = numerical_data[:total_records-test_records]
numerical_test_data = numerical_data[total_records-test_records:total_records]
train_outputs = outputs[:total_records-test_records]
test_outputs = outputs[total_records-test_records:total_records]

To verify that the data were split into training and test sets correctly, let's print the lengths of the training and test records:

In [25]:
print('categorical_train_data: ',categorical_train_data.shape)
print('numerical_train_data:   ',numerical_train_data.shape)
print('train_outputs:          ', train_outputs.shape)
print('----------------------------------------------------')
print('categorical_test_data:  ',categorical_test_data.shape)
print('numerical_test_data:    ',numerical_test_data.shape)
print('test_outputs:           ',test_outputs.shape)
categorical_train_data:  torch.Size([23250, 8])
numerical_train_data:    torch.Size([23250, 3])
train_outputs:           torch.Size([23250])
----------------------------------------------------
categorical_test_data:   torch.Size([5812, 8])
numerical_test_data:     torch.Size([5812, 3])
test_outputs:            torch.Size([5812])
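
Note that this split is sequential: the last 20% of rows become the test set. If the rows carried any ordering, a shuffled split would be safer — a sketch under that assumption:

# Shuffled split sketch: draw a random permutation of the row indices.
perm = torch.randperm(int(total_records))
train_idx, test_idx = perm[test_records:], perm[:test_records]

categorical_train_data = categorical_data[train_idx]
categorical_test_data = categorical_data[test_idx]
numerical_train_data = numerical_data[train_idx]
numerical_test_data = numerical_data[test_idx]
train_outputs = outputs[train_idx]
test_outputs = outputs[test_idx]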

Building the PyTorch classification model

In [26]:
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        # One embedding layer per categorical column: ni rows, nf dimensions.
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        # Batch-normalize the raw numerical features.
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        all_layers = []
        # Total width of all concatenated embedding vectors.
        num_categorical_cols = sum((nf for ni, nf in embedding_size))
        input_size = num_categorical_cols + num_numerical_cols

        # Hidden blocks: Linear -> ReLU -> BatchNorm -> Dropout.
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i

        # Output layer: one logit per class.
        all_layers.append(nn.Linear(layers[-1], output_size))

        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        # Look up the embedding vector for each categorical column.
        embeddings = []
        for i, e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        # Concatenate the normalized numerical features with the embeddings.
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        x = self.layers(x)
        return x
In [27]:
print('categorical_embedding_sizes:  ',categorical_embedding_sizes)
print(numerical_data.shape[1])
categorical_embedding_sizes:   [(4, 3), (4, 3), (4, 3), (4, 3), (7, 5), (4, 3), (5, 4), (12, 7)]
3
In [28]:
model = Model(categorical_embedding_sizes, numerical_data.shape[1], 2, [200,100,50], p=0.4)
In [29]:
print(model)
Model(
  (all_embeddings): ModuleList(
    (0): Embedding(4, 3)
    (1): Embedding(4, 3)
    (2): Embedding(4, 3)
    (3): Embedding(4, 3)
    (4): Embedding(7, 5)
    (5): Embedding(4, 3)
    (6): Embedding(5, 4)
    (7): Embedding(12, 7)
  )
  (embedding_dropout): Dropout(p=0.4, inplace=False)
  (batch_norm_num): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=34, out_features=200, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4, inplace=False)
    (8): Linear(in_features=100, out_features=50, bias=True)
    (9): ReLU(inplace=True)
    (10): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): Dropout(p=0.4, inplace=False)
    (12): Linear(in_features=50, out_features=2, bias=True)
  )
)
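
Note where in_features=34 of the first linear layer comes from: the embedding dimensions sum to 3+3+3+3+5+3+4+7 = 31, and adding the 3 numerical columns gives 34 inputs.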

Defining the loss function

In [30]:
#loss_function = torch.nn.MSELoss(reduction='sum')
loss_function = nn.CrossEntropyLoss()
#loss_function = nn.BCEWithLogitsLoss()
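
A quick illustration of the interface (a sketch with made-up numbers): nn.CrossEntropyLoss applies log-softmax internally, so it expects raw logits of shape (batch, classes) and integer class labels of shape (batch,):

# Illustrative values only, not data from this notebook.
demo_logits = torch.tensor([[2.9, -2.7],    # strongly favours class 0
                            [-0.5,  0.5]])  # mildly favours class 1
demo_labels = torch.tensor([0, 1])
print(nn.CrossEntropyLoss()(demo_logits, demo_labels))  # small loss: both correct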

Defining the optimizer

In [31]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
#optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
#optimizer = torch.optim.Rprop(model.parameters(), lr=0.001, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
In [32]:
print('categorical_embedding_sizes:  ',categorical_embedding_sizes)
print(numerical_data.shape[1])
print('categorical_train_data: ',categorical_train_data.shape)
print('numerical_train_data:   ',numerical_train_data.shape)
print('outputs:                ',train_outputs.shape)
categorical_embedding_sizes:   [(4, 3), (4, 3), (4, 3), (4, 3), (7, 5), (4, 3), (5, 4), (12, 7)]
3
categorical_train_data:  torch.Size([23250, 8])
numerical_train_data:    torch.Size([23250, 3])
outputs:                 torch.Size([23250])
In [33]:
y_pred = model(categorical_train_data, numerical_train_data)
In [34]:
epochs = 300
aggregated_losses = []

for i in range(epochs):
    i += 1
    y_pred = model(categorical_train_data, numerical_train_data)

    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss.item())  # store the scalar loss for plotting

    if i % 30 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

    optimizer.zero_grad()
    single_loss.backward()
    optimizer.step()

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
epoch:   1 loss: 0.78628689
epoch:  31 loss: 0.61815375
epoch:  61 loss: 0.54198617
epoch:  91 loss: 0.42037284
epoch: 121 loss: 0.28172502
epoch: 151 loss: 0.16479240
epoch: 181 loss: 0.11258653
epoch: 211 loss: 0.10971391
epoch: 241 loss: 0.09200522
epoch: 271 loss: 0.08967553
epoch: 300 loss: 0.0847498849
In [35]:
plt.plot(range(epochs), aggregated_losses)
plt.ylabel('Loss')
plt.xlabel('epoch');

Predicting with the model

In [36]:
with torch.no_grad():
    y_val_train = model(categorical_train_data, numerical_train_data)
    loss = loss_function( y_val_train, train_outputs)
print(f'Loss train_set: {loss:.8f}')
Loss train_set: 0.08532631
In [37]:
with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}')
Loss: 0.10788266

Since we defined the output layer with 2 neurons, each prediction contains 2 values. For example, the first 5 predicted values look like this:

In [38]:
print(y_val[:5])
tensor([[ 2.9134, -2.6896],
        [ 1.9811, -1.8706],
        [ 1.5829, -1.3088],
        [ 3.2281, -2.7011],
        [ 2.3756, -1.8374]])

The idea behind these predictions is that if the true outcome is 0, the value at index 0 should be higher than the value at index 1, and vice versa. We can retrieve the index of the largest value with the following script:

In [39]:
y_val = torch.argmax(y_val, dim=1)

The call above returns the indices of the maximum values along the class axis.
In [41]:
print(y_val[:195])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0])

Because in the list of raw predictions the values at index 0 are larger than the values at index 1 for the first five records, the first five rows of the processed output contain 0.

In [42]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(test_outputs,y_val))
print(classification_report(test_outputs,y_val))
print(accuracy_score(test_outputs, y_val))
[[5689    0]
 [ 123    0]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5689
           1       0.00      0.00      0.00       123

    accuracy                           0.98      5812
   macro avg       0.49      0.50      0.49      5812
weighted avg       0.96      0.98      0.97      5812

0.9788368891947694
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

The model detects stroke poorly: it never predicts the positive class, a direct consequence of the severe class imbalance (strokes make up only about 2% of the records).
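
One common remedy, not explored in this post, is to re-weight the loss so that the rare 'stroke' class counts more — a sketch:

# Sketch: weight each class inversely to its frequency in the training labels,
# then retrain the model with the weighted criterion.
counts = torch.bincount(train_outputs)                 # [n_class0, n_class1]
class_weights = counts.sum() / (2.0 * counts.float())  # rarer class -> larger weight
loss_function = nn.CrossEntropyLoss(weight=class_weights)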

Saving the entire model

In [43]:
torch.save(model,'/home/wojciech/Pulpit/3/byk.pb')
/home/wojciech/anaconda3/lib/python3.7/site-packages/torch/serialization.py:360: UserWarning: Couldn't retrieve source code for container of type Model. It won't be checked for correctness upon loading.
  "type " + obj.__name__ + ". It won't be checked "

Restoring the entire model

In [44]:
KOT = torch.load('/home/wojciech/Pulpit/3/byk.pb')
KOT.eval()
Out[44]:
Model(
  (all_embeddings): ModuleList(
    (0): Embedding(4, 3)
    (1): Embedding(4, 3)
    (2): Embedding(4, 3)
    (3): Embedding(4, 3)
    (4): Embedding(7, 5)
    (5): Embedding(4, 3)
    (6): Embedding(5, 4)
    (7): Embedding(12, 7)
  )
  (embedding_dropout): Dropout(p=0.4, inplace=False)
  (batch_norm_num): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=34, out_features=200, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4, inplace=False)
    (8): Linear(in_features=100, out_features=50, bias=True)
    (9): ReLU(inplace=True)
    (10): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): Dropout(p=0.4, inplace=False)
    (12): Linear(in_features=50, out_features=2, bias=True)
  )
)

By substituting other independent variables we can obtain a vector of output predictions

In [45]:
A = categorical_train_data[::50]
A
Out[45]:
tensor([[1, 1, 0,  ..., 1, 1, 5],
        [0, 1, 0,  ..., 0, 1, 5],
        [1, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 9],
        [0, 0, 0,  ..., 1, 1, 1],
        [1, 0, 0,  ..., 1, 1, 9]])
In [46]:
B = numerical_train_data[::50]
B
Out[46]:
tensor([[ 87.9600,  39.2000,  58.0932],
        [235.8500,  40.1000,  57.0329],
        [ 80.8100,  33.2000,  34.1260],
        ...,
        [139.9900,  26.8000,  20.0493],
        [ 61.3100,  33.1000,  29.0575],
        [ 92.2200,  35.0000,  21.0603]])
In [47]:
y =train_outputs[::50]
In [51]:
y_pred_AB = KOT(A, B)
y_pred_AB[:10]
Out[51]:
tensor([[ 2.0236, -1.7279],
        [ 1.7250, -1.4366],
        [ 2.3650, -1.8698],
        [ 1.7376, -1.3888],
        [ 2.2012, -1.8784],
        [ 2.3820, -2.0166],
        [ 2.4481, -2.0788],
        [ 2.2441, -1.9123],
        [ 2.2570, -1.9358],
        [ 2.4153, -2.0513]], grad_fn=<SliceBackward>)
In [52]:
with torch.no_grad():
    y_val_AB = KOT(A,B)
    loss = loss_function( y_val_AB, y)
print(f'Loss train_set: {loss:.8f}')
Loss train_set: 0.10623065
In [53]:
y_val = torch.argmax(y_val_AB, dim=1)
In [54]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y,y_val))
print(classification_report(y,y_val))
print(accuracy_score(y, y_val))
[[453   0]
 [ 12   0]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       453
           1       0.00      0.00      0.00        12

    accuracy                           0.97       465
   macro avg       0.49      0.50      0.49       465
weighted avg       0.95      0.97      0.96       465

0.9741935483870968
In [56]:
print('Execution time of this task:')
print(time.time() - start_time)  # end of timing
Execution time of this task:
1122.9189233779907

The article Part_7 Stroke_Prediction – Model Sieci neuronowych PyTorch Technika Osadzania originally appeared on THE DATA SCIENCE LIBRARY.

Part_1 Stroke_Prediction – Preparation of data for analysis https://sigmaquality.pl/uncategorized/part_1-stroke_prediction-preparation-of-data-for-analysis/ Fri, 06 Mar 2020 07:14:35 +0000

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


df= pd.read_csv('c:/1/Stroke_Prediction.csv')
df.head(5)
Out[1]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke
0 31153 Male 1104.0 0 0 No children Rural 95.12 18.0 NaN 0
1 30650 Male 21204.0 1 0 Yes Private Urban 87.96 39.2 never smoked 0
2 17412 Female 2928.0 0 0 No Private Urban 110.89 17.6 NaN 0
3 57008 Female 25578.0 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0
4 46657 Male 5128.0 0 0 No Never_worked Rural 161.28 19.1 NaN 0

1. Checking data completeness and formats

In [2]:
df.isnull().sum()
Out[2]:
ID                    0
Gender                0
Age_In_Days           0
Hypertension          0
Heart_Disease         0
Ever_Married          0
Type_Of_Work          0
Residence             0
Avg_Glucose           0
BMI                1462
Smoking_Status    13292
Stroke                0
dtype: int64

Data are missing for:

- BMI
- Smoking_Status

Structure of the missing values: BMI and Smoking_Status
In [3]:
import seaborn as sns

print('observations, variables: ', df.shape)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
observations, variables:  (43400, 12)
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x223eb4aa208>

Analysis of BMI (Body Mass Index)

  • BMI < 18.5 → underweight
  • 18.5 ≤ BMI ≤ 24.9 → normal weight
  • 25 ≤ BMI ≤ 29.9 → overweight
  • BMI ≥ 30 → obesity
In [4]:
a = r'$BMI = \frac{mass}{height^{2}}$'
ax = plt.axes([0, 0, 0.3, 0.3])  # left, bottom, width, height
ax.set_xticks([])
ax.set_yticks([])
ax.axis('off')
plt.text(0.4, 0.4, a)
Source:  https://www.poradnikzdrowie.pl/sprawdz-sie/kalkulatory/kalkulator-wagi-bmi-aa-4Q8M-4h3E-dtKD.html

Checking for invalid values in the BMI (Body Mass Index) column:

  • Maximum plausible BMI: assume people are at least 100 cm tall and weigh at most 400 kg
  • Minimum plausible BMI: assume people are at most 220 cm tall and weigh at least 30 kg
In [5]:
max_BMI=400/(1*1)
min_BMI=30/(2.20*2.20)

print('max_BMI: ', max_BMI)
print('min_BMI: ', min_BMI)
max_BMI:  400.0
min_BMI:  6.198347107438016
In [6]:
df[(df['BMI'] <= 10) | (df['BMI'] >= 300)]  # OR, not AND: a value violating either bound
Out[6]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke

No corrupted values in the BMI column.

Let's check the structure of the data.

In [7]:
BMI1 = pd.qcut(df['BMI'],12)
BMI1.value_counts(dropna = False).sort_values(ascending=False).plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x223eb8396a0>
In [8]:
#df.BMI.value_counts(dropna = False)
In [9]:
import matplotlib.dates as mdates

fig, ax = plt.subplots()
df['BMI'].plot.kde(ax=ax, legend=False, title='Histogram: BMI')
df['BMI'].plot.hist(density=True, ax=ax)

ax.set_ylabel('Probability')
ax.grid(axis='y')
#ax.set_facecolor('#d8dcd6')
In my assessment, the missing BMI values cannot be reconstructed. The records with missing BMI should therefore be deleted.
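
An alternative, not used in this analysis, would be simple imputation instead of dropping rows — a minimal sketch:

# Hypothetical alternative: fill missing BMI with the column median.
df['BMI'] = df['BMI'].fillna(df['BMI'].median())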

Analysis of Smoking_Status

In [10]:
df['Smoking_Status'].value_counts(dropna = False)
Out[10]:
never smoked       16053
NaN                13292
formerly smoked     7493
smokes              6562
Name: Smoking_Status, dtype: int64

Similarly, there is no way to fill in the missing values of the independent variable Smoking_Status based on the behavior of the other independent variables.

This variable must have exactly three states:

  • never smoked,
  • formerly smoked,
  • smokes.

Leaving NaN as a fourth state would be a mistake.

In [11]:
df['Smoking_Status'].value_counts(normalize=True,dropna = False)
Out[11]:
never smoked       0.369885
NaN                0.306267
formerly smoked    0.172650
smokes             0.151198
Name: Smoking_Status, dtype: float64

How important are 'Smoking_Status' and 'BMI' for the outcome variable? I check this because entire variables could potentially be eliminated. Along the way we will also examine the correlations of the remaining exogenous variables with the endogenous variable.

In [12]:
df['Ss_nowa'] = pd.Categorical(df['Smoking_Status']).codes

CORREL = df.corr().sort_values('Stroke')
print(CORREL['Stroke'])
del df['Ss_nowa']
ID               0.003067
Ss_nowa          0.019140
BMI              0.020285
Hypertension     0.075332
Avg_Glucose      0.078917
Heart_Disease    0.113763
Age_In_Days      0.153703
Stroke           1.000000
Name: Stroke, dtype: float64
In my assessment, the missing 'Smoking_Status' values cannot be reconstructed either. The records with missing 'Smoking_Status' must therefore be deleted, even though they account for as much as 31% of the dataset.

Deleting all records with missing 'Smoking_Status' or 'BMI'

In [13]:
print('Before deletion: ', df.shape)
df = df.dropna(how='any')
print('After deletion:  ', df.shape)
Before deletion:  (43400, 12)
After deletion:   (29072, 12)
In [14]:
df.head(5)
Out[14]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke
1 30650 Male 21204.0 1 0 Yes Private Urban 87.96 39.2 never smoked 0
3 57008 Female 25578.0 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0
6 53725 Female 18995.0 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0
7 41553 Female 27413.0 0 1 Yes Self-employed Rural 243.53 27.0 never smoked 0
8 16167 Female 11689.0 0 0 Yes Private Rural 77.67 32.3 smokes 0

Analysis of Gender

In [15]:
df['Gender'].value_counts(normalize=True,dropna = False)
Out[15]:
Female    0.614062
Male      0.385698
Other     0.000241
Name: Gender, dtype: float64

What does gender 'Other' mean? The study examines susceptibility to stroke with respect to, among other things, biological sex; psychological gender is irrelevant here. I treat 'Other' as a data error and delete it — keeping a third state 'Other' would be a mistake for the classification process.

In [16]:
df['Gender'] = df['Gender'].replace('Other', np.nan)
print('Before deletion: ', df.shape)
df = df.dropna(how='any')
print('After deletion:  ', df.shape)
Before deletion:  (29072, 12)
After deletion:   (29065, 12)

Analysis of Age_In_Days

Human age is analyzed in years. At the same time, people age unevenly, so it is advisable to analyze patients by age group.

In [17]:
df['Age_years']= df['Age_In_Days']/365

Checking whether the 'Age_years' variable contains valid values. It turns out that three patients are over 200 years old. We delete these records.

In [18]:
df[df['Age_years']>120]
Out[18]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years
1342 58414 Female 85451.0 0 0 No Private Rural 65.30 22.1 smokes 0 234.112329
18177 31212 Female 117179.0 0 0 Yes Govt_job Rural 84.39 38.9 never smoked 0 321.038356
26716 70730 Female 79231.0 0 0 No Private Rural 77.62 23.1 formerly smoked 0 217.071233
In [19]:
df['Age_years'] = df['Age_years'].apply(lambda x: np.nan if x > 120 else x)
print('Before deletion: ', df.shape)
df = df.dropna(how='any')
print('After deletion:  ', df.shape)
df[df['Age_years']>120]
Before deletion:  (29065, 13)
After deletion:   (29062, 13)
Out[19]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years

I split age into 10 age groups and delete the 'Age_In_Days' column.

In [20]:
del df['Age_In_Days']
df['Age_years_10']= pd.qcut(df['Age_years'],10)
df['Age_years_10'].value_counts(normalize=True,dropna = False).sort_values(ascending=False).plot(kind='bar')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x223eb831550>

Analysis of Hypertension

In [21]:
df['Hypertension'].value_counts(normalize=True,dropna = False)
Out[21]:
0    0.88848
1    0.11152
Name: Hypertension, dtype: float64

Hypertension occurs in 11% of patients.

In [22]:
df['BMI_5']= pd.qcut(df['BMI'],5)
df.pivot_table(index='BMI_5', columns = 'Hypertension', values='Age_years',aggfunc='count')
Out[22]:
Hypertension 0 1
BMI_5
(10.099, 24.1] 5605 290
(24.1, 27.4] 5352 506
(27.4, 30.7] 5160 691
(30.7, 35.3] 4893 810
(35.3, 92.0] 4811 944
In [23]:
df['BMI_5'] = df['BMI_5'].astype(object)

plt.style.use('seaborn')

table=pd.crosstab(df['BMI_5'],df['Hypertension'])
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True, fontsize=14)
plt.title('BMI vs Hypertension', fontsize=20)
plt.xlabel('BMI groups')
plt.ylabel('Proportion')

del df['BMI_5']

The exogenous variable 'Hypertension' is valid and behaves in line with the common view that the higher the BMI, the more frequent hypertension is.

Analysis of Heart_Disease

In [24]:
df['Heart_Disease'].value_counts(normalize=True,dropna = False)
Out[24]:
0    0.947836
1    0.052164
Name: Heart_Disease, dtype: float64

Analysis of Ever_Married

In [25]:
df['Ever_Married'].value_counts(normalize=True,dropna = False)
Out[25]:
Yes    0.746198
No     0.253802
Name: Ever_Married, dtype: float64

Analysis of Type_Of_Work

In [26]:
df['Type_Of_Work'].value_counts(normalize=True,dropna = False)
Out[26]:
Private          0.651985
Self-employed    0.179065
Govt_job         0.144312
children         0.021162
Never_worked     0.003475
Name: Type_Of_Work, dtype: float64

Analysis of Residence

In [27]:
df['Residence'].value_counts(normalize=True,dropna = False)
Out[27]:
Urban    0.502099
Rural    0.497901
Name: Residence, dtype: float64

Analysis of Avg_Glucose

In [28]:
df['Avg_Glucose'].describe()
Out[28]:
count    29062.000000
mean       106.408801
std         45.273649
min         55.010000
25%
50%
75%
max        291.050000
Name: Avg_Glucose, dtype: float64
In [29]:
df['Avg_Glucose'].plot.kde()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x223ec35e438>

The probability density of 'Avg_Glucose' shows an anomaly that we will not explain at this stage of the study.

Analysis of Stroke

In [30]:
df['Stroke'].value_counts(normalize=True,dropna = False)
Out[30]:
0    0.981144
1    0.018856
Name: Stroke, dtype: float64

The outcome variable 'Stroke' is highly imbalanced. Out of curiosity, let's see whether any pattern of dependence between the dependent variable and the independent variables emerges.
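
As a quick, illustrative check (a sketch using the binned age column created above), the stroke rate per age decile can be tabulated like this:

# Mean of the 0/1 Stroke flag per age decile = stroke rate in that group.
df.pivot_table(index='Age_years_10', values='Stroke', aggfunc='mean')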

General analysis of the variables

It is carried out to check whether the behavior of the variables agrees with generally known facts.

In [31]:
df.columns
Out[31]:
Index(['ID', 'Gender', 'Hypertension', 'Heart_Disease', 'Ever_Married',
       'Type_Of_Work', 'Residence', 'Avg_Glucose', 'BMI', 'Smoking_Status',
       'Stroke', 'Age_years', 'Age_years_10'],
      dtype='object')
In [32]:
kot = ["#c0c2ce", "#e40c2b"]
sns.pairplot(data=df[[ 'Avg_Glucose', 'BMI', 'Stroke', 'Age_years']], hue='Stroke', dropna=True, palette=kot)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:488: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
Out[32]:
<seaborn.axisgrid.PairGrid at 0x223ec35e780>

A preliminary look at the continuous variables in the plot above shows that:

  1. the probability of stroke increases with age,
  2. stroke occurs most often in the BMI range 20–50,
  3. the glucose level seems to be of no importance.

The behavior of the variables confirms generally known facts.

Saving the cleaned and corrected dataset for further analysis

In [33]:
df.head(3)
Out[33]:
ID Gender Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years Age_years_10
1 30650 Male 1 0 Yes Private Urban 87.96 39.2 never smoked 0 58.093151 (53.126, 59.076]
3 57008 Female 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0 70.076712 (65.121, 74.11]
6 53725 Female 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0 52.041096 (48.082, 53.126]
In [34]:
df.isnull().sum()
Out[34]:
ID                0
Gender            0
Hypertension      0
Heart_Disease     0
Ever_Married      0
Type_Of_Work      0
Residence         0
Avg_Glucose       0
BMI               0
Smoking_Status    0
Stroke            0
Age_years         0
Age_years_10      0
dtype: int64
In [35]:
df.to_csv('c:/1/Stroke_Prediction_CLEAR.csv')
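
Note: by default to_csv also writes the DataFrame index as an extra column; when the file is reloaded in a later part it shows up as 'Unnamed: 0' (visible in the dtypes listing of Part 7). Passing index=False avoids this:

# Write without the index column so no 'Unnamed: 0' appears on reload.
df.to_csv('c:/1/Stroke_Prediction_CLEAR.csv', index=False)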

End of Part_1: Stroke_Prediction – Preparation of data for analysis

Part_2 Stroke_Prediction – Preparation of data for the classification process

The article Part_1 Stroke_Prediction – Preparation of data for analysis originally appeared on THE DATA SCIENCE LIBRARY.