Uncategorized - THE DATA SCIENCE LIBRARY
http://sigmaquality.pl/category/uncategorized/
A Recommender System in the Mill (Part 1): Building on Azure Synapse
https://sigmaquality.pl/moje-publikacje/a-recommender-system-in-the-mill-part-1-building-on-azure-synapse/ (Sat, 09 Aug 2025)
We are building a recommendation system for an online food wholesaler offering organic flours made from exotic and rare grains. The store serves approximately 20,000 customers across Europe, and the system’s goal is to suggest products that customers are likely to enjoy and ultimately purchase.

A key assumption is that recommendations do not need to operate in real time—the model will be refreshed periodically (e.g., once a week) based on accumulated data.


Data Sources

We draw on three primary data sources:

  1. Customer profile database – containing static client information: gender, age, country and place of residence, type of business activity (e.g., whether the customer is a dietitian, baker, restaurateur, or an instructor running bread-baking workshops).

  2. Transaction database (sales register) – containing purchase history: which products were bought, when transactions occurred, etc. Each transaction is linked to a specific customer, allowing tracking of preferences and shopping habits.

  3. Application log (behavioral) database – processed data describing customer behavior in the store (e.g., site interactions, reactions to promotions). This includes emotional and behavioral attributes such as: tendency to respond to promotions, speed of purchase decisions, likelihood of abandoning the cart, susceptibility to recommendations, or propensity to click banners. These features are extracted from logs and stored in a structured form for each customer.

Such data enables the recommender system to tailor offers more effectively. For example, a “promotion-oriented” customer might respond better to discounts, while a dietitian might be more interested in gluten-free products. Based on the sales register, we can identify customer preferences for specific products.
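For instance, a minimal pandas sketch of turning the sales register into a per-customer preference table (the column names below are assumptions for illustration, not the store's actual schema):

import pandas as pd

# Toy sales-register extract
sales = pd.DataFrame({
    "customer_id": [101, 101, 102, 102, 102],
    "product_id":  ["einkorn", "teff", "teff", "amaranth", "teff"],
    "quantity":    [2, 1, 3, 1, 2],
})

# Per-customer purchase volume per product: a simple preference signal
preferences = (sales.groupby(["customer_id", "product_id"])["quantity"]
                    .sum()
                    .unstack(fill_value=0))
print(preferences)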


Azure-Based Recommendation System

(Synapse + Azure ML with Microsoft Recommenders)

Azure does not provide a single, fully managed “out-of-the-box” recommendation service. Instead, we assemble a solution using several components:

  • Azure Synapse Analytics (or Azure Databricks) for data processing and model training.

  • Azure Machine Learning for experiment/model management, and optionally databases or services for serving results (e.g., Azure Cosmos DB, Azure SQL, or Azure Kubernetes Service + API).

  • Microsoft Solution Accelerators, such as the Moyo Azure Synapse Retail Recommender Solution, which provides an end-to-end retail product recommendation pipeline using Synapse (Spark), model training, and deployment as a service.

Concept: transactional data is processed in Synapse (Spark), the model is trained in Azure ML, and the recommendations are stored in a database or made available via an API. The entire pipeline can run in batch mode (e.g., weekly).
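As a toy illustration of that weekly batch flow, the sketch below uses simplified placeholders (a popularity ranking stands in for the real model; none of these functions are Azure APIs):

from collections import Counter, defaultdict

def load_interactions():
    # placeholder for reading the joined tables from Synapse / the data lake
    return [("u1", "teff", 3), ("u1", "einkorn", 1), ("u2", "teff", 2)]

def train_and_score(interactions, n=2):
    # placeholder "model": recommend globally popular items the user has not bought yet
    popularity = Counter(p for _, p, _ in interactions)
    seen = defaultdict(set)
    for u, p, _ in interactions:
        seen[u].add(p)
    return {u: [p for p, _ in popularity.most_common() if p not in seen[u]][:n]
            for u in seen}

def weekly_refresh():
    recs = train_and_score(load_interactions())
    print(recs)  # in production: write to Cosmos DB / SQL or expose via an API

weekly_refresh()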


Data Preparation

We collect customer, product, and interaction data.

  • Customer database: gender, age, country, etc.

  • Application logs: behavioral metrics.

  • Sales transactions: the key link between users and purchased products, forming the core training data.

These datasets are loaded into Azure Data Lake and processed/joined in Synapse (Apache Spark) or Azure Data Factory to create model-ready tables. Product attributes (e.g., flour categories, grain type, region of origin) can also be included as item features for more advanced approaches.
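A hedged PySpark sketch of that join step (the paths, file formats and column names are assumptions for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

customers    = spark.read.parquet("abfss://lake/customers")
transactions = spark.read.parquet("abfss://lake/transactions")
behavior     = spark.read.parquet("abfss://lake/behavior_features")

# One row per (customer, product) with an implicit-feedback strength,
# enriched with profile and behavioral attributes
interactions = (transactions
    .groupBy("customer_id", "product_id")
    .agg(F.count("*").alias("purchase_count"))
    .join(customers, "customer_id", "left")
    .join(behavior, "customer_id", "left"))

interactions.write.mode("overwrite").parquet("abfss://lake/model_ready/interactions")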


Algorithm Selection and Model Training

Azure allows complete flexibility in recommendation methodology—collaborative filtering or content-based/hybrid approaches. Microsoft’s open-source Microsoft Recommenders library provides many algorithms and examples (e.g., ALS, Bayesian Personalized Ranking, SAR co-occurrence, sequential models, neural networks, LightGBM).

Collaborative Filtering (ALS)
The Moyo accelerator uses a matrix factorization ALS model trained on user–product transactional data (implicit feedback). After data cleaning and creating the interaction matrix, ALS in Spark MLlib is trained. The result is a model that predicts preference scores for each user–product pair based on latent vectors.

From this, we generate top-N recommendation lists per user (excluding already purchased products). These can be stored in Azure Cosmos DB and refreshed weekly by rerunning the pipeline. The same ALS model can also be used for item-to-item recommendations.
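A minimal Spark MLlib sketch of this step, assuming an interactions table like the one sketched earlier with integer-encoded ids (the hyperparameters are illustrative, not tuned values):

from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="customer_idx",          # ALS requires integer user/item ids
    itemCol="product_idx",
    ratingCol="purchase_count",
    implicitPrefs=True,              # transactions are implicit feedback
    rank=32, regParam=0.1, alpha=10.0,
    coldStartStrategy="drop",
)
model = als.fit(interactions)

# Top-10 candidates per user; already-purchased items can then be
# removed with an anti-join against the interactions table
top10 = model.recommendForAllUsers(10)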

Content-Based / Hybrid
Alternatively, we can use user and product features to predict purchase likelihood. Microsoft proposes a LightGBM ranking model using:

  • user features (demographics, behavioral log indicators),

  • product features (category, grain type, etc.),

  • aggregated interaction stats (e.g., purchase counts per category).

Positive examples are historical purchases; negative samples can be generated. This allows recommending new products based on profile similarity without relying solely on co-purchase patterns.
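A hedged sketch of such a model on synthetic stand-in data (a plain classifier for brevity; LGBMRanker is LightGBM's ranking variant):

import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))        # stand-in for joined user/product/interaction features
y = rng.integers(0, 2, size=1000)     # 1 = historical purchase, 0 = sampled negative

clf = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # predicted purchase likelihood per (user, product) pair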

Behavioral attributes from logs (e.g., “susceptibility to recommendations”) can also drive segmentation (e.g., k-means clustering) and influence how recommendations are ranked for different segments. In ALS, these features are not used during training but can be applied later for filtering or re-ranking.
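For example, a small scikit-learn sketch of that segmentation step (the behavioral columns are stand-ins):

import numpy as np
from sklearn.cluster import KMeans

behavior = np.random.default_rng(1).random((500, 4))  # e.g. promo affinity, decision speed, cart abandonment, banner clicks
segments = KMeans(n_clusters=4, random_state=0, n_init=10).fit_predict(behavior)
# segments[i] can then decide, e.g., whether discounted items are boosted
# in customer i's recommendation list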

In practice, one could train ALS on purchases and LightGBM on features, then combine results (ensemble) or select the better approach.


Serving Recommendations

In a weekly batch mode, it’s often enough to store per-user top-10 lists in a database or warehouse (e.g., Azure Cosmos DB JSON, SQL tables) for the website to display in “Recommended for You” sections. Synapse integrates with Cosmos DB (Synapse Link) for fast loading. Orchestration can be done via Azure ML Pipelines or Azure Data Factory/Synapse Pipelines.
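For illustration, a per-user document could look like this (the shape and field names are assumptions, not a fixed schema):

import json

doc = {
    "id": "customer-101",
    "recommendations": [
        {"product_id": "teff", "score": 0.92},
        {"product_id": "einkorn", "score": 0.87},
    ],
    "refreshed_at": "2025-08-04",
}
print(json.dumps(doc, indent=2))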

For item-to-item recommendations (“Customers also bought…”), ALS similarity scores can be precomputed offline or exposed in real time via a REST API deployed on Azure Kubernetes Service (AKS) through Azure ML, integrated with Azure API Management.
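A small sketch of precomputing those item-to-item similarities from the ALS latent vectors (random placeholders stand in for the collected model.itemFactors):

import numpy as np

item_factors = np.random.default_rng(2).normal(size=(50, 32))  # placeholder item factors
unit = item_factors / np.linalg.norm(item_factors, axis=1, keepdims=True)
sim = unit @ unit.T                                            # cosine similarity matrix

def also_bought(item_idx, k=5):
    order = np.argsort(-sim[item_idx])          # most similar first
    return [i for i in order if i != item_idx][:k]

print(also_bought(0))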


Summary

Azure’s approach is highly flexible—algorithms can be tailored, and non-standard data (e.g., emotional attributes) can be incorporated in any way. It does, however, require more engineering work: preparing data, choosing/deploying algorithms, and setting up training/serving infrastructure (Synapse Spark or Azure ML).

Microsoft helps by providing sample code (Solution Accelerators, the Recommenders repo) and tight service integration (Synapse↔Cosmos DB, Azure ML→AKS).



Wojciech Moszczyński
Graduate of the Department of Econometrics and Statistics at Nicolaus Copernicus University in Toruń. Specialist in econometrics, finance, data science, and management accounting. Focused on optimizing production and logistics processes. Conducts research in AI development and applications. Actively promotes machine learning and data science in business environments.

panda
https://sigmaquality.pl/uncategorized/tabela/ (Wed, 28 May 2025)
Statistical description:

df.describe()

Statistical description of selected column types only (e.g. only 'object' or 'float64'):

df.describe(include='float64')
df.describe(include='object')

Statistical description of numeric columns only:

df.describe(include=[np.number])

Show only the columns of type 'object':

df.describe(include=["object"]).columns

Filter discrete (categorical) variables into a separate dataframe:

cat_sdf = df.select_dtypes(include=['object']).copy()

Rounding:

print(f'{round(celsius, 2)}')

Printing with rounding (without hundredths), e.g.:

print(f'Kendall correlation coefficient: {tau:.2f}')

Check what data type each column has:

df.dtypes

Select columns by data type:

df.select_dtypes(include=[np.number])
df.select_dtypes('object')
df.select_dtypes('float')

How many cells are empty (NaN):

df.isnull().sum()

Show all missing cells graphically in Seaborn (violet heatmap; the gaps are the missing values):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10,8))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Rows with missing data in column AAA (with loc):

df.loc[df.AAA.isnull(), :]
df[df['Shape Reported'].isnull()]

How many NaN values there are in column AAA:

df.AAA.isnull().sum()

Show outliers; on the box plots, the dots are the outliers:

data.plot(kind="box", subplots=True, figsize=(15,5), title="Data with Outliers")

Function removing outliers (replaces them with NaN):

def outlier_removal(X, factor):      # factor e.g. 1.5
    X = pd.DataFrame(X).copy()
    for i in range(X.shape[1]):
        x = pd.Series(X.iloc[:, i]).copy()
        q1 = x.quantile(0.25)
        q3 = x.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - (factor * iqr)
        upper_bound = q3 + (factor * iqr)
        X.iloc[((X.iloc[:, i] < lower_bound) | (X.iloc[:, i] > upper_bound)), i] = np.nan
    return X
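A quick usage example of the function above (the 100 in column x falls outside the IQR fence and becomes NaN):

import numpy as np
import pandas as pd

data = pd.DataFrame({"x": [1, 2, 3, 2, 100], "y": [5, 6, 5, 7, 6]})
clean = outlier_removal(data, factor=1.5)   # outliers become NaN
print(clean)
print(clean.dropna())                       # ...and can then be dropped or imputed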

Stacking in Trident 1.1 [Stroke_Prediction.csv]
https://sigmaquality.pl/uncategorized/stacking-in-trident-1-1-stroke_prediction-csv-090520201852/ (Sat, 09 May 2020)
090520201852

Someone recently told me that I do not write enough, so I will write:
It is very nice when we have an AUC of 85%. Unfortunately, this model is fit for the trash and needs to be improved a bit. This is the model that was supposed to find who had a stroke, and it said that no one had a stroke, because positive cases make up only 1-2% of the data.

I got so hooked that I started to contribute to stackoverflow!

https://stackoverflow.com/questions/31417487/sklearn-logisticregression-and-changing-the-default-threshold-for-classification/61644649#61644649

In [1]:
# https://github.com/dawidkopczyk/blog/blob/master/stacking.py

# Classification Assessment
def Classification_Assessment(model ,Xtrain, ytrain, Xtest, ytest):
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
    from sklearn.metrics import accuracy_score
    
    import scikitplot as skplt
    from plot_metric.functions import BinaryClassification
    from sklearn.metrics import precision_recall_curve

       
    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Confusion Matrix Test data")
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print('Valuation for test data only:')
    print(classification_report(ytest, model.predict(Xtest)))
    
    ## ----------AUC-----------------------------------------
     
    print('---------------------') 
    AUC_train_1 = metrics.roc_auc_score(ytrain, model.predict_proba(Xtrain)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(ytest, model.predict_proba(Xtest)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print('---------------------')    
    
    print("Accuracy Training data:     ", np.round(accuracy_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Accuracy Test data:         ", np.round(accuracy_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("----------------------------------------------------------------------")
    print('Valuation for test data only:')

    y_probas1 = model.predict_proba(Xtest)[:,1]
    y_probas2 = model.predict_proba(Xtest)

### ---plot_roc_curve--------------------------------------------------------
    plt.figure(figsize=(13,4))

    plt.subplot(1, 2, 1)
    bc = BinaryClassification(ytest, y_probas1, labels=["Class 1", "Class 2"])
    bc.plot_roc_curve() 


### --------precision_recall_curve------------------------------------------

    plt.subplot(1, 2, 2)
    precision, recall, thresholds = precision_recall_curve(ytest, y_probas1)

    plt.plot(recall, precision, marker='.', label=model)
    plt.title('Precision recall curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.legend(loc=(-0.30, -0.6))
    plt.show()

## ----------plot_roc-----------------------------------------

    skplt.metrics.plot_roc(ytest, y_probas2)
In [2]:
# General
import numpy as np

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB 

# Utilities
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score
from copy import copy as make_copy

from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score
from plot_metric.functions import BinaryClassification
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix


import warnings   
SEED = 2018
warnings.filterwarnings("ignore")
In [3]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/Stroke_Prediction.csv')
print(df.shape)
df.head(2)
(43400, 12)
Out[3]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke
0 31153 Male 1104.0 0 0 No children Rural 95.12 18.0 NaN 0
1 30650 Male 21204.0 1 0 Yes Private Urban 87.96 39.2 never smoked 0
In [4]:
import numpy as np

a, b = df.shape     # <- number of rows and columns


print('DISCRETE FUNCTIONS CODED')
print('------------------------')
for col in df.columns[1:]:            # skip column 0 ('ID')
    if df[col].dtypes == np.object:   # every object column...
        print(col, "---", df[col].dtypes)
        df[col] = pd.Categorical(df[col]).codes   # ...is encoded as integer codes
DISCRETE FUNCTIONS CODED
------------------------
Gender --- object
Ever_Married --- object
Type_Of_Work --- object
Residence --- object
Smoking_Status --- object
In [5]:
del df['ID']
df = df.dropna(how='any')
df.isnull().sum()
Out[5]:
Gender            0
Age_In_Days       0
Hypertension      0
Heart_Disease     0
Ever_Married      0
Type_Of_Work      0
Residence         0
Avg_Glucose       0
BMI               0
Smoking_Status    0
Stroke            0
dtype: int64
In [6]:
df.shape
Out[6]:
(41938, 11)
In [7]:
y = df['Stroke']
X = df.drop('Stroke', axis=1)
In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)
# If this raises an error, remove stratify=y.
In [9]:
y_train.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7f7a30b850>
In [10]:
X_train = X_train.values
X_test = X_test.values
y_train = y_train.values
y_test = y_test.values

Define Base (level 0) and Stacking (level 1) estimators

In [11]:
base_clf = [LogisticRegression(), RandomForestClassifier(),  ### the base models to be trained
            AdaBoostClassifier(), GaussianNB()]

stck_clf = LogisticRegression()  ### stacking (level 1) is done with LogisticRegression
#stck_clf = RandomForestClassifier()

Evaluate Base estimators separately

In [12]:
## Preliminary evaluation of the base estimators (models)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

for t in base_clf:
    
    # Set seed
    if 'random_state' in t.get_params().keys():  # check whether the model exposes this hyperparameter
        t.set_params(random_state=SEED)          # keep the model's default (factory) parameters,
                                                 # fixing only the random seed for reproducibility
    
    # Fit model
    t.fit(X_train, y_train)      # fit the next model from the loop
    
    # Predict
    y_pred = t.predict(X_test)   # predictions of the current model
    
    # Valuation
    acc = accuracy_score(y_test, y_pred)
    #pre = precision_score(y_test, y_pred,average = 'macro')
    #auc = roc_auc_score(y_test, y_pred)
    
    print('{} accuracy: {:.2f}'.format(t.__class__.__name__, acc*100))
    
    plt.figure(figsize=(7,3))
    y_probas1 = t.predict_proba(X_test)[:,1]
    bc = BinaryClassification(y_test, y_probas1, labels=[t.__class__.__name__]).plot_roc_curve()
    plt.show()
    AUC_train_1 = metrics.roc_auc_score(y_train, t.predict_proba(X_train)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(y_test, t.predict_proba(X_test)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print(classification_report(y_test, t.predict(X_test)))
    print('===============================================================')
LogisticRegression accuracy: 98.43
AUC_train: 0.690
AUC_test:  0.688
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
RandomForestClassifier accuracy: 98.46
AUC_train: 1.000
AUC_test:  0.803
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
AdaBoostClassifier accuracy: 98.46
AUC_train: 0.871
AUC_test:  0.856
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
GaussianNB accuracy: 94.93
AUC_train: 0.840
AUC_test:  0.847
              precision    recall  f1-score   support

           0       0.99      0.96      0.97      8259
           1       0.07      0.19      0.10       129

    accuracy                           0.95      8388
   macro avg       0.53      0.57      0.54      8388
weighted avg       0.97      0.95      0.96      8388

===============================================================

Create Hold Out predictions (meta-features)

In [13]:
def hold_out_predict(clf, X, y, cv):
        
    """Performing cross validation hold out predictions for stacking"""
    
    # DETERMINE THE DIMENSIONS
    n_classes = len(np.unique(y))                      # check the classes: len(np.unique(y)) = 2
    meta_features = np.zeros((X.shape[0], n_classes))  # BUILD THE SKELETON OF THE META-FEATURE MATRIX:
                                                       # one row per observation and 2 COLUMNS,
                                                       # filled entirely with zeros
    n_splits = cv.get_n_splits(X, y)                   # returns the number of splitting iterations of the cross-validator
    
    # Loop over folds
    print("Starting hold out prediction with {} splits for {}.".format(n_splits, clf.__class__.__name__))
    for train_idx, hold_out_idx in cv.split(X, y): 
        
        # Split data
        X_train = X[train_idx]                         # overwrite X_train inside the loop
        y_train = y[train_idx]                         # overwrite y_train inside the loop
        X_hold_out = X[hold_out_idx]
        
        # Fit estimator to K-1 parts and predict on hold out part
        est = make_copy(clf)
        est.fit(X_train, y_train)
        y_hold_out_pred = est.predict_proba(X_hold_out)
        
        # Fill in meta features
        meta_features[hold_out_idx] = y_hold_out_pred

    return meta_features                               # out-of-fold predicted probabilities, one column per class

Create meta-features for training data

In [14]:
# Define 6-fold CV     ## any number of folds can be used
cv = KFold(n_splits=6, random_state=SEED)    ## sets the number of splits in cross-validation

# Loop over classifier to produce meta features
meta_train = []
for clf in base_clf:
    
    # Create hold out predictions for a classifier
    meta_train_clf = hold_out_predict(clf, X_train, y_train, cv)
    
    # Remove redundant column
    meta_train_clf = np.delete(meta_train_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_train.append(meta_train_clf)
    
meta_train = np.array(meta_train).T 
Starting hold out prediction with 6 splits for LogisticRegression.
Starting hold out prediction with 6 splits for RandomForestClassifier.
Starting hold out prediction with 6 splits for AdaBoostClassifier.
Starting hold out prediction with 6 splits for GaussianNB.

Create meta-features for testing data

In [15]:
meta_test = []
for i in base_clf:
    
    # Create hold out predictions for a classifier
    i.fit(X_train, y_train)
    meta_test_clf = i.predict_proba(X_test)
    
    # Remove redundant column
    meta_test_clf = np.delete(meta_test_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_test.append(meta_test_clf)
    
meta_test = np.array(meta_test).T 

Predict on Stacking Classifier

In [16]:
# Set seed
if 'random_state' in stck_clf.get_params().keys():
    stck_clf.set_params(random_state=SEED)

# Optional (Add original features to meta)
original_flag = False
if original_flag:
    meta_train = np.concatenate((meta_train, X_train), axis=1)
    meta_test = np.concatenate((meta_test, X_test), axis=1)

# Fit model
stck_clf.fit(meta_train, y_train)

# Predict
y_pred = stck_clf.predict(meta_test)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred,average = 'macro')
auc = roc_auc_score(y_test, y_pred)

print('Stacking {} AUC: {:.4f}'.format(stck_clf.__class__.__name__, acc*100))   # note: acc, not auc, is printed here
Stacking LogisticRegression AUC: 98.4621
In [17]:
Classification_Assessment(stck_clf ,meta_train, y_train, meta_test, y_test)
Recall Training data:      0.0
Precision Training data:   0.0
----------------------------------------------------------------------
Recall Test data:          0.0
Precision Test data:       0.0
----------------------------------------------------------------------
Confusion Matrix Test data
[[8259    0]
 [ 129    0]]
----------------------------------------------------------------------
Valuation for test data only:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

---------------------
AUC_train: 0.840
AUC_test:  0.846
---------------------
Accuracy Training data:      0.9847
Accuracy Test data:          0.9846
----------------------------------------------------------------------
Valuation for test data only:

OVERSAMPLING

First, a hefty homemade definition:
In [18]:
def oversampling(ytrain, Xtrain):
    import matplotlib.pyplot as plt
    
    global Xtrain_OV
    global ytrain_OV

    class1 = np.round((sum(ytrain == 1)/(sum(ytrain == 0)+sum(ytrain == 1))),decimals=2)*100
    class0 = np.round((sum(ytrain == 0)/(sum(ytrain == 0)+sum(ytrain == 1))),decimals=2)*100
    
    print("y = 0: ", sum(ytrain == 0), '-------', class0, '%')
    print("y = 1: ", sum(ytrain == 1), '-------', class1, '%')
    print('--------------------------------------------------------')
    
    ytrain.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
    plt.show()
    print()
    
    # how many copies of the minority class are needed to balance the classes
    Proporcja = sum(ytrain == 0) / sum(ytrain == 1)
    Proporcja = np.round(Proporcja, decimals=0)
    Proporcja = Proporcja.astype(int)
       
    # replicate the minority-class rows Proporcja times and append them
    ytrain_OV = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0) 
    Xtrain_OV = pd.concat([Xtrain.loc[ytrain==1, :]] * Proporcja, axis = 0)
    
    ytrain_OV = pd.concat([ytrain, ytrain_OV], axis = 0).reset_index(drop = True)
    Xtrain_OV = pd.concat([Xtrain, Xtrain_OV], axis = 0).reset_index(drop = True)
    
    Xtrain_OV = pd.DataFrame(Xtrain_OV)
    ytrain_OV = pd.DataFrame(ytrain_OV)
    
    print("Before oversampling Xtrain:     ", Xtrain.shape)
    print("Before oversampling ytrain:     ", ytrain.shape)
    print('--------------------------------------------------------')
    print("After oversampling Xtrain_OV:  ", Xtrain_OV.shape)
    print("After oversampling ytrain_OV:  ", ytrain_OV.shape)
    print('--------------------------------------------------------')
    
    ax = plt.subplot(1, 2, 1)
    ytrain.value_counts(dropna = False, normalize=True).plot(kind='pie',title='Before oversampling')
    plt.show()
    
    # rebuild the oversampled target just for the second pie chart
    y_after = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0)
    y_after = pd.concat([ytrain, y_after], axis = 0).reset_index(drop = True)
    ax = plt.subplot(1, 2, 2)
    y_after.value_counts(dropna = False, normalize=True).plot(kind='pie',title='After oversampling')
    plt.show()

Read the data in again.

In [19]:
y = df['Stroke']
X = df.drop('Stroke', axis=1)
In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)
# If this raises an error, remove stratify=y.

Run the oversampling function on the training data.

In [21]:
oversampling(y_train, X_train)
y = 0:  33036 ------- 98.0 %
y = 1:  514 ------- 2.0 %
--------------------------------------------------------

Before oversampling Xtrain:      (33550, 10)
Before oversampling ytrain:      (33550,)
--------------------------------------------------------
After oversampling Xtrain_OV:   (66446, 10)
After oversampling ytrain_OV:   (66446, 1)
--------------------------------------------------------
In [22]:
X_train = Xtrain_OV.values
X_test = X_test.values
y_train = ytrain_OV.values
y_test = y_test.values

Define Base (level 0) and Stacking (level 1) estimators

In [23]:
base_clf = [LogisticRegression(), RandomForestClassifier(),  ### the base models to be trained
            AdaBoostClassifier(), GaussianNB()]

stck_OV = LogisticRegression()  ### stacking (level 1) is done with LogisticRegression
#stck_clf = RandomForestClassifier()
#stck_clf = RandomForestClassifier()

Evaluate Base estimators separately

In [24]:
## Preliminary evaluation of the base estimators (models)
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_score
from sklearn.metrics import accuracy_score

for t in base_clf:
    
    # Set seed
    if 'random_state' in t.get_params().keys():  # check whether the model exposes this hyperparameter
        t.set_params(random_state=SEED)          # keep the model's default (factory) parameters,
                                                 # fixing only the random seed for reproducibility
    
    # Fit model
    t.fit(X_train, y_train)      # fit the next model from the loop
    
    # Predict
    y_pred = t.predict(X_test)   # predictions of the current model
    
    # Valuation
    acc = accuracy_score(y_test, y_pred)
    #pre = precision_score(y_test, y_pred,average = 'macro')
    #auc = roc_auc_score(y_test, y_pred)
    
    print('{} accuracy: {:.2f}'.format(t.__class__.__name__, acc*100))
    
    plt.figure(figsize=(7,3))
    y_probas1 = t.predict_proba(X_test)[:,1]
    bc = BinaryClassification(y_test, y_probas1, labels=[t.__class__.__name__]).plot_roc_curve()
    plt.show()
    AUC_train_1 = metrics.roc_auc_score(y_train, t.predict_proba(X_train)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(y_test, t.predict_proba(X_test)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print(classification_report(y_test, t.predict(X_test)))
    print('===============================================================')
LogisticRegression accuracy: 67.26
AUC_train: 0.817
AUC_test:  0.819
              precision    recall  f1-score   support

           0       1.00      0.67      0.80      8259
           1       0.04      0.83      0.07       129

    accuracy                           0.67      8388
   macro avg       0.52      0.75      0.44      8388
weighted avg       0.98      0.67      0.79      8388

===============================================================
RandomForestClassifier accuracy: 98.39
AUC_train: 1.000
AUC_test:  0.775
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

===============================================================
AdaBoostClassifier accuracy: 70.68
AUC_train: 0.871
AUC_test:  0.849
              precision    recall  f1-score   support

           0       1.00      0.70      0.83      8259
           1       0.04      0.84      0.08       129

    accuracy                           0.71      8388
   macro avg       0.52      0.77      0.45      8388
weighted avg       0.98      0.71      0.81      8388

===============================================================
GaussianNB accuracy: 71.15
AUC_train: 0.840
AUC_test:  0.847
              precision    recall  f1-score   support

           0       1.00      0.71      0.83      8259
           1       0.04      0.84      0.08       129

    accuracy                           0.71      8388
   macro avg       0.52      0.77      0.46      8388
weighted avg       0.98      0.71      0.82      8388

===============================================================

Create Hold Out predictions (meta-features)

In [25]:
def hold_out_predict(clf, X, y, cv):
        
    """Performing cross validation hold out predictions for stacking"""
    
    # DETERMINE THE DIMENSIONS
    n_classes = len(np.unique(y))                      # check the classes: len(np.unique(y)) = 2
    meta_features = np.zeros((X.shape[0], n_classes))  # BUILD THE SKELETON OF THE META-FEATURE MATRIX:
                                                       # one row per observation and 2 COLUMNS,
                                                       # filled entirely with zeros
    n_splits = cv.get_n_splits(X, y)                   # returns the number of splitting iterations of the cross-validator
    
    # Loop over folds
    print("Starting hold out prediction with {} splits for {}.".format(n_splits, clf.__class__.__name__))
    for train_idx, hold_out_idx in cv.split(X, y): 
        
        # Split data
        X_train = X[train_idx]                         # overwrite X_train inside the loop
        y_train = y[train_idx]                         # overwrite y_train inside the loop
        X_hold_out = X[hold_out_idx]
        
        # Fit estimator to K-1 parts and predict on hold out part
        est = make_copy(clf)
        est.fit(X_train, y_train)
        y_hold_out_pred = est.predict_proba(X_hold_out)
        
        # Fill in meta features
        meta_features[hold_out_idx] = y_hold_out_pred

    return meta_features                               # out-of-fold predicted probabilities, one column per class

Create meta-features for training data

In [26]:
# Define 6-fold CV     ## any number of folds can be used
cv = KFold(n_splits=6, random_state=SEED)    ## sets the number of splits in cross-validation

# Loop over classifier to produce meta features
meta_train = []
for clf in base_clf:
    
    # Create hold out predictions for a classifier
    meta_train_clf = hold_out_predict(clf, X_train, y_train, cv)
    
    # Remove redundant column
    meta_train_clf = np.delete(meta_train_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_train.append(meta_train_clf)
    
meta_train = np.array(meta_train).T 
Starting hold out prediction with 6 splits for LogisticRegression.
Starting hold out prediction with 6 splits for RandomForestClassifier.
Starting hold out prediction with 6 splits for AdaBoostClassifier.
Starting hold out prediction with 6 splits for GaussianNB.

Create meta-features for testing data

In [27]:
meta_test = []
for i in base_clf:
    
    # Create hold out predictions for a classifier
    i.fit(X_train, y_train)
    meta_test_clf = i.predict_proba(X_test)
    
    # Remove redundant column
    meta_test_clf = np.delete(meta_test_clf, 0, axis=1).ravel()
    
    # Gather meta training data
    meta_test.append(meta_test_clf)
    
meta_test = np.array(meta_test).T 

Predict on Stacking Classifier

In [28]:
# Set seed
if 'random_state' in stck_OV.get_params().keys():
    stck_OV.set_params(random_state=SEED)

# Optional (Add original features to meta)
original_flag = False
if original_flag:
    meta_train = np.concatenate((meta_train, X_train), axis=1)
    meta_test = np.concatenate((meta_test, X_test), axis=1)

# Fit model
stck_OV.fit(meta_train, y_train)

# Predict
y_pred = stck_OV.predict(meta_test)

# Calculate accuracy
acc = accuracy_score(y_test, y_pred)
pre = precision_score(y_test, y_pred,average = 'macro')
auc = roc_auc_score(y_test, y_pred)

print('Stacking {} AUC: {:.4f}'.format(stck_OV.__class__.__name__, acc*100))   # note: acc, not auc, is printed here
Stacking LogisticRegression AUC: 98.4621
In [29]:
Classification_Assessment(stck_OV ,meta_train, ytrain_OV, meta_test, y_test)
Recall Training data:      1.0
Precision Training data:   0.9994
----------------------------------------------------------------------
Recall Test data:          0.0
Precision Test data:       0.0
----------------------------------------------------------------------
Confusion Matrix Test data
[[8259    0]
 [ 129    0]]
----------------------------------------------------------------------
Valuation for test data only:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      8259
           1       0.00      0.00      0.00       129

    accuracy                           0.98      8388
   macro avg       0.49      0.50      0.50      8388
weighted avg       0.97      0.98      0.98      8388

---------------------
AUC_train: 1.000
AUC_test:  0.406
---------------------
Accuracy Training data:      0.9997
Accuracy Test data:          0.9846
----------------------------------------------------------------------
Valuation for test data only:

If we were processing Titanic data, I would expect less of a catastrophe like this. All in all, I cannot explain what is wrong, because it should work normally: in the base models the threshold point (the red point) sat at the top of the ROC curve. Unfortunately, somewhere in building the second-level stacking classifier something got lost. It is not a one-off slip, because I repeated this analysis several times. Something is wrong, I do not know what, and I have no idea why.
Now we will start playing with threshold sensitivity control so that the model finally begins classifying observations as 1.

Based on the previous code, I wrote a program that adjusts the threshold. A thicket of numbers and names begins here, so I introduced colors into the printout.
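Before the full function, a minimal sketch of the idea, using the stacked model and meta-features defined above: lowering the decision threshold below the default 0.5 makes more observations fall into class 1, trading precision for recall.

import numpy as np
from sklearn.metrics import precision_score, recall_score

proba = stck_clf.predict_proba(meta_test)[:, 1]   # P(class = 1) from the stacker
for thr in (0.5, 0.3, 0.1):
    pred = np.where(proba >= thr, 1, 0)           # reclassify at threshold thr
    print(thr,
          'recall:', np.round(recall_score(y_test, pred), 3),
          'precision:', np.round(precision_score(y_test, pred), 3))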

Threshold

In [30]:
def Classification_Assessment_by_Threshold(model ,Xtrain, ytrain, Xtest, ytest, threshold):
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import metrics
    from sklearn.metrics import classification_report, confusion_matrix
    from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
    from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
    from sklearn.metrics import accuracy_score
    import scikitplot as skplt
    from plot_metric.functions import BinaryClassification
    from sklearn.metrics import precision_recall_curve
    
    ### --------color------------------
    import colorama
    from colorama import Fore, Style

    ### ---------------New Threshold----------------------------------------   
    
    PRED_Threshold = np.where((model.predict_proba(Xtest)[:, 1])>= threshold,1,0)
       
    
    print("Recall Training data:     ", np.round(recall_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Precision Training data:  ", np.round(precision_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("----------------------------------------------------------------------")
    print("Recall Test data:         ", np.round(recall_score(ytest, model.predict(Xtest)), decimals=4)) 
    print("Precision Test data:      ", np.round(precision_score(ytest, model.predict(Xtest)), decimals=4))    
    print("----------------------------------------------------------------------")    
    
    print(Fore.BLUE + "Recall Test data (new_threshold):         ", np.round(recall_score(ytest, PRED_Threshold), decimals=4)) 
    print("Precision Test data (new_threshold):      ", np.round(precision_score(ytest,PRED_Threshold), decimals=4))
    print("----------------------------------------------------------------------")
    print(Style.RESET_ALL)
    print(confusion_matrix(ytest, model.predict(Xtest)))
    print(Fore.BLUE +"Confusion Matrix Test data - new_threshold: ",threshold)
    print(confusion_matrix(ytest, PRED_Threshold))
    print(Style.RESET_ALL)
    print("----------------------------------------------------------------------")
    
 # https://stackoverflow.com/questions/39473297/how-do-i-print-colored-output-with-python-3   
    print('Valuation for test data only:')
    print(classification_report(ytest, model.predict(Xtest)))
    print("----------------------------------------------------------------------")
    print(Fore.BLUE +'Valuation for test data only (new_threshold):',threshold)
    print(classification_report(ytest, PRED_Threshold))
    print(Style.RESET_ALL)
    ## ----------AUC-----------------------------------------
     
    print('---------------------') 
    AUC_train_1 = metrics.roc_auc_score(ytrain, model.predict_proba(Xtrain)[:,1])
    print('AUC_train: %.3f' % AUC_train_1)
    AUC_test_1 = metrics.roc_auc_score(ytest, model.predict_proba(Xtest)[:,1])
    print('AUC_test:  %.3f' % AUC_test_1)
    print(Fore.BLUE + 'AUC_test with new_threshold:', threshold)
    AUC_test_3 = metrics.roc_auc_score(ytest, PRED_Threshold) 
    print('AUC_test:  %.3f' % AUC_test_3)
    print('---------------------')    
    print(Style.RESET_ALL)
    
    print("Accuracy Training data:              ", np.round(accuracy_score(ytrain, model.predict(Xtrain)), decimals=4))
    print("Accuracy Test data                 : ", np.round(accuracy_score(ytest, model.predict(Xtest)), decimals=4))
    print(Fore.BLUE +"Accuracy Test data (new_threshold) : ", np.round(accuracy_score(ytest, PRED_Threshold), decimals=4)) 
    print("----------------------------------------------------------------------")
    print(Style.RESET_ALL)
    print('Valuation for test data only:')

    y_probas1 = PRED_Threshold
    y_probas3 = model.predict_proba(Xtest)[:,1]
    y_probas2 = model.predict_proba(Xtest)

### ---plot_roc_curve--------------------------------------------------------
    plt.figure(figsize=(13,4))

    plt.subplot(1, 2, 1)
    bc = BinaryClassification(ytest, y_probas1, labels=["Class 1", "Class 2"])
    bc2 = BinaryClassification(ytest, y_probas3, labels=["Class 1", "Class 2"])
    bc.plot_roc_curve()
    bc2.plot_roc_curve() 
    #plt.axvline(threshold, color = 'blue', linestyle = '--', label = 'new threshold')
    # plt.axvline(0.5, color = '#00C251', linestyle = '--', label = 'threshold = 0.5')

### --------precision_recall_curve------------------------------------------

    plt.subplot(1, 2, 2)
    precision, recall, thresholds = precision_recall_curve(ytest, y_probas1)
    plt.plot(recall, precision, marker='.', label=model)
    plt.title('Precision recall curve')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    #plt.legend(loc=(-0.30, -0.7))
    plt.show()

## ----------plot_roc-----------------------------------------

    skplt.metrics.plot_roc(ytest, y_probas2)
In [31]:
threshold = 0.3
Classification_Assessment_by_Threshold(stck_clf ,meta_train, y_train, meta_test, y_test, threshold)
Recall Training data:      0.8636
Precision Training data:   0.8517
----------------------------------------------------------------------
Recall Test data:          0.5504
Precision Test data:       0.0744
----------------------------------------------------------------------
Recall Test data (new_threshold):          0.7752
Precision Test data (new_threshold):       0.0522
----------------------------------------------------------------------

[[7376  883]
 [  58   71]]
Confusion Matrix Test data - new_threshold:  0.3
[[6442 1817]
 [  29  100]]

----------------------------------------------------------------------
Valuation for test data only:
              precision    recall  f1-score   support

           0       0.99      0.89      0.94      8259
           1       0.07      0.55      0.13       129

    accuracy                           0.89      8388
   macro avg       0.53      0.72      0.54      8388
weighted avg       0.98      0.89      0.93      8388

----------------------------------------------------------------------
Valuation for test data only (new_threshold): 0.3
              precision    recall  f1-score   support

           0       1.00      0.78      0.87      8259
           1       0.05      0.78      0.10       129

    accuracy                           0.78      8388
   macro avg       0.52      0.78      0.49      8388
weighted avg       0.98      0.78      0.86      8388


---------------------
AUC_train: 0.949
AUC_test:  0.854
AUC_test with new_threshold: 0.3
AUC_test:  0.778
---------------------

Accuracy Training data:               0.8558
Accuracy Test data                 :  0.8878
Accuracy Test data (new_threshold) :  0.7799
----------------------------------------------------------------------

Valuation for test data only:

Perfect model: Random forest classifier (1)
https://sigmaquality.pl/uncategorized/perfect-model-random-forest-classifier-1-230320201052/ (Mon, 23 Mar 2020)

Part 1: Determining the depth of trees using visualization

230320201052

 

In [1]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
In [2]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/kaggletrain.csv')
df = df.dropna(how='any')
print(df.columns)
print(df.shape)
df.dtypes
Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
(183, 13)
Out[2]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [3]:
del df['Unnamed: 0']
df.columns
Out[3]:
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [4]:
df.head(3)
Out[4]:
  PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
 

Encoding categorical (object) data as numeric codes

In [5]:
df['Sex'] = pd.Categorical(df.Sex).codes
df['Ticket'] = pd.Categorical(df.Ticket).codes
df['Cabin'] = pd.Categorical(df.Cabin).codes
df['Embarked'] = pd.Categorical(df.Embarked).codes

df.dtypes
Out[5]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex               int8
Age            float64
SibSp            int64
Parch            int64
Ticket           int16
Fare           float64
Cabin            int16
Embarked          int8
dtype: object
In [6]:
df['Sex']=df['Sex'].astype('int64')
df['Age']=df['Age'].astype('int64')
df.dtypes
Out[6]:
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex              int64
Age              int64
SibSp            int64
Parch            int64
Ticket           int16
Fare           float64
Cabin            int16
Embarked          int8
dtype: object
 

Selection of variables divided into test and training set

In [7]:
import numpy as np
from sklearn.model_selection import train_test_split  

df2 = df[['Sex','Age','Pclass','Survived']]
X = df2[['Sex','Age']]
y = df2['Survived']

print('X :',X.shape)
print('y :',y.shape)
#Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
X : (183, 2)
y : (183,)
 

Replacing dataframe with array

In [8]:
import numpy as np

y = np.asarray(y)
X = np.asarray(X)
In [9]:
print('X:',X.shape)
print('y:',y.shape)
X: (183, 2)
y: (183,)
 

Data normalization (standardization)

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)

 

How Random Forest classifies according to the depth of the tree

In [10]:
from helpers_05_08 import visualize_tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_blobs

        
fig, ax = plt.subplots(1, 4, figsize=(16, 3))
fig.subplots_adjust(left=0.02, right=0.98, wspace=0.1)

#X, y = make_blobs(n_samples=300, centers=4,
#                  random_state=0, cluster_std=1.0)

for axi, depth in zip(ax, range(1,5)):
    model = DecisionTreeClassifier(max_depth=depth)
    visualize_tree(model, X, y, ax=axi)
    axi.set_title('depth = {0}'.format(depth))

    
     
/home/wojciech/ATOS/helpers_05_08.py:34: UserWarning: The following kwargs were not used by contour: 'clim'
  zorder=1)
(the same warning is emitted once per subplot)
 

Random Forest model, depth 4

In [11]:
## MODEL    
from sklearn.ensemble import RandomForestClassifier

RF4 = RandomForestClassifier(max_depth=4, random_state=0)
RF4.fit(X, y)

# Predicting the Test set results
y_pred4 = RF4.predict(X)    
    

    
    
from matplotlib.colors import ListedColormap 
  
X_set, y_set = X, y 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, 
             
             RF4.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'grey'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max())

  
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j) 
  
plt.title('Random Forest (Training set)') 
plt.xlabel('Sex') # for Xlabel 
plt.ylabel('Age') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
 

First of all, women were saved from the Titanic disaster, as well as babies, young boys up to 20 years old and young men from 20 to 30 years old. This is how the model classifies passengers based on two variables: sex and age.

Visualization of the Random Forest classification using trees of depth 6

In [12]:
def visualize_classifier(model, X, y, ax=None, cmap='Reds'):
    ax = ax or plt.gca()
    
    # Plot the training points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap,
               clim=(y.min(), y.max()), zorder=3)
    ax.axis('tight')
    ax.axis('off')
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    
    # fit the estimator
    model.fit(X, y)
    xx, yy = np.meshgrid(np.linspace(*xlim, num=200),
                         np.linspace(*ylim, num=200))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    # Create a color plot with the results
    n_classes = len(np.unique(y))
    contours = ax.contourf(xx, yy, Z, alpha=0.3,
                           levels=np.arange(n_classes + 1) - 0.5,
                           cmap=cmap, clim=(y.min(), y.max()),
                           zorder=1)

    ax.set(xlim=xlim, ylim=ylim)
    

## MODEL    
from sklearn.ensemble import RandomForestClassifier

RF6 = RandomForestClassifier(max_depth=6, random_state=0)
RF6.fit(X, y)

# Predicting the Test set results
y_pred6 = RF6.predict(X)    
    
    
visualize_classifier(RF6, X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:23: UserWarning: The following kwargs were not used by contour: 'clim'
In [13]:
visualize_classifier(DecisionTreeClassifier(), X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:23: UserWarning: The following kwargs were not used by contour: 'clim'
 

We run a forest of 240 trees, each of depth 6

In [14]:
## MODEL    
from sklearn.ensemble import RandomForestClassifier

RF6 = RandomForestClassifier(n_estimators=240, max_depth=6, random_state=0)
RF6.fit(X, y)

# Predicting the Test set results
y_pred6 = RF6.predict(X)    
    





from matplotlib.colors import ListedColormap 
  
X_set, y_set = X, y 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, 
             
             RF6.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'grey'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max())

  
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = ListedColormap(('red', 'green', 'blue'))(i), label = j) 
  
plt.title('Random Forest (Training set)') 
plt.xlabel('Sex') # for Xlabel 
plt.ylabel('Age') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
'c' argument looks like a single numeric RGB or RGBA sequence, which should be avoided as value-mapping will have precedence in case its length matches with 'x' & 'y'.  Please use a 2-D array with a single row if you really want to specify the same RGB or RGBA value for all points.
 

For the variables 'Sex' and 'Age', increasing the number of trees beyond 100 has no effect.

In [16]:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

## source: https://www.dezyre.com/recipes/plot-validation-curve-in-python

## Convert the data frame to a matrix

import numpy as np
X = np.asarray(X)
Y = np.asarray(y)

digits = load_digits()
# Create feature matrix and target vector
# (note: this overwrites the Titanic X, y with the digits dataset)
X, y = digits.data, digits.target
# Plot Validation Curve
    
# Create range of values for parameter
param_range = np.arange(1, 275, 2)

# Calculate accuracy on training and test set using range of parameter values
train_scores, test_scores = validation_curve(RandomForestClassifier(max_depth=6),
                               X, y, param_name="n_estimators", param_range=param_range,
                               cv=4, scoring="accuracy", n_jobs=-1)

  # Calculate mean and standard deviation for training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

    # Calculate mean and standard deviation for test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

    # Plot mean accuracy scores for training and test sets
plt.subplots(1, figsize=(17,5))
plt.plot(param_range, train_mean, label="Training score", color="black")
plt.plot(param_range, test_mean, label="Cross-validation score", color="dimgrey")

    # Plot accurancy bands for training and test sets
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, color="gray")
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, color="gainsboro")

    # Create plot    
plt.title("Validation Curve With Random Forest")
plt.xlabel("Number Of Trees")
plt.ylabel("Accuracy Score")
plt.tight_layout()
plt.legend(loc="best")
plt.show()

How to use PCA in logistic regression? https://sigmaquality.pl/uncategorized/how-to-use-pca-in-logistic-regression-230320200907/ Mon, 23 Mar 2020 08:10:05 +0000

230320200907

Principal component analysis (PCA)

Sources:
https://jakevdp.github.io/PythonDataScienceHandbook/05.08-random-forests.html
https://www.geeksforgeeks.org/principal-component-analysis-with-python/
In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


df= pd.read_csv('/home/wojciech/Pulpit/1/Stroke_Prediction_NUM.csv')
print(df.shape)

df.head(5)
(29062, 20)
Out[1]:
Unnamed: 0 ID Gender Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years Age_years_10 Gender_C Ever_Married_C Type_Of_Work_C Residence_C Smoking_Status_C Age_years_10_C
0 0 30650 Male 1 0 Yes Private Urban 87.96 39.2 never smoked 0 58.093151 (53.126, 59.076] 1 1 2 1 1 5
1 1 57008 Female 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0 70.076712 (65.121, 74.11] 0 1 2 0 0 7
2 2 53725 Female 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0 52.041096 (48.082, 53.126] 0 1 2 1 0 4
3 3 41553 Female 0 1 Yes Self-employed Rural 243.53 27.0 never smoked 0 75.104110 (74.11, 82.137] 0 1 3 0 1 8
4 4 16167 Female 0 0 Yes Private Rural 77.67 32.3 smokes 0 32.024658 (29.055, 36.058] 0 1 2 0 2 1

Analysis of the class balance of the outcome variable

In [2]:
del df['Unnamed: 0']
df.Stroke.value_counts(dropna = False, normalize=True)
Out[2]:
0    0.981144
1    0.018856
Name: Stroke, dtype: float64
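
The classes are heavily imbalanced (about 1.9% positives). Besides the manual oversampling used below, a hedged alternative sketch is to let the loss function reweight the classes; this assumes scikit-learn's class_weight parameter is acceptable here, which the post itself does not use:

from sklearn.linear_model import LogisticRegression

# Reweight classes inversely to their frequency instead of duplicating rows
LR_w = LogisticRegression(solver='lbfgs', class_weight='balanced', max_iter=1000)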
In [3]:
df.columns
Out[3]:
Index(['ID', 'Gender', 'Hypertension', 'Heart_Disease', 'Ever_Married',
       'Type_Of_Work', 'Residence', 'Avg_Glucose', 'BMI', 'Smoking_Status',
       'Stroke', 'Age_years', 'Age_years_10', 'Gender_C', 'Ever_Married_C',
       'Type_Of_Work_C', 'Residence_C', 'Smoking_Status_C', 'Age_years_10_C'],
      dtype='object')

Split into training and test sets

In [4]:
df2 = df[['Hypertension','Heart_Disease','Avg_Glucose','BMI','Stroke','Age_years','Gender_C','Ever_Married_C','Type_Of_Work_C','Residence_C','Smoking_Status_C','Age_years_10_C']]
In [5]:
y = df2['Stroke']
X = df2.drop('Stroke', axis=1)
In [6]:
from sklearn.model_selection import train_test_split 
Xtrain, Xtest, ytrain, ytest = train_test_split(X,y, test_size=0.33, stratify = y, random_state = 148)

print ('X training set: ', Xtrain.shape)
print ('X test set:     ', Xtest.shape)
print ('y training set: ', ytrain.shape)
print ('y test set:     ', ytest.shape)
X training set:  (19471, 11)
X test set:      (9591, 11)
y training set:  (19471,)
y test set:      (9591,)
In [7]:
print("ytrain = 0: ", sum(ytrain == 0))
print("ytrain = 1: ", sum(ytrain == 1))
ytrain = 0:  19104
ytrain = 1:  367
In [8]:
Proporcja = sum(ytrain == 0) / sum(ytrain == 1)   # ratio of majority to minority class
Proporcja = np.round(Proporcja, decimals=0)
Proporcja = Proporcja.astype(int)
print('Number of 0 Stroke per 1 Stroke: ', Proporcja)
Number of 0 Stroke per 1 Stroke:  52
In [9]:
ytrain_OVSA = pd.concat([ytrain[ytrain==1]] * Proporcja, axis = 0) 
ytrain_OVSA.count()
Out[9]:
19084

We have oversampled the minority class by replicating the Stroke = 1 rows 52 times, so the training set now contains roughly as many positive as negative examples. Next we add the corresponding rows of independent variables to the training set.

In [10]:
Xtrain_OVSA = pd.concat([Xtrain.loc[ytrain==1, :]] * Proporcja, axis = 0)
ytrain_OVSA.count()
Out[10]:
19084
In [11]:
ytrain_OVSA = pd.concat([ytrain, ytrain_OVSA], axis = 0).reset_index(drop = True)
Xtrain_OVSA = pd.concat([Xtrain, Xtrain_OVSA], axis = 0).reset_index(drop = True)

print("ilość elementów w zbiorze Xtrain:     ", Xtrain.BMI.count())
print("ilość elementów w zbiorze Xtrain_OVSA: ", Xtrain_OVSA.BMI.count())
print("ilość elementów w zbiorze ytrain:     ", ytrain.count())
print("ilość elementów w zbiorze ytrain_OVSA: ", ytrain_OVSA.count())
ilość elementów w zbiorze Xtrain:      19471
ilość elementów w zbiorze Xtrain_OVSA:  38555
ilość elementów w zbiorze ytrain:      19471
ilość elementów w zbiorze ytrain_OVSA:  38555
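
For comparison, a minimal sketch of the same upsampling done with sklearn.utils.resample (my own alternative, not from the post, which uses plain pd.concat):

from sklearn.utils import resample
import pandas as pd

# Upsample the minority class so both classes are roughly balanced;
# resample keeps X and y row-aligned. Xtrain/ytrain come from the split above.
X_min, y_min = Xtrain[ytrain == 1], ytrain[ytrain == 1]
X_up, y_up = resample(X_min, y_min, replace=True,
                      n_samples=int(sum(ytrain == 0)), random_state=148)
Xtrain_bal = pd.concat([Xtrain[ytrain == 0], X_up]).reset_index(drop=True)
ytrain_bal = pd.concat([ytrain[ytrain == 0], y_up]).reset_index(drop=True)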

Class balance of the outcome variable after oversampling:

In [12]:
ytrain_OVSA.value_counts(dropna = False, normalize=True).plot(kind='pie')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff758b04950>

Logistic regression model

In [13]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain_OVSA, ytrain_OVSA) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)

Model assessment:

In [14]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain_OVSA, LR_Grid.predict(Xtrain_OVSA)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVSA, LR_Grid.predict(Xtrain_OVSA)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.7956
Precision Training data:   0.7522
----------------------------------------------------------------------
Recall Test data:          0.7735
Precision Test data:       0.0517
----------------------------------------------------------------------
Confusion Matrix Test data
[[6840 2570]
 [  41  140]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.73      0.84      9410
           1       0.05      0.77      0.10       181

    accuracy                           0.73      9591
   macro avg       0.52      0.75      0.47      9591
weighted avg       0.98      0.73      0.83      9591

auc 0.7501834770815108

Principal component analysis (PCA)

Standardization of Xtrain_OVSA and Xtest variables

In [15]:
from sklearn.preprocessing import StandardScaler 
sc = StandardScaler() 
  
X_train_PCA = sc.fit_transform(Xtrain_OVSA) 
X_test_PCA = sc.transform(Xtest)
In [16]:
print(X_train_PCA.shape)
print(X_test_PCA.shape)
(38555, 11)
(9591, 11)

PCA transformation of two variables

In [17]:
from sklearn.decomposition import PCA 
  
pca = PCA(n_components = 2) 
  
X_train_PCA2 = pca.fit_transform(X_train_PCA) 
X_test_PCA2 = pca.transform(X_test_PCA) 
In [18]:
pca.fit(X_train_PCA2)
Out[18]:
PCA(copy=True, iterated_power='auto', n_components=2, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)
In [19]:
explained_variance = pca.explained_variance_ratio_ 
explained_variance
Out[19]:
array([0.63231701, 0.36768299])
In [20]:
pca.components_
Out[20]:
array([[1., 0.],
       [0., 1.]])

Note: components_ is the identity matrix here only because pca was refit on its own two-dimensional output in the previous cell; the loadings with respect to the original 11 features were computed by the first fit_transform.
In [21]:
def draw_vector(v0, v1, ax=None):
    ax = ax or plt.gca()
    arrowprops=dict(arrowstyle='->',
                    linewidth=3,
                    color='red',
                    shrinkA=0, shrinkB=0)
    ax.annotate('', v1, v0, arrowprops=arrowprops)

# plot data
plt.scatter(X_test_PCA2[:, 0], X_test_PCA2[:, 1], alpha=0.3)
for length, vector in zip(pca.explained_variance_, pca.components_):
    v = vector * 3 * np.sqrt(length)
    draw_vector(pca.mean_, pca.mean_ + v)
plt.axis('equal');

We feed the two PCA components into the logistic regression model again

In [22]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid2 = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid2.fit(X_train_PCA2, ytrain_OVSA) 
y_pred_LRC2 = LR_Grid2.predict(X_test_PCA2)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [23]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain_OVSA, LR_Grid2.predict(X_train_PCA2)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVSA, LR_Grid2.predict(X_train_PCA2)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid2.predict(X_test_PCA2)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid2.predict(X_test_PCA2)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid2.predict(X_test_PCA2)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid2.predict(X_test_PCA2)))
y_pred_proba = LR_Grid2.predict_proba(X_test_PCA2)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC2)
auc = metrics.roc_auc_score(ytest, y_pred_LRC2)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.782
Precision Training data:   0.7434
----------------------------------------------------------------------
Recall Test data:          0.7901
Precision Test data:       0.051
----------------------------------------------------------------------
Confusion Matrix Test data
[[6751 2659]
 [  38  143]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.72      0.83      9410
           1       0.05      0.79      0.10       181

    accuracy                           0.72      9591
   macro avg       0.52      0.75      0.46      9591
weighted avg       0.98      0.72      0.82      9591

auc 0.7537417582094985
In [24]:
print(X_train_PCA2.shape)
print(ytrain.shape)
(38555, 2)
(19471,)

PCA slightly improved the AUC, from 0.750 to 0.754.

Cluster visualisation

Such a graphical presentation is possible only because there are now two variables (after the PCA transformation; before it there were 11). The AUC before and after the PCA transformation is similar, about 0.75. Now we can see what the class separation looks like.

For the training set

In [25]:
# Predicting the training set 
# result through scatter plot  
from matplotlib.colors import ListedColormap 
  
X_set, y_set = X_train_PCA2, ytrain_OVSA 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, LR_Grid2.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'lightgreen'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max()) 
  
# wrapping the colour in a list avoids the matplotlib single-RGB warning
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = [ListedColormap(('red', 'green', 'blue'))(i)], label = j) 
  
plt.title('Logistic Regression (Training set)') 
plt.xlabel('PC1') # for Xlabel 
plt.ylabel('PC2') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()

For the test set

In [26]:
# Predicting the test set 
# result through scatter plot  
from matplotlib.colors import ListedColormap 
  
X_set, y_set = X_test_PCA2, ytest 
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, 
                     stop = X_set[:, 0].max() + 1, step = 0.01), 
                     np.arange(start = X_set[:, 1].min() - 1, 
                     stop = X_set[:, 1].max() + 1, step = 0.01)) 
  
plt.contourf(X1, X2, LR_Grid2.predict(np.array([X1.ravel(), 
             X2.ravel()]).T).reshape(X1.shape), alpha = 0.75, 
             cmap = ListedColormap(('pink', 'white', 'lightgreen'))) 
  
plt.xlim(X1.min(), X1.max()) 
plt.ylim(X2.min(), X2.max()) 
  
# wrapping the colour in a list avoids the matplotlib single-RGB warning
for i, j in enumerate(np.unique(y_set)): 
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], 
                c = [ListedColormap(('red', 'green', 'blue'))(i)], label = j) 
  
plt.title('Logistic Regression (Test set)') 
plt.xlabel('PC1') # for Xlabel 
plt.ylabel('PC2') # for Ylabel 
plt.legend() # to show legend 
  
# show scatter plot 
plt.show()

PCA transformation to three components

In [27]:
from sklearn.decomposition import PCA 
  
pca3 = PCA(n_components = 3) 
  
X_train_PCA3 = pca3.fit_transform(X_train_PCA) 
X_test_PCA3 = pca3.transform(X_test_PCA) 
In [28]:
pca3.fit(X_train_PCA)
Out[28]:
PCA(copy=True, iterated_power='auto', n_components=3, random_state=None,
    svd_solver='auto', tol=0.0, whiten=False)

Explained variance of the top 3 components

The more variance the components explain, the more of the original information they retain.

In [29]:
explained_variance = pca3.explained_variance_ratio_ 
explained_variance
Out[29]:
array([0.20660751, 0.12013921, 0.10116037])
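
As a quick check, a short sketch (assuming the pca3 object fitted above is still in scope) of how much total variance the three components retain together:

import numpy as np

# Cumulative share of variance captured by the first 1, 2, 3 components
print(np.cumsum(pca3.explained_variance_ratio_))
# -> roughly [0.207, 0.327, 0.428]: three components keep about 43% of the variance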
In [30]:
pca.components_
Out[30]:
array([[1., 0.],
       [0., 1.]])

Note: this cell still prints the earlier two-component pca object, not pca3; pca3.components_ would be a 3 x 11 matrix of loadings on the original features.

Again, we feed the PCA components into the logistic regression model

In [31]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid3 = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid3.fit(X_train_PCA3, ytrain_OVSA) 
y_pred_LRC3 = LR_Grid3.predict(X_test_PCA3)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [32]:
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain_OVSA, LR_Grid3.predict(X_train_PCA3)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain_OVSA, LR_Grid3.predict(X_train_PCA3)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid3.predict(X_test_PCA3)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid3.predict(X_test_PCA3)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid3.predict(X_test_PCA3)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid3.predict(X_test_PCA3)))
y_pred_proba = LR_Grid3.predict_proba(X_test_PCA3)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC3)
auc = metrics.roc_auc_score(ytest, y_pred_LRC3)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.7875
Precision Training data:   0.7413
----------------------------------------------------------------------
Recall Test data:          0.7901
Precision Test data:       0.0499
----------------------------------------------------------------------
Confusion Matrix Test data
[[6690 2720]
 [  38  143]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.99      0.71      0.83      9410
           1       0.05      0.79      0.09       181

    accuracy                           0.71      9591
   macro avg       0.52      0.75      0.46      9591
weighted avg       0.98      0.71      0.82      9591

auc 0.7505005254783613
In [34]:
print(X_train_PCA3.shape)
print(ytrain.shape)
(38555, 3)
(19471,)

Part. 2 How to improve the classification model? Principal component analysis (PCA) https://sigmaquality.pl/uncategorized/part-2-how-to-improve-the-classification-model_-principal-component-analysis-pca-200320200904/ Fri, 20 Mar 2020 08:08:11 +0000

200320200904

In this case, the method did not improve the model. However, there are models for which PCA is an important means of improving performance.
We load the data from the Titanic dataset.

In [1]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/kaggletrain.csv')
df = df.dropna(how='any')
df.dtypes
Out[1]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [2]:
df.columns
Out[2]:
Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [3]:
df.head(3)
Out[3]:
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S

Encoding text columns as numeric codes

In [4]:
df['Sex'] = pd.Categorical(df.Sex).codes
df['Ticket'] = pd.Categorical(df.Ticket).codes
df['Cabin'] = pd.Categorical(df.Ticket).codes   # note: encodes Ticket, not Cabin; likely a copy-paste slip, kept as in the original run
df['Embarked'] = pd.Categorical(df.Embarked).codes

Selection of variables and split into training and test sets

In [5]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)

Data normalization (standardization)

PCA works best with a standardized feature set, so we normalize the features with StandardScaler.

In [6]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)

Principal component analysis (PCA)

In [7]:
from sklearn.decomposition import PCA

pca = PCA()
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)

We did not specify the number of components in the constructor, so all 9 components are returned for both the training and the test set.

The PCA class exposes explained_variance_ratio_, which returns the fraction of variance explained by each component.

In [8]:
explained_variance = pca.explained_variance_ratio_
In [9]:
SOK = np.round(explained_variance, decimals=2)
SOK
Out[9]:
array([0.25, 0.18, 0.18, 0.11, 0.1 , 0.07, 0.06, 0.04, 0.  ])
In [10]:
KOT = dict(zip(X, SOK))

KOT_sorted_keys = sorted(KOT, key=KOT.get, reverse=True)

for r in KOT_sorted_keys:
    print (r, KOT[r])
Pclass 0.25
Sex 0.18
Age 0.18
SibSp 0.11
Parch 0.1
Ticket 0.07
Fare 0.06
Cabin 0.04
Embarked 0.0
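
A caveat: explained_variance_ratio_ describes the principal components (ordered by variance), not the original columns, so pairing it with column names as above is only a rough shorthand. A sketch (under the assumption that the pca object and X from the cells above are still in scope) of how to inspect which original features actually load on each component:

import pandas as pd

# Rows are components, columns are the original features; large absolute
# values show which features dominate a given component
loadings = pd.DataFrame(pca.components_, columns=X.columns)
print(loadings.round(2))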

We look for the single best principal component for the model

In [11]:
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)
In [12]:
from sklearn.ensemble import RandomForestClassifier

RF4 = RandomForestClassifier(max_depth=2, random_state=0)
RF4.fit(Xtrain, ytrain)

# Predicting the Test set results
y_pred1 = RF4.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [13]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, RF4.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, RF4.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, RF4.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, RF4.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, RF4.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, RF4.predict(Xtest)))
y_pred_proba = RF4.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred1)
auc = metrics.roc_auc_score(ytest, y_pred1)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.9877
Precision Training data:   0.6504
----------------------------------------------------------------------
Recall Test data:          0.9524
Precision Test data:       0.7692
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 1 12]
 [ 2 40]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.33      0.08      0.12        13
           1       0.77      0.95      0.85        42

    accuracy                           0.75        55
   macro avg       0.55      0.51      0.49        55
weighted avg       0.67      0.75      0.68        55

auc 0.5146520146520146

We look for the two best principal components for the model

In [14]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
In [15]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)

PCA algorithm

In [16]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)
In [17]:
from sklearn.ensemble import RandomForestClassifier

RF2 = RandomForestClassifier(max_depth=2, random_state=0)
RF2.fit(Xtrain, ytrain)

# Predicting the Test set results
y_pred2 = RF2.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [18]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, RF2.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, RF2.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, RF2.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, RF2.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, RF2.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, RF2.predict(Xtest)))
y_pred_proba = RF2.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred2)
auc = metrics.roc_auc_score(ytest, y_pred2)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.9383
Precision Training data:   0.6972
----------------------------------------------------------------------
Recall Test data:          0.9524
Precision Test data:       0.8
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 3 10]
 [ 2 40]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.60      0.23      0.33        13
           1       0.80      0.95      0.87        42

    accuracy                           0.78        55
   macro avg       0.70      0.59      0.60        55
weighted avg       0.75      0.78      0.74        55

auc 0.5915750915750915

We look for the three best principal components for the model

In [19]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
In [20]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
Xtrain = sc.fit_transform(Xtrain)
Xtest = sc.transform(Xtest)
In [21]:
#### PCA algorithm
In [22]:
from sklearn.decomposition import PCA

pca = PCA(n_components=3)
Xtrain = pca.fit_transform(Xtrain)
Xtest = pca.transform(Xtest)
In [23]:
from sklearn.ensemble import RandomForestClassifier

RF3 = RandomForestClassifier(max_depth=2, random_state=0)
RF3.fit(Xtrain, ytrain)

# Predicting the Test set results
y_pred = RF3.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
In [24]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, RF3.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, RF3.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, RF3.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, RF3.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, RF3.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, RF3.predict(Xtest)))
y_pred_proba = RF3.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred)
auc = metrics.roc_auc_score(ytest, y_pred)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.9136
Precision Training data:   0.7115
----------------------------------------------------------------------
Recall Test data:          0.9048
Precision Test data:       0.8444
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 6  7]
 [ 4 38]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.60      0.46      0.52        13
           1       0.84      0.90      0.87        42

    accuracy                           0.80        55
   macro avg       0.72      0.68      0.70        55
weighted avg       0.79      0.80      0.79        55

auc 0.6831501831501832
In [25]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(Xtrain)
In [26]:
X.columns
Out[26]:
Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin',
       'Embarked'],
      dtype='object')

Feature Selection Techniques – Random Forest Classifier https://sigmaquality.pl/uncategorized/part-1-how-to-improve-the-classification-model-rfc-feature_importances_200320200724/ Fri, 20 Mar 2020 06:29:27 +0000

200320200724

In [1]:

import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/kaggletrain.csv')
df = df.dropna(how='any')
df.dtypes
Out[1]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [2]:
df.columns
Out[2]:
Index(['Unnamed: 0', 'PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age',
       'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
In [3]:
df.head(3)
Out[3]:
Unnamed: 0 PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
3 3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
6 6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
In [4]:
df.dtypes
Out[4]:
Unnamed: 0       int64
PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object
In [5]:
df['Sex'] = pd.Categorical(df.Sex).codes
df['Ticket'] = pd.Categorical(df.Ticket).codes
df['Cabin'] = pd.Categorical(df.Ticket).codes   # note: encodes Ticket, not Cabin; likely a copy-paste slip, kept as in the original run
df['Embarked'] = pd.Categorical(df.Embarked).codes
In [6]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Pclass', 'Sex', 'Age','SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)

Simple classification model: logistic regression

In [7]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain, ytrain) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [8]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))

y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')

print('auc',auc)
plt.show()
Recall Training data:      0.7901
Precision Training data:   0.8205
----------------------------------------------------------------------
Recall Test data:          0.8333
Precision Test data:       0.875
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 8  5]
 [ 7 35]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.53      0.62      0.57        13
           1       0.88      0.83      0.85        42

    accuracy                           0.78        55
   macro avg       0.70      0.72      0.71        55
weighted avg       0.79      0.78      0.79        55

auc 0.7243589743589745

The model that uses all independent variables achieves AUC = 0.72.

We check which independent variables are the best

In [9]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X, y)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/ensemble/forest.py:245: FutureWarning: The default value of n_estimators will change from 10 in version 0.20 to 100 in 0.22.
  "10 in version 0.20 to 100 in 0.22.", FutureWarning)
Out[9]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
In [10]:
importance = rfc.feature_importances_
In [11]:
importance = np.round(importance, decimals=3)
importance
Out[11]:
array([0.009, 0.195, 0.254, 0.026, 0.029, 0.114, 0.212, 0.139, 0.022])

We sort variables by importance

This is a very useful function when there are many variables.

In [12]:
KOT = dict(zip(X, importance))
KOT_sorted_keys = sorted(KOT, key=KOT.get, reverse=True)

for r in KOT_sorted_keys:
    print (r, KOT[r])
Age 0.254
Fare 0.212
Sex 0.195
Cabin 0.139
Ticket 0.114
Parch 0.029
SibSp 0.026
Embarked 0.022
Pclass 0.009

Note: the variables with the highest scores are the most important to the classifier.
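
As a hedged aside: impurity-based feature_importances_ can overstate high-cardinality features such as Ticket. A sketch of permutation importance as a cross-check (this assumes a newer scikit-learn, 0.22+, which provides sklearn.inspection.permutation_importance; the version used in this post predates it):

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure the drop in score;
# rfc, X, y are the fitted forest and data from the cells above
result = permutation_importance(rfc, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(X.columns, result.importances_mean),
                        key=lambda t: t[1], reverse=True):
    print(name, round(imp, 3))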

We now use only the most relevant variables: 'Sex', 'Age', 'Fare', 'Ticket'

In [13]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Age', 'Sex', 'Fare','Ticket']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)

Simple classification model: logistic regression

In [14]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain, ytrain) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [15]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.7901
Precision Training data:   0.8649
----------------------------------------------------------------------
Recall Test data:          0.7857
Precision Test data:       0.9167
----------------------------------------------------------------------
Confusion Matrix Test data
[[10  3]
 [ 9 33]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.53      0.77      0.62        13
           1       0.92      0.79      0.85        42

    accuracy                           0.78        55
   macro avg       0.72      0.78      0.74        55
weighted avg       0.82      0.78      0.79        55

auc 0.7774725274725274

Including the 'Pclass' variable degraded the model!

It is widely known that first-class passengers had a better chance of being saved.
The feature_importances_ analysis, however, showed that 'Pclass' carries little weight in the classification process.
Adding this variable to the classification model lowered the AUC.

In [16]:
import numpy as np
from sklearn.model_selection import train_test_split  

X = df[[ 'Age', 'Sex', 'Fare','Ticket','Pclass']]
y = df['Survived']

Xtrain, Xtest, ytrain, ytest = train_test_split(X,y,test_size = 0.3, random_state = 0)
In [17]:
from sklearn import model_selection
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = -1, cv=2)

LR_Grid.fit(Xtrain, ytrain) 
y_pred_LRC = LR_Grid.predict(Xtest)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_search.py:814: DeprecationWarning: The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.
  DeprecationWarning)
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
In [18]:
# model assessment
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report

print("Recall Training data:     ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("Precision Training data:  ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data:         ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=4)) 
print("Precision Test data:      ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest,  y_pred_LRC)
auc = metrics.roc_auc_score(ytest, y_pred_LRC)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
print('auc',auc)
Recall Training data:      0.8025
Precision Training data:   0.8553
----------------------------------------------------------------------
Recall Test data:          0.8571
Precision Test data:       0.9
----------------------------------------------------------------------
Confusion Matrix Test data
[[ 9  4]
 [ 6 36]]
----------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.60      0.69      0.64        13
           1       0.90      0.86      0.88        42

    accuracy                           0.82        55
   macro avg       0.75      0.77      0.76        55
weighted avg       0.83      0.82      0.82        55

auc 0.7747252747252747

Part_7 Stroke_Prediction – Model Sieci neuronowych PyTorch Technika Osadzania (a PyTorch neural network model with the embedding technique) https://sigmaquality.pl/uncategorized/part_7-stroke_prediction-model-sieci-neuronowych-pytorch-technika-osadzania/ Mon, 09 Mar 2020 09:13:47 +0000
In [1]:

import time
start_time = time.time() ## timing: start of the time measurement
print(time.ctime())
Mon Mar  9 09:36:05 2020
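
The timer started above is presumably read out at the end of the notebook; a minimal sketch of such a closing cell (hypothetical, not shown in this excerpt):

## timing: report how long the whole notebook took
print('Elapsed: %.1f s' % (time.time() - start_time))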
In [2]:
import torch
import torch.nn as nn
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Importing the data

In [3]:
import pandas as pd

df = pd.read_csv('/home/wojciech/Pulpit/1/Stroke_Prediction_CLEAR.csv')
df.head(3)
Out[3]:
Unnamed: 0 ID Gender Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years Age_years_10
0 1 30650 Male 1 0 Yes Private Urban 87.96 39.2 never smoked 0 58.093151 (53.126, 59.076]
1 3 57008 Female 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0 70.076712 (65.121, 74.11]
2 6 53725 Female 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0 52.041096 (48.082, 53.126]
In [4]:
df.shape
Out[4]:
(29062, 14)
In [5]:
df.Stroke.value_counts().plot(kind='pie', autopct='%1.1f%%')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f07b92ee390>

Organizing the columns into categorical and continuous data

In [6]:
df.columns
Out[6]:
Index(['Unnamed: 0', 'ID', 'Gender', 'Hypertension', 'Heart_Disease',
       'Ever_Married', 'Type_Of_Work', 'Residence', 'Avg_Glucose', 'BMI',
       'Smoking_Status', 'Stroke', 'Age_years', 'Age_years_10'],
      dtype='object')
In [7]:
df.Type_Of_Work
Out[7]:
0              Private
1              Private
2              Private
3        Self-employed
4              Private
             ...      
29057         children
29058         Govt_job
29059          Private
29060          Private
29061          Private
Name: Type_Of_Work, Length: 29062, dtype: object
In [8]:
categorical_columns = ['Gender','Hypertension', 'Heart_Disease','Ever_Married','Type_Of_Work','Residence','Smoking_Status','Age_years_10']
numerical_columns = ['Avg_Glucose', 'BMI', 'Age_years']

We designate the 'Stroke' column as the outcome variable

In [9]:
outputs = ['Stroke']

Encoding the text variables

In [10]:
df.dtypes
Out[10]:
Unnamed: 0          int64
ID                  int64
Gender             object
Hypertension        int64
Heart_Disease       int64
Ever_Married       object
Type_Of_Work       object
Residence          object
Avg_Glucose       float64
BMI               float64
Smoking_Status     object
Stroke              int64
Age_years         float64
Age_years_10       object
dtype: object

We need to convert the types of the categorical columns to category. We can do this with the astype() function, as shown below:

Introducing the new data type: 'category'

In [11]:
for category in categorical_columns:
    df[category] = df[category].astype('category')
In [12]:
df.dtypes
Out[12]:
Unnamed: 0           int64
ID                   int64
Gender            category
Hypertension      category
Heart_Disease     category
Ever_Married      category
Type_Of_Work      category
Residence         category
Avg_Glucose        float64
BMI                float64
Smoking_Status    category
Stroke               int64
Age_years          float64
Age_years_10      category
dtype: object
In [13]:
df['Residence'].cat.categories
Out[13]:
Index(['Rural', 'Urban'], dtype='object')
In [14]:
df['Ever_Married'].cat.categories
Out[14]:
Index(['No', 'Yes'], dtype='object')
In [15]:
df['Age_years_10'].cat.categories
Out[15]:
Index(['(22.041, 29.055]', '(29.055, 36.058]', '(36.058, 42.132]',
       '(42.132, 48.082]', '(48.082, 53.126]', '(53.126, 59.076]',
       '(59.076, 65.121]', '(65.121, 74.11]', '(74.11, 82.137]',
       '(9.999, 22.041]'],
      dtype='object')

Encoding the data

In [16]:
df.dtypes
Out[16]:
Unnamed: 0           int64
ID                   int64
Gender            category
Hypertension      category
Heart_Disease     category
Ever_Married      category
Type_Of_Work      category
Residence         category
Avg_Glucose        float64
BMI                float64
Smoking_Status    category
Stroke               int64
Age_years          float64
Age_years_10      category
dtype: object

Why did we encode the data in this form?

The main reason for separating the categorical columns from the numerical columns is that values in a numerical column can be fed directly into a neural network, whereas values in categorical columns must first be converted to numeric types.

In [17]:
categorical_columns
Out[17]:
['Gender',
 'Hypertension',
 'Heart_Disease',
 'Ever_Married',
 'Type_Of_Work',
 'Residence',
 'Smoking_Status',
 'Age_years_10']

Converting the categorical variables to a NumPy matrix

In [18]:
p1 = df['Gender'].cat.codes.values
p2 = df['Hypertension'].cat.codes.values
p3 = df['Heart_Disease'].cat.codes.values
p4 = df['Ever_Married'].cat.codes.values
p5 = df['Type_Of_Work'].cat.codes.values
p6 = df['Residence'].cat.codes.values
p7 = df['Smoking_Status'].cat.codes.values
p8 = df['Age_years_10'].cat.codes.values

NumP_matrix = np.stack([p1, p2, p3, p4, p5, p6, p7, p8], 1)

NumP_matrix[:10]
Out[18]:
array([[1, 1, 0, 1, 2, 1, 1, 5],
       [0, 0, 0, 1, 2, 0, 0, 7],
       [0, 0, 0, 1, 2, 1, 0, 4],
       [0, 0, 1, 1, 3, 0, 1, 8],
       [0, 0, 0, 1, 2, 0, 2, 1],
       [0, 1, 0, 1, 3, 1, 1, 7],
       [1, 0, 1, 1, 2, 1, 0, 8],
       [0, 0, 0, 1, 2, 0, 1, 2],
       [0, 0, 0, 1, 2, 0, 0, 2],
       [0, 0, 0, 1, 2, 0, 1, 2]], dtype=int8)

Creating a PyTorch tensor from the NumPy matrix

In [19]:
categorical_data = torch.tensor(NumP_matrix, dtype=torch.int64)
categorical_data[:10]
Out[19]:
tensor([[1, 1, 0, 1, 2, 1, 1, 5],
        [0, 0, 0, 1, 2, 0, 0, 7],
        [0, 0, 0, 1, 2, 1, 0, 4],
        [0, 0, 1, 1, 3, 0, 1, 8],
        [0, 0, 0, 1, 2, 0, 2, 1],
        [0, 1, 0, 1, 3, 1, 1, 7],
        [1, 0, 1, 1, 2, 1, 0, 8],
        [0, 0, 0, 1, 2, 0, 1, 2],
        [0, 0, 0, 1, 2, 0, 0, 2],
        [0, 0, 0, 1, 2, 0, 1, 2]])

Converting the DataFrame's numerical columns to a PyTorch tensor

In [20]:
numerical_data = np.stack([df[col].values for col in numerical_columns], 1)
numerical_data = torch.tensor(numerical_data, dtype=torch.float)
numerical_data[:5]
Out[20]:
tensor([[ 87.9600,  39.2000,  58.0932],
        [ 69.0400,  35.9000,  70.0767],
        [ 77.5900,  17.7000,  52.0411],
        [243.5300,  27.0000,  75.1041],
        [ 77.6700,  32.3000,  32.0247]])

Converting the outcome variable to a PyTorch tensor

In [21]:
outputs = torch.tensor(df[outputs].values).flatten()
outputs[:5]
Out[21]:
tensor([0, 0, 0, 0, 0])

Let's summarize the tensors

In [22]:
print('categorical_data: ',categorical_data.shape)
print('numerical_data:   ',numerical_data.shape)
print('outputs:          ',outputs.shape)
categorical_data:  torch.Size([29062, 8])
numerical_data:    torch.Size([29062, 3])
outputs:           torch.Size([29062])

EMBEDDING

We have converted our categorical columns into numeric ones, where each unique value is represented by a single integer (digitization — e.g., a smoker becomes 1). A model can be trained on such a column, but there is a better way…

The better way is to represent each value of a categorical column as an N-dimensional vector instead of a single integer. This process is called embedding. A vector can capture more information and can express relationships between different categorical values in a more suitable way. We will therefore represent the values of the categorical columns as N-dimensional vectors.

We need to define an embedding size (the vector dimension) for every categorical column. There is no hard-and-fast rule for the number of dimensions. A good rule of thumb is to divide the number of unique values in the column by 2 (but use no more than 50). For example, the 'Smoking_Status' column has 3 unique values, so a suitable embedding size is 3/2 = 1.5, rounded up to 2.
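
To make the idea concrete, here is a tiny sketch (illustrative values only; in the real model the vectors are learned during training) of how nn.Embedding maps integer codes to vectors:

import torch
import torch.nn as nn

# Embed the 3 levels of 'Smoking_Status' into 2-dimensional vectors.
emb = nn.Embedding(num_embeddings=3, embedding_dim=2)
codes = torch.tensor([0, 1, 2])   # integer codes of the three categories
print(emb(codes))                 # each code becomes a 2-dimensional vector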

The script below creates a list of tuples containing the number of unique values and the embedding dimension for each categorical column.

The rule is simple: the embedding matrix must always have more rows than the range of the integer codes; that is why I added col_size + 2, which is a generous safety margin.

In [23]:
categorical_column_sizes = [len(df[column].cat.categories) for column in categorical_columns]
categorical_embedding_sizes = [(col_size+2, min(50, (col_size+5)//2)) for col_size in categorical_column_sizes]
print(categorical_embedding_sizes)
[(4, 3), (4, 3), (4, 3), (4, 3), (7, 5), (4, 3), (5, 4), (12, 7)]

Splitting the dataset into training and test sets

In [24]:
total_records = df['ID'].count()
test_records = int(total_records * .2)

categorical_train_data = categorical_data[:total_records-test_records]
categorical_test_data = categorical_data[total_records-test_records:total_records]
numerical_train_data = numerical_data[:total_records-test_records]
numerical_test_data = numerical_data[total_records-test_records:total_records]
train_outputs = outputs[:total_records-test_records]
test_outputs = outputs[total_records-test_records:total_records]

To verify that the data were split into training and test sets correctly, let's print the lengths of the training and test records:

In [25]:
print('categorical_train_data: ',categorical_train_data.shape)
print('numerical_train_data:   ',numerical_train_data.shape)
print('train_outputs:          ', train_outputs.shape)
print('----------------------------------------------------')
print('categorical_test_data:  ',categorical_test_data.shape)
print('numerical_test_data:    ',numerical_test_data.shape)
print('test_outputs:           ',test_outputs.shape)
categorical_train_data:  torch.Size([23250, 8])
numerical_train_data:    torch.Size([23250, 3])
train_outputs:           torch.Size([23250])
----------------------------------------------------
categorical_test_data:   torch.Size([5812, 8])
numerical_test_data:     torch.Size([5812, 3])
test_outputs:            torch.Size([5812])
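
Note that this split is sequential: the last 20% of rows become the test set. If the rows carried any ordering, a shuffled split would be safer — a sketch under that assumption:

# Shuffled split sketch: draw a random permutation of the row indices.
perm = torch.randperm(int(total_records))
train_idx, test_idx = perm[test_records:], perm[:test_records]

categorical_train_data = categorical_data[train_idx]
categorical_test_data = categorical_data[test_idx]
numerical_train_data = numerical_data[train_idx]
numerical_test_data = numerical_data[test_idx]
train_outputs = outputs[train_idx]
test_outputs = outputs[test_idx]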

Building the PyTorch classification model

In [26]:
class Model(nn.Module):

    def __init__(self, embedding_size, num_numerical_cols, output_size, layers, p=0.4):
        super().__init__()
        # One embedding layer per categorical column: ni rows, nf dimensions.
        self.all_embeddings = nn.ModuleList([nn.Embedding(ni, nf) for ni, nf in embedding_size])
        self.embedding_dropout = nn.Dropout(p)
        # Batch-normalize the raw numerical features.
        self.batch_norm_num = nn.BatchNorm1d(num_numerical_cols)

        all_layers = []
        # Total width of all concatenated embedding vectors.
        num_categorical_cols = sum((nf for ni, nf in embedding_size))
        input_size = num_categorical_cols + num_numerical_cols

        # Hidden blocks: Linear -> ReLU -> BatchNorm -> Dropout.
        for i in layers:
            all_layers.append(nn.Linear(input_size, i))
            all_layers.append(nn.ReLU(inplace=True))
            all_layers.append(nn.BatchNorm1d(i))
            all_layers.append(nn.Dropout(p))
            input_size = i

        # Output layer: one logit per class.
        all_layers.append(nn.Linear(layers[-1], output_size))

        self.layers = nn.Sequential(*all_layers)

    def forward(self, x_categorical, x_numerical):
        # Look up the embedding vector for each categorical column.
        embeddings = []
        for i, e in enumerate(self.all_embeddings):
            embeddings.append(e(x_categorical[:, i]))
        x = torch.cat(embeddings, 1)
        x = self.embedding_dropout(x)

        # Concatenate the normalized numerical features with the embeddings.
        x_numerical = self.batch_norm_num(x_numerical)
        x = torch.cat([x, x_numerical], 1)
        x = self.layers(x)
        return x
In [27]:
print('categorical_embedding_sizes:  ',categorical_embedding_sizes)
print(numerical_data.shape[1])
categorical_embedding_sizes:   [(4, 3), (4, 3), (4, 3), (4, 3), (7, 5), (4, 3), (5, 4), (12, 7)]
3
In [28]:
model = Model(categorical_embedding_sizes, numerical_data.shape[1], 2, [200,100,50], p=0.4)
In [29]:
print(model)
Model(
  (all_embeddings): ModuleList(
    (0): Embedding(4, 3)
    (1): Embedding(4, 3)
    (2): Embedding(4, 3)
    (3): Embedding(4, 3)
    (4): Embedding(7, 5)
    (5): Embedding(4, 3)
    (6): Embedding(5, 4)
    (7): Embedding(12, 7)
  )
  (embedding_dropout): Dropout(p=0.4, inplace=False)
  (batch_norm_num): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=34, out_features=200, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4, inplace=False)
    (8): Linear(in_features=100, out_features=50, bias=True)
    (9): ReLU(inplace=True)
    (10): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): Dropout(p=0.4, inplace=False)
    (12): Linear(in_features=50, out_features=2, bias=True)
  )
)
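
Note where in_features=34 of the first linear layer comes from: the embedding dimensions sum to 3+3+3+3+5+3+4+7 = 31, and adding the 3 numerical columns gives 34 inputs.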

Defining the loss function

In [30]:
#loss_function = torch.nn.MSELoss(reduction='sum')
loss_function = nn.CrossEntropyLoss()
#loss_function = nn.BCEWithLogitsLoss()
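
A quick illustration of the interface (a sketch with made-up numbers): nn.CrossEntropyLoss applies log-softmax internally, so it expects raw logits of shape (batch, classes) and integer class labels of shape (batch,):

# Illustrative values only, not data from this notebook.
demo_logits = torch.tensor([[2.9, -2.7],    # strongly favours class 0
                            [-0.5,  0.5]])  # mildly favours class 1
demo_labels = torch.tensor([0, 1])
print(nn.CrossEntropyLoss()(demo_logits, demo_labels))  # small loss: both correct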

Defining the optimizer

In [31]:
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
#optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
#optimizer = torch.optim.Rprop(model.parameters(), lr=0.001, etas=(0.5, 1.2), step_sizes=(1e-06, 50))
In [32]:
print('categorical_embedding_sizes:  ',categorical_embedding_sizes)
print(numerical_data.shape[1])
print('categorical_train_data: ',categorical_train_data.shape)
print('numerical_train_data:   ',numerical_train_data.shape)
print('outputs:                ',train_outputs.shape)
categorical_embedding_sizes:   [(4, 3), (4, 3), (4, 3), (4, 3), (7, 5), (4, 3), (5, 4), (12, 7)]
3
categorical_train_data:  torch.Size([23250, 8])
numerical_train_data:    torch.Size([23250, 3])
outputs:                 torch.Size([23250])
In [33]:
y_pred = model(categorical_train_data, numerical_train_data)
In [34]:
epochs = 300
aggregated_losses = []

for i in range(epochs):
    i += 1
    y_pred = model(categorical_train_data, numerical_train_data)

    single_loss = loss_function(y_pred, train_outputs)
    aggregated_losses.append(single_loss.item())  # store the scalar loss for plotting

    if i % 30 == 1:
        print(f'epoch: {i:3} loss: {single_loss.item():10.8f}')

    optimizer.zero_grad()
    single_loss.backward()
    optimizer.step()

print(f'epoch: {i:3} loss: {single_loss.item():10.10f}')
epoch:   1 loss: 0.78628689
epoch:  31 loss: 0.61815375
epoch:  61 loss: 0.54198617
epoch:  91 loss: 0.42037284
epoch: 121 loss: 0.28172502
epoch: 151 loss: 0.16479240
epoch: 181 loss: 0.11258653
epoch: 211 loss: 0.10971391
epoch: 241 loss: 0.09200522
epoch: 271 loss: 0.08967553
epoch: 300 loss: 0.0847498849
In [35]:
plt.plot(range(epochs), aggregated_losses)
plt.ylabel('Loss')
plt.xlabel('epoch');

Predicting with the model

In [36]:
with torch.no_grad():
    y_val_train = model(categorical_train_data, numerical_train_data)
    loss = loss_function( y_val_train, train_outputs)
print(f'Loss train_set: {loss:.8f}')
Loss train_set: 0.08532631
In [37]:
with torch.no_grad():
    y_val = model(categorical_test_data, numerical_test_data)
    loss = loss_function(y_val, test_outputs)
print(f'Loss: {loss:.8f}')
Loss: 0.10788266

Since we defined the output layer with 2 neurons, each prediction contains 2 values. For example, the first 5 predicted values look like this:

In [38]:
print(y_val[:5])
tensor([[ 2.9134, -2.6896],
        [ 1.9811, -1.8706],
        [ 1.5829, -1.3088],
        [ 3.2281, -2.7011],
        [ 2.3756, -1.8374]])

The idea behind these predictions is that if the true outcome is 0, the value at index 0 should be higher than the value at index 1, and vice versa. We can retrieve the index of the largest value with the following script:

In [39]:
y_val = torch.argmax(y_val, dim=1)

The call above returns the indices of the maximum values along the class axis.
In [41]:
print(y_val[:195])
tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0])

Because in the list of raw predictions the values at index 0 are larger than the values at index 1 for the first five records, the first five rows of the processed output contain 0.

In [42]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(test_outputs,y_val))
print(classification_report(test_outputs,y_val))
print(accuracy_score(test_outputs, y_val))
[[5689    0]
 [ 123    0]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5689
           1       0.00      0.00      0.00       123

    accuracy                           0.98      5812
   macro avg       0.49      0.50      0.49      5812
weighted avg       0.96      0.98      0.97      5812

0.9788368891947694
/home/wojciech/anaconda3/lib/python3.7/site-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)

The model detects stroke poorly: it never predicts the positive class, a direct consequence of the severe class imbalance (strokes make up only about 2% of the records).
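
One common remedy, not explored in this post, is to re-weight the loss so that the rare 'stroke' class counts more — a sketch:

# Sketch: weight each class inversely to its frequency in the training labels,
# then retrain the model with the weighted criterion.
counts = torch.bincount(train_outputs)                 # [n_class0, n_class1]
class_weights = counts.sum() / (2.0 * counts.float())  # rarer class -> larger weight
loss_function = nn.CrossEntropyLoss(weight=class_weights)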

Saving the entire model

In [43]:
torch.save(model,'/home/wojciech/Pulpit/3/byk.pb')
/home/wojciech/anaconda3/lib/python3.7/site-packages/torch/serialization.py:360: UserWarning: Couldn't retrieve source code for container of type Model. It won't be checked for correctness upon loading.
  "type " + obj.__name__ + ". It won't be checked "

Restoring the entire model

In [44]:
KOT = torch.load('/home/wojciech/Pulpit/3/byk.pb')
KOT.eval()
Out[44]:
Model(
  (all_embeddings): ModuleList(
    (0): Embedding(4, 3)
    (1): Embedding(4, 3)
    (2): Embedding(4, 3)
    (3): Embedding(4, 3)
    (4): Embedding(7, 5)
    (5): Embedding(4, 3)
    (6): Embedding(5, 4)
    (7): Embedding(12, 7)
  )
  (embedding_dropout): Dropout(p=0.4, inplace=False)
  (batch_norm_num): BatchNorm1d(3, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): Linear(in_features=34, out_features=200, bias=True)
    (1): ReLU(inplace=True)
    (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Dropout(p=0.4, inplace=False)
    (4): Linear(in_features=200, out_features=100, bias=True)
    (5): ReLU(inplace=True)
    (6): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): Dropout(p=0.4, inplace=False)
    (8): Linear(in_features=100, out_features=50, bias=True)
    (9): ReLU(inplace=True)
    (10): BatchNorm1d(50, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (11): Dropout(p=0.4, inplace=False)
    (12): Linear(in_features=50, out_features=2, bias=True)
  )
)

By substituting other independent variables we can obtain a vector of output predictions

In [45]:
A = categorical_train_data[::50]
A
Out[45]:
tensor([[1, 1, 0,  ..., 1, 1, 5],
        [0, 1, 0,  ..., 0, 1, 5],
        [1, 0, 0,  ..., 1, 1, 1],
        ...,
        [0, 0, 0,  ..., 1, 1, 9],
        [0, 0, 0,  ..., 1, 1, 1],
        [1, 0, 0,  ..., 1, 1, 9]])
In [46]:
B = numerical_train_data[::50]
B
Out[46]:
tensor([[ 87.9600,  39.2000,  58.0932],
        [235.8500,  40.1000,  57.0329],
        [ 80.8100,  33.2000,  34.1260],
        ...,
        [139.9900,  26.8000,  20.0493],
        [ 61.3100,  33.1000,  29.0575],
        [ 92.2200,  35.0000,  21.0603]])
In [47]:
y =train_outputs[::50]
In [51]:
y_pred_AB = KOT(A, B)
y_pred_AB[:10]
Out[51]:
tensor([[ 2.0236, -1.7279],
        [ 1.7250, -1.4366],
        [ 2.3650, -1.8698],
        [ 1.7376, -1.3888],
        [ 2.2012, -1.8784],
        [ 2.3820, -2.0166],
        [ 2.4481, -2.0788],
        [ 2.2441, -1.9123],
        [ 2.2570, -1.9358],
        [ 2.4153, -2.0513]], grad_fn=<SliceBackward>)
In [52]:
with torch.no_grad():
    y_val_AB = KOT(A,B)
    loss = loss_function( y_val_AB, y)
print(f'Loss train_set: {loss:.8f}')
Loss train_set: 0.10623065
In [53]:
y_val = torch.argmax(y_val_AB, dim=1)
In [54]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(confusion_matrix(y,y_val))
print(classification_report(y,y_val))
print(accuracy_score(y, y_val))
[[453   0]
 [ 12   0]]
              precision    recall  f1-score   support

           0       0.97      1.00      0.99       453
           1       0.00      0.00      0.00        12

    accuracy                           0.97       465
   macro avg       0.49      0.50      0.49       465
weighted avg       0.95      0.97      0.96       465

0.9741935483870968
In [56]:
print('Execution time of this task:')
print(time.time() - start_time)  # end of timing
Execution time of this task:
1122.9189233779907

The article Part_7 Stroke_Prediction – Model Sieci neuronowych PyTorch Technika Osadzania originally appeared on THE DATA SCIENCE LIBRARY.

Part_1 Stroke_Prediction – Preparation of data for analysis https://sigmaquality.pl/uncategorized/part_1-stroke_prediction-preparation-of-data-for-analysis/ Fri, 06 Mar 2020 07:14:35 +0000

In [1]:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt


df= pd.read_csv('c:/1/Stroke_Prediction.csv')
df.head(5)
Out[1]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke
0 31153 Male 1104.0 0 0 No children Rural 95.12 18.0 NaN 0
1 30650 Male 21204.0 1 0 Yes Private Urban 87.96 39.2 never smoked 0
2 17412 Female 2928.0 0 0 No Private Urban 110.89 17.6 NaN 0
3 57008 Female 25578.0 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0
4 46657 Male 5128.0 0 0 No Never_worked Rural 161.28 19.1 NaN 0

1. Checking data completeness and formats

In [2]:
df.isnull().sum()
Out[2]:
ID                    0
Gender                0
Age_In_Days           0
Hypertension          0
Heart_Disease         0
Ever_Married          0
Type_Of_Work          0
Residence             0
Avg_Glucose           0
BMI                1462
Smoking_Status    13292
Stroke                0
dtype: int64

Data are missing for:

- BMI
- Smoking_Status

Structure of the missing values: BMI and Smoking_Status
In [3]:
import seaborn as sns

print('observations, variables: ', df.shape)
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')
observations, variables:  (43400, 12)
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x223eb4aa208>

Analysis of BMI (Body Mass Index)

  • BMI < 18.5 → underweight
  • 18.5 ≤ BMI ≤ 24.9 → normal weight
  • 25 ≤ BMI ≤ 29.9 → overweight
  • BMI ≥ 30 → obesity
In [4]:
a = r'$BMI = \frac{mass}{height^{2}}$'
ax = plt.axes([0, 0, 0.3, 0.3])  # left, bottom, width, height
ax.set_xticks([])
ax.set_yticks([])
ax.axis('off')
plt.text(0.4, 0.4, a)
Source:  https://www.poradnikzdrowie.pl/sprawdz-sie/kalkulatory/kalkulator-wagi-bmi-aa-4Q8M-4h3E-dtKD.html

Checking for invalid values in the BMI (Body Mass Index) column:

  • Maximum plausible BMI: assume people are at least 100 cm tall and weigh at most 400 kg
  • Minimum plausible BMI: assume people are at most 220 cm tall and weigh at least 30 kg
In [5]:
max_BMI=400/(1*1)
min_BMI=30/(2.20*2.20)

print('max_BMI: ', max_BMI)
print('min_BMI: ', min_BMI)
max_BMI:  400.0
min_BMI:  6.198347107438016
In [6]:
df[(df['BMI'] <= 10) | (df['BMI'] >= 300)]  # OR, not AND: a value violating either bound
Out[6]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke

No corrupted values in the BMI column.

Let's check the structure of the data.

In [7]:
BMI1 = pd.qcut(df['BMI'],12)
BMI1.value_counts(dropna = False).sort_values(ascending=False).plot(kind='bar')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x223eb8396a0>
In [8]:
#df.BMI.value_counts(dropna = False)
In [9]:
import matplotlib.dates as mdates

fig, ax = plt.subplots()
df['BMI'].plot.kde(ax=ax, legend=False, title='Histogram: BMI')
df['BMI'].plot.hist(density=True, ax=ax)

ax.set_ylabel('Probability')
ax.grid(axis='y')
#ax.set_facecolor('#d8dcd6')
In my assessment, the missing BMI values cannot be reconstructed. The records with missing BMI should therefore be deleted.
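
An alternative, not used in this analysis, would be simple imputation instead of dropping rows — a minimal sketch:

# Hypothetical alternative: fill missing BMI with the column median.
df['BMI'] = df['BMI'].fillna(df['BMI'].median())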

Analysis of Smoking_Status

In [10]:
df['Smoking_Status'].value_counts(dropna = False)
Out[10]:
never smoked       16053
NaN                13292
formerly smoked     7493
smokes              6562
Name: Smoking_Status, dtype: int64

Similarly, there is no way to fill in the missing values of the independent variable Smoking_Status based on the behavior of the other independent variables.

This variable must have exactly three states:

  • never smoked,
  • formerly smoked,
  • smokes.

Leaving NaN as a fourth state would be a mistake.

In [11]:
df['Smoking_Status'].value_counts(normalize=True,dropna = False)
Out[11]:
never smoked       0.369885
NaN                0.306267
formerly smoked    0.172650
smokes             0.151198
Name: Smoking_Status, dtype: float64

How important are 'Smoking_Status' and 'BMI' for the outcome variable? I check this because entire variables could potentially be eliminated. Along the way we will also examine the correlations of the remaining exogenous variables with the endogenous variable.

In [12]:
df['Ss_nowa'] = pd.Categorical(df['Smoking_Status']).codes

CORREL = df.corr().sort_values('Stroke')
print(CORREL['Stroke'])
del df['Ss_nowa']
ID               0.003067
Ss_nowa          0.019140
BMI              0.020285
Hypertension     0.075332
Avg_Glucose      0.078917
Heart_Disease    0.113763
Age_In_Days      0.153703
Stroke           1.000000
Name: Stroke, dtype: float64
In my assessment, the missing 'Smoking_Status' values cannot be reconstructed either. The records with missing 'Smoking_Status' must therefore be deleted, even though they account for as much as 31% of the dataset.

Deleting all records with missing 'Smoking_Status' or 'BMI'

In [13]:
print('Before deletion: ', df.shape)
df = df.dropna(how='any')
print('After deletion:  ', df.shape)
Before deletion:  (43400, 12)
After deletion:   (29072, 12)
In [14]:
df.head(5)
Out[14]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke
1 30650 Male 21204.0 1 0 Yes Private Urban 87.96 39.2 never smoked 0
3 57008 Female 25578.0 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0
6 53725 Female 18995.0 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0
7 41553 Female 27413.0 0 1 Yes Self-employed Rural 243.53 27.0 never smoked 0
8 16167 Female 11689.0 0 0 Yes Private Rural 77.67 32.3 smokes 0

Analysis of Gender

In [15]:
df['Gender'].value_counts(normalize=True,dropna = False)
Out[15]:
Female    0.614062
Male      0.385698
Other     0.000241
Name: Gender, dtype: float64

What does gender 'Other' mean? The study examines susceptibility to stroke with respect to, among other things, biological sex; psychological gender is irrelevant here. I treat 'Other' as a data error and delete it — keeping a third state 'Other' would be a mistake for the classification process.

In [16]:
df['Gender'] = df['Gender'].replace('Other', np.nan)
print('Before deletion: ', df.shape)
df = df.dropna(how='any')
print('After deletion:  ', df.shape)
Before deletion:  (29072, 12)
After deletion:   (29065, 12)

Analysis of Age_In_Days

Human age is analyzed in years. At the same time, people age unevenly, so it is advisable to analyze patients by age group.

In [17]:
df['Age_years']= df['Age_In_Days']/365

Checking whether the 'Age_years' variable contains valid values. It turns out that three patients are over 200 years old. We delete these records.

In [18]:
df[df['Age_years']>120]
Out[18]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years
1342 58414 Female 85451.0 0 0 No Private Rural 65.30 22.1 smokes 0 234.112329
18177 31212 Female 117179.0 0 0 Yes Govt_job Rural 84.39 38.9 never smoked 0 321.038356
26716 70730 Female 79231.0 0 0 No Private Rural 77.62 23.1 formerly smoked 0 217.071233
In [19]:
df['Age_years'] = df['Age_years'].apply(lambda x: np.nan if x > 120 else x)
print('Before deletion: ', df.shape)
df = df.dropna(how='any')
print('After deletion:  ', df.shape)
df[df['Age_years']>120]
Before deletion:  (29065, 13)
After deletion:   (29062, 13)
Out[19]:
ID Gender Age_In_Days Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years

I split age into 10 age groups and delete the 'Age_In_Days' column.

In [20]:
del df['Age_In_Days']
df['Age_years_10']= pd.qcut(df['Age_years'],10)
df['Age_years_10'].value_counts(normalize=True,dropna = False).sort_values(ascending=False).plot(kind='bar')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x223eb831550>

Analysis of Hypertension

In [21]:
df['Hypertension'].value_counts(normalize=True,dropna = False)
Out[21]:
0    0.88848
1    0.11152
Name: Hypertension, dtype: float64

Hypertension occurs in 11% of patients.

In [22]:
df['BMI_5']= pd.qcut(df['BMI'],5)
df.pivot_table(index='BMI_5', columns = 'Hypertension', values='Age_years',aggfunc='count')
Out[22]:
Hypertension 0 1
BMI_5
(10.099, 24.1] 5605 290
(24.1, 27.4] 5352 506
(27.4, 30.7] 5160 691
(30.7, 35.3] 4893 810
(35.3, 92.0] 4811 944
In [23]:
df['BMI_5'] = df['BMI_5'].astype(object)

plt.style.use('seaborn')

table=pd.crosstab(df['BMI_5'],df['Hypertension'])
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True, fontsize=14)
plt.title('BMI vs Hypertension', fontsize=20)
plt.xlabel('BMI groups')
plt.ylabel('Proportion')

del df['BMI_5']

The exogenous variable 'Hypertension' is valid and behaves in line with the common view that the higher the BMI, the more frequent hypertension is.

Analysis of Heart_Disease

In [24]:
df['Heart_Disease'].value_counts(normalize=True,dropna = False)
Out[24]:
0    0.947836
1    0.052164
Name: Heart_Disease, dtype: float64

Analysis of Ever_Married

In [25]:
df['Ever_Married'].value_counts(normalize=True,dropna = False)
Out[25]:
Yes    0.746198
No     0.253802
Name: Ever_Married, dtype: float64

Analysis of Type_Of_Work

In [26]:
df['Type_Of_Work'].value_counts(normalize=True,dropna = False)
Out[26]:
Private          0.651985
Self-employed    0.179065
Govt_job         0.144312
children         0.021162
Never_worked     0.003475
Name: Type_Of_Work, dtype: float64

Analysis of Residence

In [27]:
df['Residence'].value_counts(normalize=True,dropna = False)
Out[27]:
Urban    0.502099
Rural    0.497901
Name: Residence, dtype: float64

Analysis of Avg_Glucose

In [28]:
df['Avg_Glucose'].describe()
Out[28]:
count    29062.000000
mean       106.408801
std         45.273649
min         55.010000
25%
50%
75%
max        291.050000
Name: Avg_Glucose, dtype: float64
In [29]:
df['Avg_Glucose'].plot.kde()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x223ec35e438>

The probability density of 'Avg_Glucose' shows an anomaly that we will not explain at this stage of the study.

Analysis of Stroke

In [30]:
df['Stroke'].value_counts(normalize=True,dropna = False)
Out[30]:
0    0.981144
1    0.018856
Name: Stroke, dtype: float64

The outcome variable 'Stroke' is highly imbalanced. Out of curiosity, let's see whether any pattern of dependence between the dependent variable and the independent variables emerges.
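
As a quick, illustrative check (a sketch using the binned age column created above), the stroke rate per age decile can be tabulated like this:

# Mean of the 0/1 Stroke flag per age decile = stroke rate in that group.
df.pivot_table(index='Age_years_10', values='Stroke', aggfunc='mean')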

General analysis of the variables

It is carried out to check whether the behavior of the variables agrees with generally known facts.

In [31]:
df.columns
Out[31]:
Index(['ID', 'Gender', 'Hypertension', 'Heart_Disease', 'Ever_Married',
       'Type_Of_Work', 'Residence', 'Avg_Glucose', 'BMI', 'Smoking_Status',
       'Stroke', 'Age_years', 'Age_years_10'],
      dtype='object')
In [32]:
kot = ["#c0c2ce", "#e40c2b"]
sns.pairplot(data=df[[ 'Avg_Glucose', 'BMI', 'Stroke', 'Age_years']], hue='Stroke', dropna=True, palette=kot)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\nonparametric\kde.py:488: RuntimeWarning: invalid value encountered in true_divide
  binned = fast_linbin(X, a, b, gridsize) / (delta * nobs)
C:\ProgramData\Anaconda3\lib\site-packages\statsmodels\nonparametric\kdetools.py:34: RuntimeWarning: invalid value encountered in double_scalars
  FAC1 = 2*(np.pi*bw/RANGE)**2
Out[32]:
<seaborn.axisgrid.PairGrid at 0x223ec35e780>

A preliminary look at the continuous variables in the plot above shows that:

  1. the probability of stroke increases with age,
  2. stroke occurs most often in the BMI range 20–50,
  3. the glucose level seems to be of no importance.

The behavior of the variables confirms generally known facts.

Saving the cleaned and corrected dataset for further analysis

In [33]:
df.head(3)
Out[33]:
ID Gender Hypertension Heart_Disease Ever_Married Type_Of_Work Residence Avg_Glucose BMI Smoking_Status Stroke Age_years Age_years_10
1 30650 Male 1 0 Yes Private Urban 87.96 39.2 never smoked 0 58.093151 (53.126, 59.076]
3 57008 Female 0 0 Yes Private Rural 69.04 35.9 formerly smoked 0 70.076712 (65.121, 74.11]
6 53725 Female 0 0 Yes Private Urban 77.59 17.7 formerly smoked 0 52.041096 (48.082, 53.126]
In [34]:
df.isnull().sum()
Out[34]:
ID                0
Gender            0
Hypertension      0
Heart_Disease     0
Ever_Married      0
Type_Of_Work      0
Residence         0
Avg_Glucose       0
BMI               0
Smoking_Status    0
Stroke            0
Age_years         0
Age_years_10      0
dtype: int64
In [35]:
df.to_csv('c:/1/Stroke_Prediction_CLEAR.csv')
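
Note: by default to_csv also writes the DataFrame index as an extra column; when the file is reloaded in a later part it shows up as 'Unnamed: 0' (visible in the dtypes listing of Part 7). Passing index=False avoids this:

# Write without the index column so no 'Unnamed: 0' appears on reload.
df.to_csv('c:/1/Stroke_Prediction_CLEAR.csv', index=False)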

End of Part_1: Stroke_Prediction – Preparation of data for analysis

Part_2 Stroke_Prediction – Preparation of data for the classification process

The article Part_1 Stroke_Prediction – Preparation of data for analysis originally appeared on THE DATA SCIENCE LIBRARY.