Introduction
Logistic regression is a classification algorithm in machine learning. The model predicts the binary state of a dependent variable: the result variable takes the value 0 or 1.
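The model passes a linear combination of the predictors through the logistic (sigmoid) function, which squeezes any real number into the range (0, 1). A minimal sketch (the coefficients b0 and b1 below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    # Logistic function: maps any real z into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients: b0 is the intercept, b1 the slope of one predictor x
b0, b1 = -1.0, 2.0
x = np.array([-2.0, 0.0, 0.5, 3.0])
p = sigmoid(b0 + b1 * x)  # predicted probability of class 1 for each x

print(np.round(p, 3))
```

A probability above the chosen threshold (0.5 by default) is classified as 1, otherwise as 0.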
Main principles of the logistic regression model
- The dependent variable has a binary form.
- The model contains only variables that have a significant influence on the result.
- There is no collinearity among the independent variables (no correlation among predictors).
- A logistic model needs a large number of observations.
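The collinearity assumption can be screened quickly with a correlation matrix. A small sketch on a made-up frame (the columns x1, x2, x3 are hypothetical; x2 is built as an almost exact copy of x1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame({'x1': rng.normal(size=200)})
demo['x2'] = demo['x1'] * 0.95 + rng.normal(scale=0.1, size=200)  # nearly collinear with x1
demo['x3'] = rng.normal(size=200)                                 # independent noise

corr = demo.corr()
# Flag predictor pairs whose absolute correlation exceeds 0.8
high = [(a, b) for a in corr.columns for b in corr.columns
        if a < b and abs(corr.loc[a, b]) > 0.8]
print(high)
```

Only the (x1, x2) pair should be flagged; in such a case one of the two predictors should be dropped.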
Let's load the data with the needed libraries. We will be working on a registry of bank card operations. In the column 'Class', the value 0 means no fraud in the transaction, while 1 indicates embezzlement. We assume that the main aim is the correct classification of fraudulent transactions; unfortunately, we may also flag a good transaction as fraudulent.
## Logistic regression procedure for continuous descriptive variables
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler, Normalizer, scale
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import confusion_matrix, log_loss, auc, roc_curve, roc_auc_score, recall_score, precision_recall_curve
from sklearn.metrics import make_scorer, precision_score, fbeta_score, f1_score, classification_report
from sklearn.model_selection import cross_val_score, train_test_split, KFold, StratifiedShuffleSplit, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
df = pd.read_csv('c:/1/creditcard.csv')
df.head(3)
df.columns
Analysis of the balance of the dependent variable
A set is balanced when the values 1 and 0 have a similar number of occurrences. Our result variable is unbalanced because the subject of investigation is a rare phenomenon (marked as 1): frauds constitute only a tiny fraction of all transactions. When the set is unbalanced, the model may simply ignore the minority class (marked as 1). Such a model can show a very high accuracy despite missing all of the 1s.
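To see the danger, consider a trivial model that always predicts 0. A sketch on made-up labels with one fraud per thousand transactions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels: one fraud (1) among a thousand transactions
y_true = np.zeros(1000, dtype=int)
y_true[0] = 1

# A degenerate model that ignores the minority class entirely
y_pred = np.zeros(1000, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.999 -- looks excellent
print(recall_score(y_true, y_pred))    # 0.0   -- yet every fraud is missed
```

This is why we will track 'recall' and 'precision' rather than plain accuracy.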
We check the level of imbalance in the registry.
df.Class.value_counts(dropna = False)
sns.countplot(x='Class',data=df, palette='GnBu_d')
plt.show()
The plot is not very readable, so we compute the percentage structure as well.
Analysis of the independent variables in the logistic regression model
Let's recall two important assumptions of logistic regression.
- The model should include only independent variables that have a significant influence on the result variable.
- The independent variables should be mutually uncorrelated.
In the first stage we check how the independent variables influence the result variable.
df.Class.value_counts(dropna = False, normalize=True)
We check the mean amount of fraudulent and of regular transactions.
df.groupby('Class').Amount.mean()
df.Amount.agg(['min','max','mean','std']).astype(int)
As we can see, fraud, represented as 1 in the 'Class' column, constitutes only 0.46% of all transactions, so the set is deeply unbalanced.
Now we compute the correlation of each independent variable with the result variable.
CORREL = df.corr().sort_values('Class')
CORREL['Class']
The correlation vector shows that some of the independent variables have little or no influence on the result.
Statistical analysis of independent variables
We check how classes 1 and 0 differ in the 'Amount' and 'Time' of transactions.
pd.pivot_table(df, index='Class', values = 'Amount', aggfunc= [np.mean, np.median, min, max, np.std] )
pd.pivot_table(df, index='Class', values = 'Time', aggfunc= [np.mean, np.median, min, max, np.std])
We select the columns whose correlation with the dependent variable is above 0.4 or below -0.3.
kot = CORREL[(CORREL['Class']>0.4)|(CORREL['Class']<-0.3)][['Class']]
kot.index
We have kept only the variables strongly correlated with the result variable.
CORREL = df[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class']].corr().sort_values('Class')
CORREL['Class']
We would rather not use 'Amount' in the model. If we did, we could standardize this column, for example:
#scaler = StandardScaler()
#df['Amount'] = scaler.fit_transform(df['Amount'].values.reshape(-1, 1))
Let’s check correlation among independent variables.
sns.heatmap(df.corr(), cmap="coolwarm")
Before we start composing the model we need to tidy the data.
- We remove the columns we will not use.
- We remove records with missing (empty) values.
- We do not standardize the data.
- We check the statistical parameters of the data.
#df.drop(columns = 'Time', inplace = True)
#del df['Amount']
df.isnull().sum()
df = df.dropna(how='any')
df.agg(['min','max','mean','std'])[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1', 'Class']]
sns.heatmap(df[['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1']].corr(), cmap="YlGnBu", annot=True, cbar=False)
We do not find any significant mutual correlation between the independent variables. All records with empty cells were removed before creating the model.
Creating the logistic regression model
We declare which columns are the independent (descriptive) variables and which is the dependent (result) variable, and we split the data into training and test sets.
feature_cols = ['V3', 'V14', 'V17', 'V7', 'V10', 'V16', 'V12', 'V1']
X = df[feature_cols]
y = df.Class
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size = .33, stratify = y, random_state = 148)
We configure the settings for the grid search.
Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
LR = LogisticRegression(warm_start = True)
LR_Grid = GridSearchCV(LR, param_grid = Parameteres, scoring = 'roc_auc', n_jobs = 5, cv=2)
Explanation for the code:
Parameteres
Parameteres = {'C': np.power(10.0, np.arange(-3, 3))}
array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02])
This is a typical search range for the grid.
warm_start
Using warm_start=True reuses the solution of the previous fit as the initialization of the next fit, which can speed up convergence. The parameter is useful when fitting the same model multiple times with different settings.
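A minimal sketch of the idea, on a synthetic dataset (the data here is made up with make_classification):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# With warm_start=True every .fit() starts from the previous coefficients
clf = LogisticRegression(warm_start=True, max_iter=1000)
for C in [0.01, 0.1, 1.0]:
    clf.set_params(C=C)
    clf.fit(X, y)  # reuses the last solution as the starting point

print(clf.coef_.shape)  # one coefficient per feature: (1, 20)
```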
scoring = 'roc_auc'
The ROC curve evaluates the quality of a classifier across decision thresholds. The area under the ROC curve (AUC) is the most popular way for the grid search to measure classification efficiency.
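A quick sketch of what scoring='roc_auc' computes, on made-up labels and scores:

```python
from sklearn.metrics import roc_auc_score

# Made-up true labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a random positive is ranked above a random negative
print(roc_auc_score(y_true, y_score))  # 0.75
```

Here three of the four positive/negative pairs are ranked correctly, hence 0.75.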
n_jobs = 5
The number of tasks run in parallel.
cv = 2
The number of cross-validation folds.
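With cv = 2 the data is split into two folds; the model is trained on one and scored on the other, and the two scores are averaged. A sketch with cross_val_score on a made-up dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)

# Two-fold cross-validation: each half serves once as the validation set
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring='roc_auc', cv=2)
print(len(scores))  # one AUC per fold -> 2
```

More folds give a more stable estimate at a higher computational cost.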
We fit the model:
LR_Grid.fit(Xtrain, ytrain)
We check which hyperparameters have been chosen as the best by the grid.
print("The Best parameter:",LR_Grid.best_params_)
print("The Best estimator:",LR_Grid.best_estimator_)
Evaluation of the logistic regression classification
We use a diagnostic block of code:
print("\n------Training data---------------------------------------------------")
print("The RECALL Training data: ", np.round(recall_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print("The PRECISION Training data: ", np.round(precision_score(ytrain, LR_Grid.predict(Xtrain)), decimals=3))
print()
print("------Test data-------------------------------------------------------")
print("The RECALL Test data is: ", np.round(recall_score(ytest, LR_Grid.predict(Xtest)), decimals=3))
print("The PRECISION Test data is: ", np.round(precision_score(ytest, LR_Grid.predict(Xtest)), decimals=3))
print()
print("The Confusion Matrix Test data :--------------------------------------")
print(confusion_matrix(ytest, LR_Grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LR_Grid.predict(Xtest)))
# PLOT
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest, y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
# plt.axvline(0.5, color='#00C851', linestyle='--')
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
On the test data, the 'recall' score is 0.62 and the 'precision' score is 0.67.
Changing the threshold of the logistic regression
Lowering the threshold of the logistic regression increases 'recall' at the expense of the accuracy shown by the 'precision' ratio. Remember, the bank tracks all frauds and embezzlements committed with credit cards. The cost of a false accusation, when the model flags a clean transaction as fraud, is relatively small. The bank is interested above all in finding fraud, almost at any cost.
In the logistic regression model the default threshold is 0.5. We move the threshold down to 0.1.
LR_Grid_ytest = LR_Grid.predict_proba(Xtest)[:, 1]
new_threshold = 0.1
ytest_pred = (LR_Grid_ytest >= new_threshold).astype(int)
ytest_pred
We launch the diagnostic module.
print("\n------Training data---------------------------------------------------")
ytrain_pred = (LR_Grid.predict_proba(Xtrain)[:, 1] >= new_threshold).astype(int)
print("RECALL Training data (new_threshold = 0.1): ", np.round(recall_score(ytrain, ytrain_pred), decimals=3))
print("PRECISION Training data (new_threshold = 0.1): ", np.round(precision_score(ytrain, ytrain_pred), decimals=3))
print("------Test data-------------------------------------------------------")
print("RECALL Test data (new_threshold = 0.1): ", np.round(recall_score(ytest, ytest_pred), decimals=3))
print("PRECISION Test data (new_threshold = 0.1): ", np.round(precision_score(ytest, ytest_pred), decimals=3))
print()
print("The Confusion Matrix Test data (new_threshold = 0.1):-----------------")
print(confusion_matrix(ytest, ytest_pred))
print("----------------------------------------------------------------------")
print(classification_report(ytest, ytest_pred))
# PLOT-------------------------------------------
y_pred_proba = LR_Grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest, y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.axvline(0.1, color = '#00C251', linestyle = '--', label = 'threshold = 0.1')
plt.axvline(0.5, color = 'grey', linestyle = '--', label = 'threshold = 0.5')
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
The 'recall' score increased to 0.73, while the 'precision' score fell to 0.58.
Oversampling
Oversampling is a method of partially removing the effects of an unbalanced set of result variables. The method is applied only to the training set, on which the model is trained. When the result variable is unbalanced, the model tends to avoid the rare results (1) in favour of the frequent results (0); frankly speaking, all models tend to generalize reality. From the bank's point of view it is more important to catch embezzlement, even if the model occasionally calls a proper transaction a fraud, than to miss a couple of frauds in the registry. We can sensitize the model to the rare class by oversampling and by moving the probability threshold. Sensitizing the model to frauds lowers 'precision' in favour of the 'recall' ratio.
Oversampling by cloning
Random oversampling consists in supplementing the minority class with copies of its own samples. The copying can be done more than once (2x, 3x, 5x, 10x, etc.).
Undersampling by elimination
Random undersampling consists in eliminating samples from the majority class (class 0), with or without replacement. It is one of the earliest techniques for reducing imbalance in datasets. Undersampling can increase the variance of the classifier and may discard useful samples.
Source of data: https://en.wikipedia.org/wiki/Oversampling_and_undersampling_in_data_analysis
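Both ideas can be sketched with sklearn.utils.resample on a tiny made-up frame (the columns x and y are hypothetical):

```python
import pandas as pd
from sklearn.utils import resample

# Made-up unbalanced frame: 8 majority rows (y=0) and 2 minority rows (y=1)
demo = pd.DataFrame({'x': range(10), 'y': [0] * 8 + [1] * 2})
majority = demo[demo.y == 0]
minority = demo[demo.y == 1]

# Oversampling: draw minority rows WITH replacement up to the majority size
over = resample(minority, replace=True, n_samples=len(majority), random_state=0)
balanced_over = pd.concat([majority, over])

# Undersampling: draw majority rows WITHOUT replacement down to the minority size
under = resample(majority, replace=False, n_samples=len(minority), random_state=0)
balanced_under = pd.concat([under, minority])

print(balanced_over.y.value_counts().to_dict())   # {0: 8, 1: 8}
print(balanced_under.y.value_counts().to_dict())  # {0: 2, 1: 2}
```

Below we implement oversampling by hand with pd.concat, which makes the mechanics explicit.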
Procedure of oversampling
Let's recall how unbalanced our dataset is.
df.Class.value_counts(dropna = False, normalize=True)
The minority class constitutes only 0.46% of all result variables. It means that for each fraudulent transaction (marked 1) there are about 217 clean transactions (marked 0). Let's reproduce this proportion in the training set.
print("ytrain = 0: ", sum(ytrain == 0))
print("ytrain = 1: ", sum(ytrain == 1))
###### OVERSAMPLING #######################################
OVS_gauge = sum(ytrain == 0) / sum(ytrain == 1)
OVS_gauge = np.round(OVS_gauge, decimals=0)
OVS_gauge = OVS_gauge.astype(int)
OVS_gauge
In the training set there are 216 regular transactions for each fraud. The training set keeps the same proportion as the whole set thanks to the stratify = y parameter of the train/test split. Now we can multiply the number of dependent variables marked 1 by this factor.
ytrain_pos_OVS = pd.concat([ytrain[ytrain==1]] * OVS_gauge, axis = 0)
ytrain_pos_OVS.count()
This number is the count of fraud samples in the training set (there are 54 of them) multiplied by 216. Now we need to do the same for the independent variables X wherever the result y was 1.
#Xtrain.loc[ytrain==1, :]
Xtrain.count()
Xtrain.loc[ytrain==1, :].count()
These records with y == 1 should be multiplied 216 times.
Xtrain_pos_OVS = pd.concat([Xtrain.loc[ytrain==1, :]] * OVS_gauge, axis = 0)
Now we append the new, additional samples to the training set.
# concat the repeated data with the original data together
ytrain_OVS = pd.concat([ytrain, ytrain_pos_OVS], axis = 0).reset_index(drop = True)
Xtrain_OVS = pd.concat([Xtrain, Xtrain_pos_OVS], axis = 0).reset_index(drop = True)
At the beginning of the study we had 11711 records in the dataset.
Xtrain_pos_OVS.count()
ytrain_OVS.count()
Now we feed the oversampled training set to the grid search.
Parametry2 = {'C': np.power(10.0, np.arange(-3, 3))}
OVS_reg = LogisticRegression(warm_start = True, solver='lbfgs')
OVS_grid = GridSearchCV(OVS_reg, param_grid = Parametry2, scoring = 'roc_auc', n_jobs = 5, cv = 6)
OVS_grid.fit(Xtrain_OVS, ytrain_OVS)
Now we use the diagnostic block.
print()
print("Recall Training data: ", np.round(recall_score(ytrain_OVS, OVS_grid.predict(Xtrain_OVS)), decimals=4))
print("Precision Training data: ", np.round(precision_score(ytrain_OVS, OVS_grid.predict(Xtrain_OVS)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data: ", np.round(recall_score(ytest, OVS_grid.predict(Xtest)), decimals=4))
print("Precision Test data: ", np.round(precision_score(ytest, OVS_grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, OVS_grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, OVS_grid.predict(Xtest)))
y_pred_proba = OVS_grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest, y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
The 'recall' score increased to 0.92, while the 'precision' score fell to 0.21.
class_weight
Using class_weight is a method to improve the 'recall' ratio on sets that are unbalanced in the result variable. The increase in 'recall' comes at the expense of the 'precision' ratio.
As we mentioned, in the training set there are 216 samples marked 0 for each sample marked 1. To limit this disproportion we increase the weight of the class marked 1 by a factor of 216.
Pw = sum(ytrain == 0) / sum(ytrain == 1) # size to repeat y == 1
Pw = np.round(Pw, decimals=0)
Pw = Pw.astype(int)
Pw
The 'positive weight' parameter: Pw = 216.
To increase the weight of class 1 we pass the dictionary {0: 1, 1: 216} to the model.
Parameters = {'C': np.power(10.0, np.arange(-3, 3))}
LogReg = LogisticRegression(class_weight = {0 : 1, 1 : Pw}, warm_start = True, solver='lbfgs')
We tune the model with the grid search.
LRV_Reg_grid = GridSearchCV(LogReg, param_grid = Parameters, scoring = 'roc_auc', n_jobs = 5, cv = 6)
LRV_Reg_grid.fit(Xtrain, ytrain)
We check which hyperparameters were chosen.
print("The Best parameter:",LRV_Reg_grid.best_params_)
print("The Best estimator:",LRV_Reg_grid.best_estimator_)
As usual, we turn on the diagnostic module.
print()
print("Recall Training data: ", np.round(recall_score(ytrain, LRV_Reg_grid.predict(Xtrain)), decimals=4))
print("Precision Training data: ", np.round(precision_score(ytrain, LRV_Reg_grid.predict(Xtrain)), decimals=4))
print("----------------------------------------------------------------------")
print("Recall Test data: ", np.round(recall_score(ytest, LRV_Reg_grid.predict(Xtest)), decimals=4))
print("Precision Test data: ", np.round(precision_score(ytest, LRV_Reg_grid.predict(Xtest)), decimals=4))
print("----------------------------------------------------------------------")
print("Confusion Matrix Test data")
print(confusion_matrix(ytest, LRV_Reg_grid.predict(Xtest)))
print("----------------------------------------------------------------------")
print(classification_report(ytest, LRV_Reg_grid.predict(Xtest)))
y_pred_proba = LRV_Reg_grid.predict_proba(Xtest)[::,1]
fpr, tpr, _ = metrics.roc_curve(ytest, y_pred_proba)
auc = metrics.roc_auc_score(ytest, y_pred_proba)
plt.plot(fpr, tpr, label='Logistic Regression (auc = %0.3f)' % auc)
plt.xlabel('False Positive Rate',color='grey', fontsize = 13)
plt.ylabel('True Positive Rate',color='grey', fontsize = 13)
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.plot([0, 1], [0, 1],'r--')
plt.show()
Thanks to applying class_weight, the weight of the dependent variable marked 1 was increased.
The 'recall' score increased to 0.88, while the 'precision' score fell to 0.28.
The same effect can be achieved automatically with the parameter class_weight='balanced'.
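In the 'balanced' mode scikit-learn computes the weights itself as n_samples / (n_classes * np.bincount(y)). A quick sketch of the resulting weights, on made-up labels with the 216:1 proportion from our training set:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 216 zeros for every single one
y = np.array([0] * 216 + [1])

# The formula behind class_weight='balanced'
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
print(np.round(w, 3))  # class 1 gets ~216 times the weight of class 0
```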