Analysis of the incidence of diabetes. Random Forest method

Application of Machine Learning in clinical trials

Our aim is to build a machine learning model that predicts the incidence of diabetes. The data set we use in our investigation, together with its description, can be found here: https://www.kaggle.com/kumargh/pimaindiansdiabetescsv

We excluded from the model the column containing the blood glucose level. This exogenous variable was too strongly correlated with the endogenous (outcome) variable. A single very good predictor can dominate the weaker predictors.

Moreover, the glucose variable adds nothing to the model. A high glucose level indicates the presence of diabetes; it is not a factor that causes the illness.

We load the required libraries and then display the first 5 rows of the data.

import pandas as pd
import numpy as np

df = pd.read_csv('c:/1/diabetes.csv', usecols=['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'Outcome'])
df.head(5)

As we can easily see, the data set contains inaccuracies. For example, a patient cannot have skin that is zero millimeters thick. For now we leave everything as it is and do not correct it.
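
To see how widespread such zero entries are, one can count them per column. The sketch below uses a small hand-made table with hypothetical values (it does not read the real file), and the names `sample` and `zero_counts` are introduced only for this illustration.

```python
import pandas as pd

# Hypothetical rows shaped like some of the article's columns;
# these values are invented for the illustration.
sample = pd.DataFrame({
    "BloodPressure": [72, 0, 64, 66],
    "SkinThickness": [35, 29, 0, 0],
    "Insulin": [0, 0, 94, 168],
    "BMI": [33.6, 26.6, 0.0, 28.1],
})

# Compare every cell with zero and sum the True values column by column.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

The same expression applied to `df` would show which columns of the real data are affected.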

X = df.drop('Outcome', axis=1) 
y = df['Outcome']

Preparing the data set for modeling

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

[su_highlight]test_size[/su_highlight]

Parameter 'test_size': we use 80% of the data for training, and the remaining 20% is used for testing.

Parameter 'random_state' fixes the seed of the random number generator, which makes the split reproducible. The particular number we put there does not matter.

If this parameter is missing from the code, the algorithm draws a different random split on every run. This makes the results of the model impossible to reproduce.

[su_highlight]stratify[/su_highlight]

Parameter 'stratify=y' makes the structure of the training and test samples the same as the structure of the whole data set. Thanks to this parameter, the proportion of the outcome classes is the same in each sample as in the population.
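
The effect of these three parameters can be checked on a toy example. This is only a sketch on hypothetical data (100 rows, 35% of them labeled 1); the names `X_demo`, `y_demo`, `split_a` and `split_b` are not part of the original code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 35 of them in class 1 (not the diabetes file).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 35 + [0] * 65)

# Two calls with identical settings, including the same random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.20,
                           random_state=123, stratify=y_demo)
split_b = train_test_split(X_demo, y_demo, test_size=0.20,
                           random_state=123, stratify=y_demo)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))      # sizes of the training and test parts
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part

# With the same random_state the two test sets are identical.
same_split = np.array_equal(split_a[1], split_b[1])
print(same_split)
```

Both parts keep the 35% share of class 1, and repeating the call with the same seed returns the same rows.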

Pipeline as a combination of transformation and estimation

A pipeline is an object that combines data standardization and estimation into a single process.

[su_highlight]preprocessing.StandardScaler[/su_highlight]

The transformation 'preprocessing.StandardScaler()' standardizes the data so that each variable has a mean of 0 and a standard deviation of 1. It rescales the variables but does not change the shape of their distribution.
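
A quick check of what the scaler does, as a minimal sketch on a hypothetical 4x2 array (the values and the names `raw` and `scaled` are assumptions made for this example):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: each column contains one clearly larger entry.
raw = np.array([[1.0, 10.0],
                [2.0, 10.0],
                [3.0, 10.0],
                [10.0, 50.0]])

scaled = StandardScaler().fit_transform(raw)

print(scaled.mean(axis=0))  # each column mean is (numerically) 0
print(scaled.std(axis=0))   # each column standard deviation is 1
print(scaled[:, 0])         # the last value is still the largest one
```

The columns are shifted and rescaled, but the ordering of the values and the outlier in the last row remain.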

[su_highlight]RandomForestRegressor[/su_highlight]

The estimator 'RandomForestRegressor(n_estimators=100)' models the dependent variable as a function of the independent variables.

The parameter n_estimators is set to 100, which means one hundred decision trees. We can use more trees, but this slows down the computation without necessarily improving the predictive ability significantly.

from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100))
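
Because the estimator is a regressor trained on a 0/1 outcome, its predictions are averages over the trees and fall between 0 and 1. The sketch below illustrates this on randomly generated, hypothetical data; `X_demo`, `y_demo` and `forest` are names used only here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 200 rows, 4 features, binary target driven by feature 0.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_demo, y_demo)

n_trees = len(forest.estimators_)   # one fitted tree per estimator
pred = forest.predict(X_demo[:5])   # continuous values, not class labels
print(n_trees)
print(pred)
```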

Choosing the hyperparameters

Now we define the hyperparameters for tuning the model. In this part of the code we declare which settings will be searched.

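# Note: recent scikit-learn releases no longer accept 'auto';
# for a regressor the equivalent value there is 1.0 (all features).
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}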

[su_highlight]max_features[/su_highlight]

The parameter 'max_features' specifies the number of features considered when looking for the best split.

We try three settings: ['auto', 'sqrt', 'log2']. The program automatically chooses the best one.

max_features: 'auto': no subset of features is drawn, so the "random forest" is actually a bagged set of ordinary regression trees. With this setting every tree may consider all the features at each split. No restrictions are placed on individual trees.

max_features: 'sqrt': this option considers the square root of the total number of features at each split. For example, if the total number of variables is 100, only 10 of them are considered at a single split.

[su_highlight]max_depth[/su_highlight] indicates the maximum depth of each decision tree.

The next lines of code run the tuning of the prediction algorithm. The 'fit' method performs the tuning.
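
The prefix 'randomforestregressor__' in the hyperparameter names comes from the step names that make_pipeline assigns automatically. The sketch below lists those names and counts the combinations in the grid. As an assumption for this example, the value 1.0 stands in for 'auto' so that the grid is also valid on recent scikit-learn releases; `step_names` and `n_candidates` are illustrative names.

```python
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(preprocessing.StandardScaler(),
                         RandomForestRegressor(n_estimators=100))

# make_pipeline names each step after its class, in lower case.
step_names = [name for name, _ in pipeline.steps]
print(step_names)

# '<step name>__<parameter>' addresses a parameter of that step.
# Assumption: 1.0 (all features) replaces 'auto' here.
hyperparameters = {'randomforestregressor__max_features': [1.0, 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

# 3 values of max_features times 4 values of max_depth.
n_candidates = len(ParameterGrid(hyperparameters))
print(n_candidates)
```

With 12 candidates and 10-fold cross-validation, a grid search fits the pipeline 120 times before refitting the best variant.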

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)

The code above is responsible for tuning the model. From the large number of combinations, it chooses the best variant.

There are two methods of searching for the best hyperparameters when tuning a model:

Grid search
Random search

[su_highlight]GridSearchCV[/su_highlight] performs a grid search, which is not random: it systematically checks every combination of the hyperparameters.
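
For comparison, the second method, random search, is available in scikit-learn as RandomizedSearchCV. The following is only a sketch on synthetic data from make_classification, not the article's pipeline; the small settings (20 trees, 3 folds, 4 sampled candidates) and the value 1.0 in place of 'auto' are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the diabetes data set).
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=0)

param_distributions = {'max_features': [1.0, 'sqrt', 'log2'],
                       'max_depth': [None, 5, 3, 1]}

# n_iter=4: only 4 of the 12 possible combinations are sampled and evaluated.
search = RandomizedSearchCV(RandomForestRegressor(n_estimators=20, random_state=0),
                            param_distributions, n_iter=4, cv=3, random_state=0)
search.fit(X_demo, y_demo)

n_tried = len(search.cv_results_['params'])
print(n_tried)
print(search.best_params_)
```

Random search trades completeness for speed: it may miss the best combination, but its cost is set by n_iter rather than by the size of the grid.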

Let's briefly recap

In the pipeline we declared the method of data transformation, StandardScaler(), and the method of estimation, RandomForestRegressor().

In the next step we specified the hyperparameters: max_features and max_depth.

The result of the tuning depends on how well the model classifies the data set.

The time has come to check how well our model explains reality!

Our aim was to build a model based on the values of variables from medical examinations.

We wanted to find out which of the variables are good, or the best, predictors. Our model returns 0 when the patient is healthy and 1 when the patient has the disease.

The code below shows which hyperparameter values were chosen as the best for the Random Forest model.

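import pprint
pparam = pprint.PrettyPrinter(indent=2)
pparam.pprint(clf.best_params_)  # pretty-print the best hyperparameters

Our model does not give binary answers to our question. Because it is a regressor, it returns values between 0 and 1. We can change this with a short piece of code that rounds them to the nearest class.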

y_pred = clf.predict(X_test)
y_pred = np.round(y_pred, decimals=0)
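
One detail of this rounding step is worth knowing: np.round rounds exact halves to the nearest even number, so a prediction of exactly 0.5 becomes 0. The sketch below, on hypothetical prediction values, compares rounding with an explicit threshold; the 0.5 cut-off and the variable names are assumptions made for the illustration.

```python
import numpy as np

# Hypothetical continuous outputs of a regressor.
y_cont = np.array([0.12, 0.5, 0.51, 0.93])

rounded = np.round(y_cont, decimals=0)      # 0.5 is rounded down to 0.0
thresholded = (y_cont >= 0.5).astype(int)   # 0.5 is assigned to class 1

print(rounded)
print(thresholded)
```

An explicit threshold makes the treatment of borderline cases visible and easy to change.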

Model evaluation by Confusion Matrix

To evaluate our model we can use a Confusion Matrix.

The matrix shows how good the model is. It compares the actual answers from the test set with the answers predicted by the model.

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix

The matrix has two rows and two columns because the model generates binary answers.
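
The layout that scikit-learn uses can be verified on a tiny hand-made example: rows correspond to the actual classes and columns to the predicted classes. The two label lists below are hypothetical and serve only to show where each count lands.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels for 8 patients.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_hat)
print(cm)

# For binary labels 0/1 the flattened order is:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```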

How to interpret the Confusion Matrix?

The matrix shows how many times the model gave the correct answer and how many times it made a mistake.

Imagine a scheme of a Confusion Matrix with four cells labeled A, B, C and D.

The numbers in the white cells (A and D) are correct answers; the numbers in the black cells (B and C) are mistakes.

Intuitively, we rate more highly the models that have significantly larger values in the white cells than in the black cells.

We can use special indicators to better understand the quality of the model.

Confusion Matrix indicators

[su_highlight]Accuracy (ACC)[/su_highlight] of the model is interpreted as the accuracy of the classification. It is calculated as the sum of the numbers in the white cells divided by the sum of all the numbers in the matrix. The higher the percentage, the better.

ACC = (A + D) / (A + B + C + D)

(87+20) / (87+20+13+34) = 69%

[su_highlight]Precision or positive predictive value (PPV)[/su_highlight] measures how reliable the predictions of the model are. It is calculated as: PPV = A / (A + B).

87 / (87+13) = 87%

[su_highlight]Sensitivity, recall, hit rate, or true positive rate (TPR)[/su_highlight] answers the question: how many of the patients who actually have the illness were detected by the model as ill? It is calculated as: TPR = A / (A + C).

87 / (87+34) = 72%

F-Score

It is difficult to compare a model with low precision and high recall against a model with high precision and low recall.

The F-Score takes recall and precision into account simultaneously.

F-Score = (2 * Precision * Recall) / (Precision + Recall) = 2A / (2A + B + C)

2*87 / (2*87 + 34 + 13) = 79%
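
The arithmetic above can be reproduced in a few lines. The sketch uses the cell labels A, B, C and D from the article's scheme and the four counts quoted in the calculations; it also checks that the two forms of the F-Score formula agree.

```python
# Cell counts as used in the calculations above.
A, B, C, D = 87, 13, 34, 20

acc = (A + D) / (A + B + C + D)        # accuracy
ppv = A / (A + B)                      # precision
tpr = A / (A + C)                      # recall
f_score = 2 * ppv * tpr / (ppv + tpr)  # harmonic mean of precision and recall
f_score_alt = 2 * A / (2 * A + B + C)  # the same value from the counts

print(round(acc, 2), round(ppv, 2), round(tpr, 2), round(f_score, 2))
```

Which cell plays the role of A depends on which class is treated as positive; scikit-learn's confusion_matrix puts the actual classes in rows and the predicted classes in columns.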

All these calculations can be obtained with the code below:

print(classification_report(y_test, y_pred))

Finally, we can save the model to disk.

import joblib
joblib.dump(clf, 'c:/1/rf_abc.pkl')

To load the model, you can use the following command.

clf2 = joblib.load('c:/1/rf_abc.pkl')
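
A quick way to confirm that a saved model survives the round trip is to compare its predictions before and after loading. This sketch trains a small forest on hypothetical data and writes it to a temporary directory rather than to the path used above; the file name 'rf_demo.pkl' and all variable names are assumptions for the example.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data, used only to have a fitted model to save.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rf_demo.pkl')
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give exactly the same predictions.
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```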

Entire code

import pandas as pd
import numpy as np

df = pd.read_csv('c:/1/diabetes.csv', usecols=['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age', 'Outcome'])
df.head(5)

# select the explanatory and outcome data
X = df.drop('Outcome', axis=1) #<-- all columns except Outcome are the explanatory variables X
y = df['Outcome']              #<-- Outcome is the dependent variable y


#df[['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']] = df[['Pregnancies', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'Age']].replace(0,np.nan)
#df = df.dropna(how='any')

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123,stratify=y)

from sklearn.pipeline import make_pipeline
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(preprocessing.StandardScaler(),
    RandomForestRegressor(n_estimators=100))

# Note: recent scikit-learn releases no longer accept 'auto'; use 1.0 (all features) there.
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}

from sklearn.model_selection import GridSearchCV
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
y_pred = np.round(y_pred, decimals=0)

from sklearn.metrics import classification_report, confusion_matrix
from sklearn import metrics

co_matrix = metrics.confusion_matrix(y_test, y_pred)
co_matrix

print(classification_report(y_test, y_pred)) 

print("Accuracy: ",np.round(metrics.accuracy_score(y_test, y_pred), decimals=2))
print("Precision:",np.round(metrics.precision_score(y_test, y_pred), decimals=2))
print("Recall:   ",np.round(metrics.recall_score(y_test, y_pred), decimals=2))

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński