Continuing the research on the causes of diabetes using Pandas

Today we continue our research on a sample dataset, trying to find out what causes diabetes. You can find this sample here:

First, we load the data into Pandas.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('c:/1/diabetes.csv')
df.head(3)

Out of curiosity, I decided to check the relationship between glucose level and the number of pregnancies.

df.groupby('Pregnancies').Glucose.mean().plot(kind='bar')
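The bar chart shows the mean glucose level for each pregnancy count, but it does not give a correlation coefficient. As a minimal sketch, the coefficient itself can be computed with `Series.corr`. The small `toy` frame below is made-up data used only for illustration; it is not the diabetes sample.

```python
import pandas as pd

# Made-up illustrative values, NOT taken from diabetes.csv
toy = pd.DataFrame({
    'Pregnancies': [0, 1, 2, 3, 4, 5],
    'Glucose': [100, 104, 103, 110, 115, 118],
})

# Pearson correlation (the default method) between the two columns
pearson = toy['Pregnancies'].corr(toy['Glucose'])
print(round(pearson, 3))
```

On the real data the same call would be `df['Pregnancies'].corr(df['Glucose'])`.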

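Note that many columns in the sample, such as blood pressure or skin thickness, contain zeros.

A blood pressure or skin thickness of zero is not a plausible measurement, so these zeros most likely mark missing values. We need to replace them with NaN, because aggregate functions (for example, the mean) treat zero as a valid number, whereas NaN values are skipped and do not enter the calculation.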

cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols] = df[cols].replace(0, np.nan)
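To see why this replacement matters, here is a small sketch on made-up readings (not values from the sample), where 0 stands for "not measured". It compares the mean before and after the zeros are marked as missing.

```python
import numpy as np
import pandas as pd

# Made-up readings; 0 stands for "not measured"
s = pd.Series([0, 80, 0, 70, 90])

mean_with_zeros = s.mean()            # zeros pull the mean down
cleaned = s.replace(0, np.nan)        # mark zeros as missing
mean_without_zeros = cleaned.mean()   # NaN is skipped by default (skipna=True)

print(mean_with_zeros, mean_without_zeros)   # 48.0 80.0
print(cleaned.isna().sum())                  # 2 values treated as missing
```

The same `isna().sum()` call on the real columns shows how many values were affected by the replacement.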

df.groupby('Pregnancies').Glucose.mean().plot(kind='bar')

Let's check the basic statistics of Pregnancies.

df.Pregnancies.agg(['max','min', 'mean', 'std'])

It is also good to know how pregnancies are distributed in our sample.

df.Pregnancies.value_counts()

As we can see, zero, one, or two pregnancies are the most common.
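Raw counts are easier to compare as shares of the whole sample. A minimal sketch, assuming a made-up `pregnancies` series rather than the real column, uses `value_counts(normalize=True)` for that:

```python
import pandas as pd

# Made-up pregnancy counts, NOT taken from diabetes.csv
pregnancies = pd.Series([0, 1, 1, 2, 0, 1, 3, 6, 2, 1])

# normalize=True returns fractions of the total instead of raw counts
shares = pregnancies.value_counts(normalize=True).sort_index()
print(shares)

# Share of patients with at most two pregnancies
share_up_to_two = (pregnancies <= 2).mean()
print(share_up_to_two)
```

On the real data, `df.Pregnancies.value_counts(normalize=True)` would give the corresponding shares.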

df.Pregnancies.plot.kde(legend=False, title='Pregnancies')

Last time I decided that the 'Glucose' feature was too strongly correlated with the dependent variable, so I removed it from the machine learning models.
Now I see that 'Glucose' could be a good predictor in future applications.

CORREL = df.corr().sort_values('Outcome',ascending=False)
CORREL['Outcome']

df.groupby('Outcome').Glucose.mean().plot(kind='bar')

Clearly, patients with diabetes have a higher average glucose level than healthy patients.

It is interesting to see which variables are related to a high glucose level.

CORREL = df.corr().sort_values('Glucose',ascending=False)
CORREL['Glucose']
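The sorted column above still contains the trivial correlation of 'Glucose' with itself, and strong negative correlations end up at the bottom. One possible variant, sketched here on a made-up `toy` frame (not the diabetes sample), drops the self-correlation and ranks the remaining variables by absolute value:

```python
import pandas as pd

# Made-up values for illustration only, NOT the diabetes sample
toy = pd.DataFrame({
    'Glucose': [90, 100, 110, 120, 130, 140],
    'Insulin': [60, 75, 85, 110, 120, 150],
    'Age':     [50, 30, 45, 25, 40, 35],
})

ranked = (
    toy.corr()['Glucose']
       .drop('Glucose')                         # remove the trivial self-correlation
       .sort_values(key=abs, ascending=False)   # strongest first, regardless of sign
)
print(ranked)
```

The `key` argument of `sort_values` requires pandas 1.1 or newer.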

It seems that insulin has the strongest correlation with glucose.

It is a pity that medicine is not my specialization.

df2 = df[['Glucose', 'Insulin', 'DiabetesPedigreeFunction', 'BloodPressure', 'SkinThickness', 'BMI', 'Pregnancies', 'Age']]
CORR2 = df2.corr()
CORR2

sns.heatmap(CORR2, annot=True, cbar=False, cmap="coolwarm")
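A heatmap is convenient for a quick look, but the strongest pairs can also be listed as a table. The sketch below is one way to do it, assuming a made-up `toy` frame in place of `df2`: it keeps only the upper triangle of the correlation matrix so that each pair appears once, then sorts the pairs by absolute correlation.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only, NOT the diabetes sample
toy = pd.DataFrame({
    'Glucose': [90, 100, 110, 120, 130, 140],
    'Insulin': [60, 75, 85, 110, 120, 150],
    'Age':     [50, 30, 45, 25, 40, 35],
})

corr = toy.corr()

# Upper triangle only; k=1 excludes the diagonal of self-correlations
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

pairs = (
    corr.where(mask)     # everything outside the mask becomes NaN
        .stack()         # long format: (row, column) -> correlation
        .dropna()        # make sure the masked cells are removed
        .sort_values(key=abs, ascending=False)
)
print(pairs)
```

With the real data, `CORR2` would take the place of `corr`.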

THE DATA SCIENCE LIBRARY

Wojciech Moszczyński
