Continues of research on the diabetes causes using Pandas
Wojciech Moszczyński, THE DATA SCIENCE LIBRARY, Mon, 04 Sep 2017
Tag: medicine Pandas
https://sigmaquality.pl/uncategorized/continues-of-research-on-the-diabetes-causes-using-pandas/

This article, Continues of research on the diabetes causes using Pandas, comes from THE DATA SCIENCE LIBRARY.


Today we continue our research on the sample, trying to find out what causes diabetes. You can find this sample here:

First, we load the dataset into Pandas.

import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('c:/1/diabetes.csv')
df.head(3)

For no particular reason, I decided to check how the glucose level relates to the number of pregnancies.

df.groupby('Pregnancies').Glucose.mean().plot(kind='bar')

Note that many columns in the sample, such as blood pressure or skin thickness, contain zeros.

We need to replace these zeros with NaN, because functions such as mean treat zero as a valid number, whereas NaN values are skipped and do not enter the calculation.
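Before replacing anything, it can help to count how many zeros each column holds. The sketch below is only an illustration: the `toy` frame and the `measurement_cols` list are hypothetical stand-ins that borrow the column names from this post, not the real data.

```python
import pandas as pd

# Hypothetical toy rows using column names from the post; not real patient data.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5, 1],
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "SkinThickness": [35, 0, 0, 23],
})

# Columns where a zero cannot be a real measurement. Pregnancies is left out,
# because zero pregnancies is a legitimate value.
measurement_cols = ["Glucose", "BloodPressure", "SkinThickness"]

# Comparing with 0 gives a boolean frame; summing counts the True cells per column.
zero_counts = (toy[measurement_cols] == 0).sum()
print(zero_counts)
```

The same expression should work on the real `df` once the column list matches the file.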

df[['Glucose','BloodPressure','SkinThickness', 'Insulin','BMI']] = df[['Glucose','BloodPressure','SkinThickness', 'Insulin','BMI']].replace(0,np.nan)
df.groupby('Pregnancies').Glucose.mean().plot(kind='bar')
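To see why the replacement matters, here is a minimal sketch on a tiny made-up frame (the values are assumptions chosen for easy arithmetic, not taken from the dataset). It compares the mean with zeros counted against the mean after the zeros become NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical readings; 0 stands in for a missing measurement.
toy = pd.DataFrame({"Glucose": [100, 0, 140], "BloodPressure": [70, 80, 0]})

# Zeros are treated as real values: (100 + 0 + 140) / 3
mean_with_zeros = toy["Glucose"].mean()

# After the replacement, NaN is skipped by mean(): (100 + 140) / 2
cleaned = toy.replace(0, np.nan)
mean_without_zeros = cleaned["Glucose"].mean()

# Number of values now marked as missing in each column
missing_per_column = cleaned.isna().sum()

print(mean_with_zeros, mean_without_zeros)
print(missing_per_column)
```

Counting the missing values afterwards is a quick check of how much data each column loses.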

Let's check the basic statistics of Pregnancies.

df.Pregnancies.agg(['max','min', 'mean', 'std'])

It is also good to know how pregnancies are distributed in our sample.

df.Pregnancies.value_counts()

As we can see, the most common values are zero, one, or two pregnancies.
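Raw counts can be easier to read as shares. A small sketch, assuming a made-up `pregnancies` series rather than the real column:

```python
import pandas as pd

# Hypothetical pregnancy counts for eight patients; not the real sample.
pregnancies = pd.Series([0, 1, 1, 2, 0, 1, 5, 2], name="Pregnancies")

# normalize=True returns fractions instead of counts;
# sort_index() orders the result by number of pregnancies rather than by frequency.
shares = pregnancies.value_counts(normalize=True).sort_index()
print(shares)
```

On the real data the equivalent call would be `df.Pregnancies.value_counts(normalize=True).sort_index()`.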

df.Pregnancies.plot.kde(legend=False, title='Pregnancies')

Last time I decided that the 'Glucose' variable was too strongly correlated with the dependent variable, so I removed it from the machine learning models.
Now I see that 'Glucose' can be a good predictor for future applications.

CORREL = df.corr().sort_values('Outcome',ascending=False)
CORREL['Outcome']
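The sorted column always starts with 'Outcome' itself, whose correlation with itself is 1. A possible variation drops that entry so only the candidate predictors remain. The `toy` frame below is hypothetical and exists only to make the sketch runnable.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; one Glucose value is missing on purpose.
toy = pd.DataFrame({
    "Glucose": [90, 100, 150, 160, np.nan, 170],
    "Age": [25, 50, 30, 45, 40, 35],
    "Outcome": [0, 0, 1, 1, 0, 1],
})

# corr() computes Pearson correlations and skips NaN pair by pair.
# Dropping "Outcome" removes the trivial self-correlation of 1.0.
corr_with_outcome = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(ascending=False)
)
print(corr_with_outcome)
```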

df.groupby('Outcome').Glucose.mean().plot(kind='bar')

Clearly, patients with diabetes have a higher glucose level than healthy patients.

It is interesting to see which variables are associated with a high glucose level.
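A bar chart of means hides how many patients stand behind each bar. As a rough sketch on invented numbers (the `toy` frame is an assumption, not the dataset), the same groupby can return several statistics at once:

```python
import numpy as np
import pandas as pd

# Hypothetical rows: Outcome 1 marks diabetes, 0 marks no diabetes.
toy = pd.DataFrame({
    "Outcome": [0, 0, 0, 1, 1, 1],
    "Glucose": [90, 100, np.nan, 140, 160, 180],
})

# mean and median describe the level; count shows how many non-missing
# readings each group actually contributes (NaN is not counted).
summary = toy.groupby("Outcome")["Glucose"].agg(["mean", "median", "count"])
print(summary)
```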

CORREL = df.corr().sort_values('Glucose',ascending=False)
CORREL['Glucose']

It seems that insulin has the strongest correlation with glucose.

It is a pity that medicine is not my specialisation.
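Because the zeros were replaced with NaN, each correlation is computed only on rows where both columns are present. A short sketch, again on hypothetical values, shows how to check how many rows actually support one coefficient:

```python
import numpy as np
import pandas as pd

# Hypothetical readings; half of the Insulin values are missing.
toy = pd.DataFrame({
    "Glucose": [85, 90, 120, 150, 180, 110],
    "Insulin": [np.nan, 60, 100, np.nan, 200, np.nan],
})

# Series.corr uses only the rows where both values are present.
r = toy["Glucose"].corr(toy["Insulin"])

# Number of complete pairs behind that coefficient
n_pairs = toy[["Glucose", "Insulin"]].dropna().shape[0]

print(round(r, 3), n_pairs)
```

A high coefficient based on few pairs deserves more caution than one based on the full sample, and in either case a correlation alone does not show which variable influences the other.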

df2 = df[['Glucose','Insulin','DiabetesPedigreeFunction','BloodPressure','SkinThickness','BMI','Pregnancies','Age']]
CORR2 = df2.corr()
CORR2

sns.heatmap(CORR2, annot=True, cbar=False, cmap="coolwarm")
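Alongside the heatmap, the strongest pairs can be listed as plain numbers. The following is one possible sketch, not part of the original analysis; the `toy` frame is made up, and on the real data `corr` would be `CORR2`.

```python
import numpy as np
import pandas as pd

# Hypothetical readings for three of the columns used above.
toy = pd.DataFrame({
    "Glucose": [85, 90, 120, 150, 180, 110],
    "Insulin": [50, 60, 100, 140, 200, 90],
    "Age": [50, 21, 33, 45, 29, 60],
})
corr = toy.corr()

# Keep only the cells strictly above the diagonal, so every pair
# appears once and the self-correlations are left out.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# stack() turns the matrix into a Series indexed by (row, column);
# dropna() removes the masked cells; sort by absolute strength.
pairs = (
    corr.where(upper)
    .stack()
    .dropna()
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(pairs)
```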
