Simple statistical inference: an example of using the Pearson correlation coefficient


The Pearson coefficient is used to assess the strength of the correlation between two continuous variables.

Data source: https://www.kaggle.com/saurabh00007/diabetescsv

In [1]:
import pandas as pd
df = pd.read_csv('c:/1/diabetes.csv')
df.head(3)
Out[1]:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
 

The aim of the study is to check whether (H1) there is a statistically significant positive correlation between patients' blood insulin level and BMI.

  • H0 – there is no statistically significant relationship between blood insulin level and BMI
  • H1 – there is a statistically significant relationship between blood insulin level and BMI
 

Data completeness check

Note that some measurements, such as blood pressure or insulin, are recorded as zero. These zeros are treated as missing data.
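As a quick check (a minimal sketch, assuming df is the DataFrame loaded above), we can count how many zeros each column contains:

# Count zero values per column; zeros stand in for missing measurements here,
# although zero is a valid value for some columns (e.g. Pregnancies, Outcome)
print((df == 0).sum())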

In [2]:
df.isnull().sum()
Out[2]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
 

No cells are formally null, but missing data is hidden in the form of zeros. I replace these zeros with NaN missing-value markers.

In [3]:
import numpy as np

# Replace zero placeholders with NaN in the measurement columns
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols] = df[cols].replace(0, np.nan)
In [4]:
print('Number of rows and columns: ',df.shape)
df.isnull().sum()
Number of rows and columns:  (768, 9)
Out[4]:
Pregnancies                 111
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64
 

Deleting records with missing data.

In [5]:
df = df.dropna(how='any')
In [6]:
print('Number of rows and columns: ',df.shape)
df.isnull().sum()
Number of rows and columns:  (336, 9)
Out[6]:
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
 

There is now much less data, but it is of better quality.

 

Pearson correlation coefficient

The Pearson correlation coefficient [1] measures the linear relationship between two datasets. The calculation of the p-value relies on the assumption that each dataset is normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation.
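For reference, for paired samples with means \bar{x} and \bar{y}, the coefficient is computed as:

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$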

 

scipy.stats.pearsonr returns a two-element tuple consisting of the correlation coefficient and the corresponding p-value:
The correlation coefficient ranges from -1 to +1.
The null hypothesis is that the two variables are uncorrelated. The p-value is a number between zero and one that represents the probability of obtaining data at least as extreme as the observed data if the null hypothesis were true.
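A minimal sketch of the call on synthetic data (the variable names and numbers here are illustrative, not part of the study):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)   # constructed to be positively correlated with x
r, p = stats.pearsonr(x, y)          # returns (correlation coefficient, p-value)
print(r, p)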

p-value

A low p-value (such as 0.01) is considered evidence that the null hypothesis can be rejected. Statisticians say that such a p-value is "very significant", or that "the data is significant at the 0.01 level".
A competent researcher studying a hypothesized relationship sets the significance level before the empirical study; usually 0.01 or 0.05 is used. If the test yields a p-value lower than the predetermined level, the researcher considers the result significant, rejects the null hypothesis, and concludes that the relationship really exists.
In the case of the relationship between alcohol consumption and breast cancer, for example, the correlation coefficient is about 0.4 with a p-value almost equal to zero. This tells us that the relationship is statistically significant (because the p-value is less than 0.05 or 0.01).
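The decision rule can be written out directly (a sketch; alpha is the threshold fixed before the study, and the numbers are the illustrative ones from the example above):

alpha = 0.01                 # significance level chosen in advance
r, p = 0.4, 1e-6             # illustrative correlation and p-value
if p < alpha:
    print('Reject H0: the correlation is statistically significant, r =', r)
else:
    print('Fail to reject H0: no significant correlation detected')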

Assuming a normal distribution

It is desirable that the correlated variables have distributions close to normal, but this is not obligatory. Pearson's r can still be calculated for variables that violate the normality assumption, although the p-value is then less reliable. Attention should also be paid to outliers, which can distort the resulting correlation.
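To see how far the two variables deviate from normality, a quick test can be run (a sketch, assuming df from above; scipy.stats.normaltest is one of several options):

from scipy import stats

for col in ['Insulin', 'BMI']:
    stat, p = stats.normaltest(df[col])   # D'Agostino-Pearson normality test; low p suggests non-normality
    print(col, 'normality test p-value:', round(p, 5))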

A scatter plot is made to spot outliers.

In [7]:
import seaborn as sns

sns.jointplot(x='Insulin', y='BMI', data=df)
Out[7]:
<seaborn.axisgrid.JointGrid at 0x1de283ea940>
In [8]:
sns.kdeplot(df.Insulin, df.BMI)   # bivariate density estimate of Insulin vs BMI
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x1de2afd8240>
 

Deleting extreme values in the 'Insulin' and 'BMI' columns

In [9]:
df['Insulin'] = df['Insulin'].apply(lambda x: np.nan if x > 600 else x)   # treat Insulin > 600 as an outlier
df['BMI'] = df['BMI'].apply(lambda x: np.nan if x > 50 else x)            # treat BMI > 50 as an outlier

df = df.dropna(how='any')
 

Scatter plot after removing the outliers.

In [10]:
sns.jointplot(x='Insulin', y='BMI', data=df)
Out[10]:
<seaborn.axisgrid.JointGrid at 0x1de200c8198>
 

Pearson correlation coefficient

In [11]:
df.dtypes
Out[11]:
Pregnancies                 float64
Glucose                     float64
BloodPressure               float64
SkinThickness               float64
Insulin                     float64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object
In [12]:
import scipy.stats

PKP = scipy.stats.pearsonr(df['Insulin'], df['BMI'])   # (correlation coefficient, p-value)
PKP = np.round(PKP, decimals=5)
print(PKP)
[0.27494 0.     ]
 

There is a positive correlation between patients' blood insulin level and BMI (body mass index). The correlation is statistically significant because the p-value is lower than 0.01, so the null hypothesis is rejected in favor of the alternative hypothesis.

 

Scatter plot with a fitted regression line for the correlation

In [13]:
import matplotlib.pyplot as plt

scat2 = sns.regplot(x='Insulin', y='BMI', data=df)
plt.xlabel('Blood insulin level')
plt.ylabel('BMI level')
plt.title('Scatterplot for blood insulin levels and BMI')
Out[13]:
Text(0.5, 1.0, 'Scatterplot for blood insulin levels and BMI')
 

Forecasting based on Pearson’s correlation coefficient

Knowing the BMI, we can to some extent predict the blood insulin level, and vice versa. The correlation here is r = 0.275.
Squaring the correlation of 0.275 gives r² = 0.0756.

So about 7.6% of the variance in one variable can be predicted from the other, which means that about 92.4% of the variance is not explained by the Pearson correlation coefficient.
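The coefficient of determination can be computed directly from the result above (a sketch; PKP is the tuple returned by pearsonr in In [12]):

r = PKP[0]                  # correlation coefficient, about 0.275
r_squared = r ** 2          # coefficient of determination, about 0.0756
print('r^2 =', round(r_squared, 4))
print('unexplained share =', round(1 - r_squared, 4))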