The Pearson correlation coefficient measures the strength of the linear relationship between two continuous variables.
Data source: https://www.kaggle.com/saurabh00007/diabetescsv
import pandas as pd
df = pd.read_csv('c:/1/diabetes.csv')
df.head(3)
The aim of the study is to check if (H1) there is a statistically significant positive correlation between blood pressure and body weight of patients.
- H0 – there is no statistically significant relationship between body weight and blood pressure
- H1 – there is a statistically significant relationship between body weight and blood pressure
Data completeness check
Some measurements, such as blood pressure or skin thickness, are recorded as zero. These zeros are treated as missing data.
df.isnull().sum()
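Because `isnull()` does not flag zeros, it can help to count them directly. The following is a minimal sketch on a small made-up table (the `toy` DataFrame and its values are hypothetical and only borrow a few column names from the dataset):

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183, 89],
    'BloodPressure': [72, 66, 0, 0],
    'BMI': [33.6, 26.6, 23.3, 0.0],
})

# isnull() reports nothing missing, because zeros are ordinary numbers.
null_counts = toy.isnull().sum()
print(null_counts)

# Counting zeros per column reveals the hidden gaps.
zero_counts = (toy == 0).sum()
print(zero_counts)
```

The same `(df == 0).sum()` expression could be run on the real data to see which columns are affected.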
There are no null values, but missing measurements are encoded as zeros. I replace these zeros with NaN.
import numpy as np
# Columns in which a zero is treated as a missing measurement
cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols] = df[cols].replace(0, np.nan)
print('Number of rows and columns: ',df.shape)
df.isnull().sum()
Deleting records with missing data.
df = df.dropna(how ='any')
print('Number of rows and columns: ',df.shape)
df.isnull().sum()
The dataset is now much smaller, but of better quality.
Pearson correlation coefficient
The Pearson correlation coefficient [1] measures the linear relationship between two datasets. The calculation of the p-value relies on the assumption that each dataset is normally distributed. (See Kowalski [3] for a discussion of the effects of non-normality of the input on the distribution of the correlation coefficient.) Like other correlation coefficients, this one varies between -1 and +1 with 0 implying no correlation.
The scipy.stats.pearsonr function returns a pair of values: the correlation coefficient, which ranges from -1 to +1, and the corresponding p-value.
The null hypothesis is that the two variables are uncorrelated. The p-value is a number between zero and one: the probability of obtaining a correlation at least as strong as the one observed if the null hypothesis were true.
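To make the two returned values concrete, here is a minimal sketch that computes Pearson's r and its two-sided p-value by hand and compares them with `scipy.stats.pearsonr`. The data are synthetic, and the helper name `pearson_r` is an assumption introduced for illustration. The p-value uses the standard t-statistic with n - 2 degrees of freedom, which assumes normally distributed inputs.

```python
import numpy as np
from scipy import stats

# Synthetic data with a weak positive linear relationship (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.3 * x + rng.normal(size=200)

def pearson_r(a, b):
    """Pearson's r: covariance of a and b divided by the product of their spreads."""
    ac = a - a.mean()
    bc = b - b.mean()
    return (ac * bc).sum() / np.sqrt((ac ** 2).sum() * (bc ** 2).sum())

r_manual = pearson_r(x, y)

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom.
n = len(x)
t_stat = r_manual * np.sqrt((n - 2) / (1 - r_manual ** 2))
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

r_scipy, p_scipy = stats.pearsonr(x, y)
print('manual:', r_manual, p_manual)
print('scipy: ', r_scipy, p_scipy)
```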
p-value
A low p-value (such as 0.01) is considered evidence that the null hypothesis can be rejected. Statisticians say that such a result is "very significant" or that "the data are significant at the 0.01 level".
A careful researcher studying a hypothesized relationship sets the significance level before collecting data. Usually 0.01 or 0.05 is used. If the test yields a p-value below this predetermined level, the researcher calls the result statistically significant, rejects the null hypothesis, and concludes that the data support the relationship.
In the case of the relationship between alcohol consumption and breast cancer, the correlation coefficient is about 0.4, with a very low p-value, almost equal to zero. This tells us that the relationship is statistically significant (because the p-value is less than 0.05 or 0.01).
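The meaning of a fixed threshold can be illustrated with a small simulation. The sketch below repeatedly draws two independent (and therefore uncorrelated) synthetic samples and counts how often the p-value falls below 0.05 anyway. The sample size, number of trials, and seed are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha = 0.05      # threshold chosen before looking at the data
n_trials = 2000   # number of simulated studies
n_obs = 50        # observations per study

false_positives = 0
for _ in range(n_trials):
    x = rng.normal(size=n_obs)
    y = rng.normal(size=n_obs)  # independent of x, so the null hypothesis is true
    _, p_value = stats.pearsonr(x, y)
    if p_value < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_trials
print(f'Share of "significant" results under H0: {false_positive_rate:.3f}')
```

The share should come out close to the chosen threshold: when the null hypothesis is true, roughly 5% of studies still cross the 0.05 line by chance.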
Assuming a normal distribution
It is desirable that the correlated variables have distributions close to normal. However, this is not obligatory: Pearson's r can also be calculated for variables that violate the normality assumption, although the p-value is then less reliable. Attention should also be paid to outliers, which can distort the correlation.
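How strongly a single outlier can distort the result is easy to show on synthetic data. In this sketch two independent samples are correlated before and after one extreme point is appended. The point (20, 20) is a hypothetical measurement error chosen for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.normal(size=50)  # independent of x, so r should be near zero

r_without, _ = stats.pearsonr(x, y)

# Append one extreme observation to both variables.
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)
r_with, _ = stats.pearsonr(x_out, y_out)

print(f'r without outlier: {r_without:.3f}')
print(f'r with outlier:    {r_with:.3f}')
```

One extreme observation is enough to turn an essentially uncorrelated sample into one with a high coefficient, which is why the scatter plots are worth inspecting before computing r.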
A scatter plot made to capture outliers
import seaborn as sns
sns.jointplot(x='Insulin', y='BMI', data=df)
sns.kdeplot(x=df.Insulin, y=df.BMI)
Deleting extreme values in the 'Insulin' and 'BMI' columns
# Keep only the rows with Insulin at most 600 and BMI at most 50
df = df[(df['Insulin'] <= 600) & (df['BMI'] <= 50)]
Scatter plot after truncating outliers.
sns.jointplot(x='Insulin', y='BMI', data=df)
Pearson correlation coefficient
df.dtypes
from scipy import stats
# pearsonr returns the correlation coefficient and the two-sided p-value
PKP = stats.pearsonr(df['Insulin'], df['BMI'])
PKP = np.round(PKP, decimals=5)
print(PKP)
There is a positive correlation between the patients' blood insulin level and BMI (body mass index). Because the p-value is less than 0.01, the correlation is statistically significant, so the null hypothesis of no relationship is rejected in favor of the alternative hypothesis.
Scatter plot with regression line
import matplotlib.pyplot as plt
scat2 = sns.regplot(x='Insulin', y='BMI', data=df)
plt.xlabel('Blood insulin level')
plt.ylabel('BMI level')
plt.title('Scatterplot for blood insulin levels and BMI')
Forecasting based on Pearson's correlation coefficient
Knowing the BMI, we can predict the blood insulin level, and vice versa, but only to a limited extent. The correlation coefficient here is 0.275.
Squaring the correlation coefficient gives the coefficient of determination: r2 = 0.275 * 0.275 = 0.0756.
So one variable explains only about 7.6% of the variance in the other, which means that about 92.4% of the variance cannot be predicted from this linear relationship.
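The arithmetic, and the reason r squared is read as "explained variance", can be checked with a short sketch. The first part squares the coefficient reported above. The second part uses synthetic data (an assumption, not the diabetes dataset) to show that the squared correlation equals the share of variance captured by a simple linear fit.

```python
import numpy as np
from scipy import stats

# Part 1: the value reported in the text.
r = 0.275
r_squared = r ** 2
print(f'r^2 = {r_squared:.4f}, unexplained share = {1 - r_squared:.4f}')

# Part 2: synthetic data, for illustration only.
rng = np.random.default_rng(7)
x = rng.normal(size=300)
y = 0.3 * x + rng.normal(size=300)

fit = stats.linregress(x, y)
predicted = fit.intercept + fit.slope * x

ss_res = ((y - predicted) ** 2).sum()   # variance left over after the fit
ss_tot = ((y - y.mean()) ** 2).sum()    # total variance of y
explained = 1 - ss_res / ss_tot

print(f'explained variance: {explained:.4f}')
print(f'squared r:          {fit.rvalue ** 2:.4f}')
```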