Feel free to read the code on GitHub
A dendrogram is used to estimate the number of clusters in a dataset.
It is built from the pairwise distances between the points (rows) of the DataFrame.
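To make the role of distance concrete, here is a minimal sketch on four made-up points (the `points` array is an assumption for illustration and is not taken from the diabetes data). It shows the linkage matrix that `shc.dendrogram` draws: each row is one merge, and the third column is the distance at which the merge happens.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Four hypothetical 2-D points: two tight pairs that lie far from each other.
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# Each row of Z describes one merge: [cluster_a, cluster_b, distance, new_size].
Z = shc.linkage(points, method='ward')

# The merge distances are the heights of the horizontal bars in the dendrogram.
merge_distances = Z[:, 2]
print(merge_distances)  # approximately 1, 1 and 14.14
```

The two pairs merge at a small distance, while the final merge happens much higher up. A large jump like this is what suggests a natural number of clusters.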
import scipy.cluster.hierarchy as shc
import pandas as pd
import matplotlib.pyplot as plt
Clinical tests
Source of data: https://www.kaggle.com/saurabh00007/diabetescsv
df = pd.read_csv('c:/1/diabetes.csv')
df.head()
df.shape
Test for attractiveness of women
The population consists of 768 women described by 9 variables (columns). For the first plot we take two variables: 'BMI' and 'Age'.
Body mass index (BMI) is a value derived from the mass (weight) and height of a person. The BMI is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m², resulting from mass in kilograms and height in metres. Source: https://en.wikipedia.org/wiki/Body_mass_index
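As a quick worked example of that definition, the following sketch computes BMI for one person. The function name `bmi` and the sample values (70 kg, 1.75 m) are assumptions for illustration; the dataset already contains a precomputed 'BMI' column.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: mass divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return mass_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
example = bmi(70.0, 1.75)
print(round(example, 1))  # 22.9
```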
plt.figure(figsize=(17, 4), dpi=280)
plt.title("Dendrogram of female population in BMI and Age", fontsize=22, alpha=0.5)
dend = shc.dendrogram(shc.linkage(df[['BMI', 'Age']], method='ward'), labels=df.Outcome.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
We count the lowest branches of the dark blue part of the tree, that is, the branches that cross the color threshold of 100. There are 6 of them, so we use 6 clusters.
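The branch count can also be checked numerically instead of by eye. The sketch below cuts a Ward linkage at a fixed height with `shc.fcluster` and counts the resulting flat clusters. It runs on synthetic data (three artificial blobs) and uses a hypothetical `cut_height` of 50, so the numbers are illustrative only; for the diabetes data one would pass the linkage of `df[['BMI', 'Age']]` and the threshold used in the plot.

```python
import numpy as np
import scipy.cluster.hierarchy as shc

# Synthetic stand-in data: three well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [100.0, 0.0], [0.0, 100.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(20, 2)) for c in centers])

Z = shc.linkage(points, method='ward')

# Cut the tree at a chosen height (hypothetical value for this toy data).
cut_height = 50.0
flat_labels = shc.fcluster(Z, t=cut_height, criterion='distance')

# Number of flat clusters equals the number of branches crossing the cut.
n_clusters = len(np.unique(flat_labels))
print(n_clusters)  # 3 for this toy data
```

Equivalently, the count is one more than the number of merges whose distance exceeds the cut height.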
PKP = df[['BMI', 'Age']]
PKP.head()
Clustering population by BMI and Age
from sklearn.cluster import AgglomerativeClustering
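# Ward linkage always uses Euclidean distance, so no distance argument is needed
# (the 'affinity' parameter was removed from recent scikit-learn releases).
cluster = AgglomerativeClustering(n_clusters=6, linkage='ward')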
cluster.fit_predict(PKP)
The cluster labels range from 0 to 5.
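One way to confirm the label range and see how many points fall into each cluster is sketched below. It uses randomly generated stand-in data (`points` is an assumption, not the diabetes table); with the real data, `PKP` would take its place.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the two-column table: 60 random 2-D points.
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))

model = AgglomerativeClustering(n_clusters=6, linkage='ward')
labels = model.fit_predict(points)

# Distinct labels and the number of points assigned to each of them.
unique_labels = np.unique(labels)
cluster_sizes = np.bincount(labels)
print(unique_labels)
print(cluster_sizes)
```

The label values carry no order or meaning by themselves; they only identify which group a point belongs to.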
plt.figure(figsize=(10, 7))
plt.scatter(PKP['BMI'], PKP['Age'], c=cluster.labels_, cmap='rainbow')
plt.title('Clustering population by BMI and Age', fontsize=22, alpha=0.5)
plt.xlabel('BMI', fontsize=22, alpha=0.5)
plt.ylabel('Age', fontsize=22, alpha=0.5)
plt.figure(figsize=(17, 4), dpi=280)
plt.title("Dendrogram of female population in DiabetesPedigreeFunction and Insulin", fontsize=22, alpha=0.5)
dend = shc.dendrogram(shc.linkage(df[['DiabetesPedigreeFunction', 'Insulin']], method='ward'), labels=df.Outcome.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
PGK = df[['DiabetesPedigreeFunction', 'Insulin']]
from sklearn.cluster import AgglomerativeClustering
# As above, 'affinity' is omitted because Ward linkage implies Euclidean distance.
cluster = AgglomerativeClustering(n_clusters=7, linkage='ward')
cluster.fit_predict(PGK)
plt.figure(figsize=(10, 7))
plt.scatter(PGK['DiabetesPedigreeFunction'], PGK['Insulin'], c=cluster.labels_, cmap='rainbow')
plt.title('Clustering population by DiabetesPedigreeFunction and Insulin', fontsize=22, alpha=0.5)
plt.xlabel('DiabetesPedigreeFunction', fontsize=22, alpha=0.5)
plt.ylabel('Insulin', fontsize=22, alpha=0.5)

