clustering - THE DATA SCIENCE LIBRARY
http://sigmaquality.pl/tag/clustering/
Wojciech Moszczyński
Mon, 13 Dec 2021 17:12:01 +0000

Dendrogram and clustering 3d
https://sigmaquality.pl/data-plots/dendrogram-and-clustering-3d/
Fri, 25 Oct 2019 19:56:00 +0000

The article Dendrogram and clustering 3d comes from THE DATA SCIENCE LIBRARY.

]]>
In [1]:
import scipy.cluster.hierarchy as shc
import pandas as pd
import matplotlib.pyplot as plt

# Import Data
df = pd.read_csv('c:/1/USArrests.csv')

USArrests

Source of data: https://www.kaggle.com/deepakg/usarrests

In [2]:
df.rename(columns = {'Unnamed: 0': 'State'}, inplace=True)
df.head(4)
Out[2]:
State Murder Assault UrbanPop Rape
0 Alabama 13.2 236 58 21.2
1 Alaska 10.0 263 48 44.5
2 Arizona 8.1 294 80 31.0
3 Arkansas 8.8 190 50 19.5
In [3]:
# Plot
plt.figure(figsize=(17, 4), dpi= 280)  
plt.title("USArrests Dendrogram", fontsize=22)
dend = shc.dendrogram(shc.linkage(df[['Murder', 'Assault', 'UrbanPop', 'Rape']], method='ward'), labels=df.State.values, color_threshold=100)  
plt.xticks(fontsize=12)
plt.show()
In [4]:
df3 = pd.read_csv('c:/1/hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv')
df3.head()
Out[4]:
CustomerID Genre Age Annual Income (k$) Spending Score (1-100)
0 1 Male 19 15 39
1 2 Male 21 15 81
2 3 Female 20 16 6
3 4 Female 23 16 77
4 5 Female 31 17 40

The table lists each customer's gender, age, annual income, and spending score. From the DataFrame we take two columns as coordinates: annual income (k$) and spending score (on a scale of 1 to 100).

In [5]:
data = df3.iloc[:, 3:5].values
data
Out[5]:
array([[ 15,  39],
       [ 15,  81],
       [ 16,   6],
       [ 16,  77],
       [ 17,  40],
       [ 17,  76],
       [ 18,   6],
       [ 18,  94],
       [ 19,   3],
       [ 19,  72],
       [ 19,  14],
       [ 19,  99],
       [ 20,  15],
       [ 20,  77],
       [ 20,  13],
       [ 20,  79],
       [ 21,  35],
In [6]:
plt.figure(figsize=(10, 3))
plt.title("Customer Dendrogram")
dend = shc.dendrogram(shc.linkage(data, method='ward'))
plt.show()

The dendrogram suggests five clusters (five branches) of customers. We now fit the clustering model. With five clusters, the output labels run from 0 to 4.
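The same tree that the dendrogram draws can also be cut into a fixed number of flat clusters directly in SciPy. The following is a minimal sketch, not part of the original notebook: it uses synthetic points as a stand-in for the income and spending array, and the names `points`, `flat`, and `sizes` are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in for the (income, spending score) array: five blobs of 30 points.
rng = np.random.default_rng(0)
centers = np.array([[20, 20], [20, 80], [55, 50], [90, 20], [90, 80]])
points = np.vstack([c + rng.normal(scale=3.0, size=(30, 2)) for c in centers])

Z = linkage(points, method='ward')              # same linkage the dendrogram is drawn from
flat = fcluster(Z, t=5, criterion='maxclust')   # cut the tree into at most 5 clusters

n_found = len(np.unique(flat))
sizes = np.bincount(flat)[1:]                   # fcluster labels start at 1, not 0
print(n_found, sizes)
```

Note that `fcluster` numbers its clusters from 1, while scikit-learn's `fit_predict` numbers them from 0, so the two label arrays are not directly comparable value by value.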

In [7]:
from sklearn.cluster import AgglomerativeClustering

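# Ward linkage always uses Euclidean distance, so the 'affinity' argument
# (renamed to 'metric' and later removed in newer scikit-learn) is omitted.
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')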
cluster.fit_predict(data)
Out[7]:
array([4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3,
       4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 3, 4, 1,
       4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2,
       1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2,
       0, 2, 0, 2, 0, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
       0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2,
       0, 2], dtype=int64)
In [8]:
plt.figure(figsize=(10, 7))
plt.scatter(data[:,0], data[:,1], c=cluster.labels_, cmap='rainbow')
plt.title('CUSTOMERS CLUSTERINGS')
plt.xlabel('Annual earnings')
plt.ylabel('Spending')
Out[8]:
Text(0, 0.5, 'Spending')

The purple cluster (lower right corner) contains customers with high earnings but low spending. Customers in the middle (blue data points) have average income and average spending. The largest number of customers belongs to this category.
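Reading cluster character off the colors of a scatter plot can be backed up with a per-cluster summary table. Below is a small sketch under stated assumptions: the six rows and their `cluster` labels are made up for illustration, and only the column names follow the table shown earlier.

```python
import pandas as pd

# Made-up mini-sample; in the notebook the labels would come from cluster.labels_.
customers = pd.DataFrame({
    'Annual Income (k$)': [15, 16, 60, 62, 120, 125],
    'Spending Score (1-100)': [39, 81, 50, 48, 10, 15],
    'cluster': [0, 1, 2, 2, 3, 3],
})

# One row per cluster: size, mean income, mean spending score.
profile = customers.groupby('cluster').agg(
    n=('Annual Income (k$)', 'size'),
    mean_income=('Annual Income (k$)', 'mean'),
    mean_spending=('Spending Score (1-100)', 'mean'),
)
print(profile)
```

A table like this makes statements such as "high earnings but low spending" checkable against the cluster means instead of relying on the plot alone.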

Clinical tests

Source of data: https://www.kaggle.com/saurabh00007/diabetescsv

In [21]:
df3 = pd.read_csv('c:/1/diabetes.csv')
df3.head()
Out[21]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [22]:
PKP = df3[['Age','SkinThickness','BMI']]
In [23]:
PKP.head()
Out[23]:
Age SkinThickness BMI
0 50 35 33.6
1 31 29 26.6
2 32 0 23.3
3 21 23 28.1
4 33 35 43.1
The dendrogram suggests how many clusters to use.
In [24]:
plt.figure(figsize=(17, 4), dpi= 280)  
plt.title("Dendrogram: Age, SkinThickness, BMI")
dend = shc.dendrogram(shc.linkage(PKP, method='ward'))
The dendrogram suggests 5 clusters.
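Ward linkage works on Euclidean distances, so a variable with a wider numeric range contributes more to the tree. The original notebook clusters the raw values; the sketch below, which is an addition and not the author's method, shows how the merge heights change when each column is standardized first. It uses the five rows of (Age, SkinThickness, BMI) printed above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.stats import zscore

# The five rows (Age, SkinThickness, BMI) shown in the table above.
X = np.array([[50, 35, 33.6],
              [31, 29, 26.6],
              [32,  0, 23.3],
              [21, 23, 28.1],
              [33, 35, 43.1]])

X_std = zscore(X, axis=0)              # each column now has mean 0 and standard deviation 1
Z_raw = linkage(X, method='ward')
Z_std = linkage(X_std, method='ward')

print(Z_raw[:, 2])   # merge heights on the raw scale
print(Z_std[:, 2])   # merge heights after standardization
```

Whether to standardize is a modelling choice; it matters most when the variables are measured in unrelated units, as they are here.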
In [26]:
fig = plt.figure()
ax = fig.add_subplot(projection='3d')  # Axes3D was never imported; this creates the same 3D axes
ax.scatter(PKP['Age'], PKP['SkinThickness'], PKP['BMI'], color='black',marker='o')

ax.set_title('Clusters', fontsize= 30, alpha=0.6)
ax.set_xlabel('Age', fontsize= 20, alpha=0.6)
ax.set_ylabel('SkinThickness', fontsize= 20, alpha=0.6)
ax.set_zlabel('BMI', fontsize= 20, alpha=0.6)
Out[26]:
Text(0.5, 0, 'BMI')
In [27]:
from sklearn.cluster import AgglomerativeClustering

# 'affinity' is omitted: Ward linkage always uses Euclidean distance
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
KF = cluster.fit_predict(PKP)
KF
Out[27]:
array([3, 0, 4, 0, 1, 4, 0, 4, 3, 2, 4, 4, 2, 3, 3, 4, 1, 4, 1, 1, 1, 2,
       2, 1, 3, 3, 2, 0, 3, 4, 3, 1, 0, 4, 3, 1, 4, 3, 1, 3, 0, 4, 3, 3,
       4, 1, 4, 0, 0, 0, 0, 0, 0, 3, 1, 0, 1, 1, 2, 1, 0, 4, 4, 0, 2, 0,
       3, 2, 0, 0, 0, 1, 2, 0, 0, 0, 2, 0, 4, 0, 0, 0, 3, 0, 4, 0, 1, 0,
       3, 0, 4, 0, 1, 2, 0, 3, 0, 0, 0, 1, 4, 4, 4, 0, 4, 0, 4, 3, 0, 0,
       0, 3, 0, 4, 1, 2, 4, 4, 0, 0, 1, 1, 0, 2, 4, 1, 0, 1, 3, 2, 0, 4,
       1, 3, 0, 0, 0, 0, 4, 0, 2, 3, 0, 2, 0, 0, 1, 1, 2, 0, 1, 4, 3, 1,
       2, 1, 0, 0, 0, 1, 1, 1, 0, 0, 4, 3, 0, 4, 4, 0, 4, 0, 0, 1, 0, 1,
       2, 1, 2, 4, 4, 0, 0, 4, 4, 3, 3, 1, 1, 0, 4, 1, 4, 4, 3, 1, 4, 0,
       1, 0, 0, 4, 0, 0, 3, 0, 3, 2, 0, 3, 0, 1, 3, 0, 1, 1, 1, 0, 0, 2,
       0, 2, 4, 3, 0, 0, 4, 1, 1, 0, 4, 1, 0, 4, 0, 4, 3, 0, 0, 4, 0, 0,
       4, 0, 0, 3, 2, 1, 1, 0, 2, 4, 0, 0, 0, 1, 1, 0, 0, 3, 0, 4, 0, 3,
       4, 3, 4, 1, 4, 4, 1, 0, 4, 1, 2, 1, 0, 0, 2, 0, 4, 3, 0, 2, 2, 3,
       1, 1, 0, 1, 0, 0, 1, 1, 2, 0, 1, 0, 3, 2, 4, 0, 1, 4, 4, 0, 3, 0,
       0, 0, 1, 1, 0, 0, 3, 0, 0, 4, 1, 2, 0, 0, 0, 1, 0, 0, 0, 4, 1, 1,
       3, 0, 2, 4, 0, 1, 2, 2, 1, 2, 0, 0, 0, 4, 2, 3, 0, 4, 0, 3, 4, 4,
       3, 0, 4, 2, 1, 3, 3, 1, 1, 2, 3, 2, 0, 0, 4, 0, 0, 3, 1, 0, 0, 1,
       1, 3, 0, 1, 4, 1, 0, 0, 0, 0, 0, 0, 3, 1, 3, 0, 3, 4, 0, 0, 4, 0,
       1, 1, 4, 0, 4, 2, 1, 3, 2, 0, 2, 4, 4, 1, 1, 0, 0, 0, 1, 0, 0, 3,
       4, 0, 1, 0, 1, 0, 3, 1, 0, 3, 1, 3, 4, 0, 0, 4, 0, 4, 3, 4, 0, 4,
       3, 0, 0, 4, 0, 1, 0, 1, 1, 0, 0, 4, 0, 2, 0, 3, 2, 0, 3, 3, 3, 4,
       1, 1, 4, 0, 0, 1, 4, 1, 1, 1, 0, 2, 4, 3, 1, 0, 1, 3, 1, 1, 0, 0,
       4, 1, 1, 3, 0, 2, 0, 3, 1, 3, 0, 2, 4, 0, 3, 1, 0, 0, 1, 3, 1, 4,
       3, 0, 0, 2, 3, 0, 2, 4, 0, 0, 3, 2, 2, 2, 0, 0, 0, 2, 4, 0, 0, 0,
       0, 4, 0, 4, 1, 4, 0, 4, 2, 2, 1, 1, 1, 0, 3, 0, 0, 1, 3, 0, 3, 1,
       0, 0, 2, 0, 0, 1, 1, 2, 1, 4, 2, 0, 1, 0, 4, 0, 0, 3, 3, 1, 4, 4,
       0, 0, 0, 1, 0, 4, 4, 3, 1, 0, 3, 2, 3, 0, 2, 4, 3, 4, 1, 1, 2, 0,
       1, 0, 2, 0, 4, 0, 0, 4, 1, 3, 4, 0, 1, 0, 1, 0, 0, 3, 1, 0, 3, 4,
       4, 0, 3, 4, 1, 0, 2, 0, 4, 1, 4, 4, 2, 0, 4, 1, 4, 0, 4, 4, 2, 0,
       0, 0, 0, 4, 2, 4, 0, 0, 0, 1, 1, 0, 0, 0, 1, 4, 0, 0, 0, 1, 2, 0,
       2, 1, 1, 1, 1, 1, 3, 1, 3, 3, 3, 0, 3, 1, 2, 4, 2, 4, 4, 0, 0, 1,
       1, 4, 2, 0, 4, 0, 0, 1, 4, 2, 0, 1, 4, 3, 0, 4, 0, 4, 0, 3, 3, 2,
       0, 0, 0, 0, 2, 0, 0, 3, 1, 0, 4, 1, 1, 3, 1, 3, 0, 1, 1, 3, 2, 1,
       0, 0, 4, 4, 0, 4, 1, 0, 2, 0, 0, 3, 0, 2, 1, 0, 0, 2, 1, 3, 1, 1,
       3, 2, 4, 1, 0, 1, 3, 1, 1, 2, 4, 2, 0, 3, 4, 3, 0, 0, 2, 0],
      dtype=int64)
In [28]:
from sklearn.cluster import KMeans  # missing import in this notebook

# Initializing KMeans
kmeans = KMeans(n_clusters=5)
# Fitting with inputs
kmeans = kmeans.fit(PKP)
# Predicting the clusters
labels = kmeans.predict(PKP)
# Getting the cluster centers
C = kmeans.cluster_centers_
In [29]:
C
Out[29]:
array([[25.10138249, 21.01382488, 27.84147465],
       [45.82014388, 32.33093525, 33.90359712],
       [28.86486486,  0.33108108, 29.1527027 ],
       [52.08695652,  1.26086957, 31.24782609],
       [27.02906977, 38.09883721, 38.52732558]])
In [31]:
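fig = plt.figure()
ax = fig.add_subplot(projection='3d')
# Points are colored by the agglomerative labels (KF); red markers are the KMeans centers (C)
ax.scatter(PKP['Age'], PKP['SkinThickness'], PKP['BMI'], c=KF)
ax.scatter(C[:, 0], C[:, 1], C[:, 2], marker='.', c='red', s=1000)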

ax.set_title('Clusters', fontsize= 30, alpha=0.6)
ax.set_xlabel('Age', fontsize= 20, alpha=0.6)
ax.set_ylabel('SkinThickness', fontsize= 20, alpha=0.6)
ax.set_zlabel('BMI', fontsize= 20, alpha=0.6)
Out[31]:
Text(0.5, 0, 'BMI')
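The last plot mixes two models: the point colors come from agglomerative clustering and the red centers from KMeans. Label numbers from the two models are arbitrary, so they cannot be compared directly. One way to measure how far the two partitions agree is the adjusted Rand index. The sketch below is an addition that uses synthetic 3D blobs rather than the diabetes data; `n_init` and `random_state` are set only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in: three well-separated 3D blobs of 40 points each.
rng = np.random.default_rng(1)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0]])
X = np.vstack([c + rng.normal(size=(40, 3)) for c in centers])

agg_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# 1.0 means identical partitions (up to renaming of labels); values near 0 mean chance agreement.
ari = adjusted_rand_score(agg_labels, km_labels)
print(round(ari, 3))
```

On real data such as the Age, SkinThickness, and BMI table, a low score would indicate that the centers and the colors in the plot describe different groupings.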


]]>
Methods of indication optimal number of clusters: Dendrogram and Elbow Method
https://sigmaquality.pl/uncategorized/dendrogram-and-elbow-method/
Thu, 24 Oct 2019 19:19:00 +0000

The article Methods of indication optimal number of clusters: Dendrogram and Elbow Method comes from THE DATA SCIENCE LIBRARY.

]]>
In [1]:

import pandas as pd

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Wholesale customers data

The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

  • FRESH: annual spending (m.u.) on fresh products (Continuous)
  • MILK: annual spending (m.u.) on milk products (Continuous)
  • GROCERY: annual spending (m.u.) on grocery products (Continuous)
  • FROZEN: annual spending (m.u.) on frozen products (Continuous)
  • DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  • DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
  • CHANNEL: customer channels - Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)
  • REGION: customer regions - Lisbon, Oporto or Other (Nominal)
In [2]:
df = pd.read_csv('c:/1/Wholesale customers data.csv')
df.head(2)
Out[2]:
Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
0 2 3 12669 9656 7561 214 2674 1338
1 2 3 7057 9810 9568 1762 3293 1776

We cluster on two variables: 'Fresh' and 'Milk'.

In [3]:
df.dtypes
Out[3]:
Channel             int64
Region              int64
Fresh               int64
Milk                int64
Grocery             int64
Frozen              int64
Detergents_Paper    int64
Delicassen          int64
dtype: object

Dendrogram method

A dendrogram is used to estimate the number of clusters. It is built from the distances between the points (rows) of the DataFrame.
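To see what "built from distances" means in practice, it helps to look at the linkage matrix that the dendrogram is drawn from. The sketch below is an illustration on four hand-picked points, not part of the original notebook. Each row of `Z` records one merge: the two cluster indices joined, the merge height, and the size of the new cluster.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Two tight pairs of points that lie far apart from each other.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

Z = linkage(pts, method='ward')
print(Z)   # n - 1 = 3 rows: [cluster_a, cluster_b, merge_height, new_cluster_size]
```

The two pairs merge first at a small height, and the final merge that joins them happens at a much larger height. That jump in height is what shows up as a long vertical branch in the dendrogram.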

In [4]:
df.columns
Out[4]:
Index(['Channel', 'Region', 'Fresh', 'Milk', 'Grocery', 'Frozen',
       'Detergents_Paper', 'Delicassen'],
      dtype='object')
In [5]:
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(8,2), dpi= 180)  
plt.title("Dendrogram: 'Fresh' and 'Milk'", fontsize=15, alpha=0.5)  
dend = shc.dendrogram(shc.linkage(df[['Fresh', 'Milk']], method='ward'), labels=df.Fresh.values, color_threshold=100)  
plt.xticks(fontsize=12)
plt.show()

Imagine a horizontal line drawn at half the height of the plot. Four branches cross this imaginary line.
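The "line at half height" reading can be reproduced numerically by cutting the tree at half of the largest merge height and counting the resulting clusters. This is a hedged sketch on synthetic data, not the wholesale dataset; the half-height rule itself is a visual heuristic, and the names `half_height` and `n_clusters` are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic stand-in: four blobs of 25 points at the corners of a square.
rng = np.random.default_rng(2)
centers = np.array([[0, 0], [0, 50], [50, 0], [50, 50]])
pts = np.vstack([c + rng.normal(scale=2.0, size=(25, 2)) for c in centers])

Z = linkage(pts, method='ward')
half_height = 0.5 * Z[:, 2].max()                        # the imaginary horizontal line
flat = fcluster(Z, t=half_height, criterion='distance')  # clusters formed below that line
n_clusters = len(np.unique(flat))
print(n_clusters)
```

The number of clusters equals the number of branches that cross the line, so this gives the same answer as counting them by eye.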

Elbow Method

In [6]:
PKS = df[['Fresh', 'Milk']]

MinMaxScaler

In [7]:
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
mms.fit(PKS)
df_transformed = mms.transform(PKS)
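For reference, min-max scaling maps each column to the range 0 to 1 using (x - min) / (max - min). The short check below is an added illustration; it uses the two ('Fresh', 'Milk') rows shown in the table above plus one made-up row, and confirms that the manual formula matches `MinMaxScaler`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# First two rows are from the table above; the third row is made up.
X = np.array([[12669.0, 9656.0],
              [ 7057.0, 9810.0],
              [ 3000.0, 1000.0]])

# Column-wise (x - min) / (max - min)
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
scaled = MinMaxScaler().fit_transform(X)

print(np.allclose(manual, scaled))
```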
In [8]:
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_transformed)
    Sum_of_squared_distances.append(km.inertia_)
In [9]:
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

To determine the optimal number of clusters, we select the value of k at the "elbow", i.e. the point after which the distortion (inertia) starts to decrease roughly linearly. For this data, we conclude that the optimal number of clusters is 5.
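Picking the elbow by eye is subjective. One simple heuristic, offered here as an assumption rather than the author's procedure, is to take the k where the slope of the inertia curve changes most sharply, measured by the second difference. The inertia values below are made up to resemble an elbow curve.

```python
import numpy as np

# Made-up inertia values for k = 1..8; in the notebook they would be Sum_of_squared_distances.
ks = np.arange(1, 9)
inertia = np.array([100.0, 80.0, 60.0, 40.0, 20.0, 17.0, 15.0, 14.0])

# Second difference at each interior k: large where a steep drop turns into a flat tail.
second_diff = inertia[:-2] - 2 * inertia[1:-1] + inertia[2:]
elbow_k = ks[1:-1][np.argmax(second_diff)]
print(elbow_k)
```

On noisy real curves this heuristic can disagree with a visual reading, so it is best used as a cross-check alongside the plot.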

Clustering population by Fresh and Milk

In [10]:
from sklearn.cluster import AgglomerativeClustering

# 'affinity' is omitted: Ward linkage always uses Euclidean distance
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(PKS)
Out[10]:
array([4, 1, 4, 4, 2, 4, 4, 4, 4, 1, 1, 4, 2, 2, 2, 4, 1, 4, 2, 4, 2, 4,
       2, 3, 2, 2, 4, 4, 1, 0, 2, 1, 2, 2, 1, 1, 2, 4, 1, 0, 2, 2, 4, 1,
       4, 1, 1, 3, 4, 1, 4, 1, 0, 1, 2, 4, 1, 1, 4, 4, 4, 3, 4, 1, 4, 1,
       1, 2, 1, 4, 2, 2, 4, 2, 4, 2, 1, 4, 4, 1, 4, 1, 4, 2, 4, 3, 3, 0,
       4, 2, 4, 4, 1, 4, 1, 1, 1, 1, 1, 4, 4, 1, 1, 0, 4, 2, 1, 1, 1, 1,
       4, 4, 2, 4, 2, 4, 4, 4, 2, 4, 2, 1, 4, 4, 0, 0, 2, 2, 1, 0, 4, 1,
       4, 4, 4, 4, 4, 1, 4, 4, 2, 2, 2, 4, 2, 2, 4, 4, 4, 2, 2, 1, 2, 1,
       1, 1, 4, 2, 1, 1, 1, 4, 4, 1, 4, 4, 4, 1, 4, 4, 1, 1, 1, 1, 1, 1,
       0, 4, 4, 1, 4, 0, 1, 3, 1, 4, 1, 1, 4, 1, 2, 4, 4, 1, 4, 2, 2, 1,
       4, 4, 1, 1, 2, 1, 1, 1, 4, 1, 1, 1, 2, 1, 4, 1, 1, 1, 1, 2, 1, 4,
       4, 4, 4, 1, 4, 4, 2, 4, 1, 4, 4, 1, 2, 1, 4, 1, 4, 2, 4, 0, 2, 2,
       2, 4, 4, 1, 4, 4, 2, 4, 1, 1, 4, 2, 4, 2, 4, 4, 0, 0, 4, 4, 2, 1,
       1, 1, 1, 2, 4, 2, 4, 1, 1, 0, 1, 1, 2, 4, 4, 2, 1, 4, 0, 2, 0, 0,
       4, 4, 2, 0, 1, 4, 1, 1, 2, 4, 2, 4, 4, 1, 2, 1, 1, 1, 1, 1, 1, 2,
       4, 1, 4, 2, 1, 4, 4, 1, 4, 1, 4, 1, 1, 4, 2, 4, 2, 2, 4, 1, 2, 4,
       4, 4, 2, 4, 2, 2, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
       1, 1, 4, 1, 2, 1, 1, 1, 2, 4, 1, 4, 1, 4, 4, 1, 2, 4, 0, 2, 1, 2,
       4, 4, 1, 0, 1, 4, 2, 2, 2, 1, 1, 4, 1, 2, 4, 4, 1, 1, 1, 2, 4, 4,
       1, 4, 4, 4, 4, 2, 2, 2, 2, 4, 2, 1, 4, 4, 4, 1, 1, 4, 4, 4, 1, 4,
       1, 4, 4, 2, 2, 2, 2, 4, 4, 2, 1, 4, 1, 4, 2, 1, 2, 2, 0, 4, 4, 1],
      dtype=int64)
In [11]:
plt.figure(figsize=(10, 7))
plt.scatter(PKS['Fresh'], PKS['Milk'], c=cluster.labels_, cmap='inferno')
plt.title('Clustering population by Fresh and Milk', fontsize=22, alpha=0.5)
plt.xlabel('Fresh', fontsize=22, alpha=0.5)
plt.ylabel('Milk', fontsize=22, alpha=0.5)
Out[11]:
Text(0, 0.5, 'Milk')


]]>
Perfect plots: Dendrogram
https://sigmaquality.pl/data-plots/perfect-plots_-dendrogram/
Wed, 23 Oct 2019 19:32:00 +0000

The article Perfect plots: Dendrogram comes from THE DATA SCIENCE LIBRARY.

]]>
Feel free to read the code on GitHub

A dendrogram is used to estimate the number of clusters.
It is built from the distances between the points (rows) of the DataFrame.

In [ ]:
import scipy.cluster.hierarchy as shc
import pandas as pd
import matplotlib.pyplot as plt
 

Clinical tests

Source of data: https://www.kaggle.com/saurabh00007/diabetescsv

In [48]:
df = pd.read_csv('c:/1/diabetes.csv')
df.head()
Out[48]:
  Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
In [47]:
df.shape
Out[47]:
(768, 9)
 

Test for attractiveness of women

The population consists of 768 women described by 9 variables. For the first plot we take two variables: 'BMI' and 'Age'.

Body mass index (BMI) is a value derived from the mass (weight) and height of a person. The BMI is defined as the body mass divided by the square of the body height, and is universally expressed in units of kg/m², resulting from mass in kilograms and height in metres. https://en.wikipedia.org/wiki/Body_mass_index
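The definition above translates directly into a one-line computation. The helper below is a small illustration; the function name `bmi` and the example values are assumptions, since the dataset already ships a precomputed BMI column.

```python
def bmi(mass_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: mass divided by the square of height."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return mass_kg / height_m ** 2

# Example: a person of 70 kg and 1.75 m
value = bmi(70.0, 1.75)
print(round(value, 2))
```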

In [58]:
plt.figure(figsize=(17, 4), dpi= 280)  
plt.title("Dendrogram of female population in BMI and Age", fontsize=22, alpha=0.5)  
dend = shc.dendrogram(shc.linkage(df[['BMI', 'Age']], method='ward'), labels=df.Outcome.values, color_threshold=100)  
plt.xticks(fontsize=12)
plt.show()
 

We count the lowest branches of the dark blue tree. There are 6 of them, so we use 6 clusters.

In [26]:
PKP = df[['BMI', 'Age']]
PKP.head()
Out[26]:
  BMI Age
0 33.6 50
1 26.6 31
2 23.3 32
3 28.1 21
4 43.1 33
 

Clustering population by BMI and Age

In [54]:
from sklearn.cluster import AgglomerativeClustering

# 'affinity' is omitted: Ward linkage always uses Euclidean distance
cluster = AgglomerativeClustering(n_clusters=6, linkage='ward')
cluster.fit_predict(PKP)
Out[54]:
array([1, 4, 4, 3, 1, 4, 3, 3, 2, 2, 3, 3, 2, 2, 2, 4, 5, 4, 1, 4, 3, 1,
       1, 4, 1, 1, 1, 0, 2, 4, 2, 3, 0, 0, 4, 4, 4, 1, 3, 2, 3, 1, 2, 1,
       4, 5, 4, 3, 3, 0, 0, 0, 4, 2, 1, 0, 1, 5, 1, 5, 0, 1, 4, 0, 1, 4,
       4, 1, 0, 3, 3, 3, 1, 3, 3, 0, 1, 3, 5, 0, 0, 0, 4, 0, 1, 3, 1, 3,
       1, 0, 0, 4, 1, 2, 0, 1, 3, 0, 3, 5, 3, 0, 0, 0, 3, 3, 0, 4, 3, 3,
       3, 1, 3, 3, 4, 2, 4, 3, 3, 0, 5, 3, 3, 2, 3, 5, 5, 3, 1, 2, 4, 4,
       3, 1, 0, 3, 3, 3, 3, 3, 2, 1, 3, 1, 3, 0, 1, 4, 2, 3, 3, 4, 1, 5,
       1, 1, 0, 0, 3, 1, 4, 1, 5, 3, 4, 4, 3, 4, 3, 4, 4, 3, 3, 5, 4, 4,
       1, 5, 1, 1, 4, 3, 3, 4, 4, 1, 2, 4, 4, 3, 0, 1, 4, 1, 4, 3, 0, 0,
       3, 3, 3, 3, 4, 0, 2, 0, 2, 1, 3, 1, 3, 5, 2, 5, 4, 1, 3, 4, 4, 1,
       3, 2, 4, 2, 0, 3, 3, 3, 3, 5, 5, 1, 0, 3, 3, 5, 1, 5, 4, 0, 3, 3,
       0, 4, 3, 2, 1, 5, 4, 3, 1, 4, 0, 3, 4, 3, 4, 3, 0, 1, 4, 3, 3, 2,
       4, 1, 3, 5, 0, 4, 1, 0, 4, 3, 1, 5, 4, 3, 2, 0, 3, 4, 4, 1, 2, 2,
       3, 5, 0, 3, 3, 3, 5, 5, 2, 3, 4, 3, 1, 2, 3, 3, 4, 5, 4, 3, 2, 0,
       3, 3, 4, 5, 4, 3, 1, 3, 0, 3, 3, 2, 4, 3, 4, 4, 3, 0, 3, 4, 5, 4,
       2, 3, 1, 2, 0, 5, 1, 1, 4, 1, 0, 4, 3, 4, 2, 1, 3, 0, 0, 1, 5, 4,
       1, 0, 5, 2, 3, 1, 1, 3, 3, 2, 2, 2, 3, 3, 4, 0, 3, 1, 5, 0, 3, 3,
       3, 2, 0, 5, 5, 1, 3, 0, 0, 0, 0, 0, 4, 1, 2, 3, 1, 5, 0, 4, 3, 3,
       4, 3, 0, 3, 3, 2, 4, 4, 1, 5, 1, 0, 4, 5, 3, 3, 5, 0, 3, 3, 3, 1,
       0, 0, 5, 0, 5, 3, 1, 3, 0, 4, 5, 1, 0, 4, 3, 4, 4, 5, 1, 3, 0, 3,
       1, 3, 3, 4, 4, 5, 0, 3, 3, 3, 0, 3, 5, 2, 3, 4, 2, 3, 1, 2, 2, 0,
       4, 4, 4, 0, 3, 3, 4, 5, 5, 3, 5, 2, 3, 2, 3, 4, 4, 2, 4, 3, 4, 5,
       5, 5, 5, 1, 0, 2, 3, 1, 4, 1, 0, 2, 4, 3, 2, 4, 0, 3, 1, 1, 1, 4,
       4, 3, 3, 2, 1, 0, 2, 3, 0, 3, 2, 1, 1, 2, 0, 3, 0, 1, 3, 0, 0, 0,
       3, 4, 3, 5, 5, 4, 3, 3, 1, 2, 3, 3, 1, 3, 2, 5, 3, 4, 1, 3, 1, 4,
       3, 3, 2, 3, 3, 4, 3, 2, 1, 4, 2, 5, 3, 4, 3, 0, 5, 1, 4, 4, 1, 0,
       3, 3, 3, 3, 4, 5, 4, 2, 5, 0, 2, 1, 2, 0, 1, 4, 1, 0, 1, 5, 1, 3,
       4, 3, 1, 3, 4, 0, 0, 0, 4, 2, 4, 3, 5, 0, 5, 0, 3, 4, 1, 3, 1, 0,
       4, 0, 2, 3, 3, 0, 1, 5, 3, 3, 0, 3, 1, 0, 4, 3, 0, 3, 4, 4, 2, 3,
       3, 0, 3, 3, 2, 4, 4, 3, 4, 5, 4, 0, 0, 3, 3, 4, 3, 5, 0, 1, 1, 3,
       2, 5, 1, 1, 1, 3, 2, 4, 1, 1, 1, 0, 1, 5, 2, 4, 2, 3, 3, 0, 0, 5,
       5, 3, 2, 3, 0, 4, 0, 1, 4, 1, 5, 1, 0, 1, 4, 0, 3, 5, 3, 2, 1, 1,
       4, 3, 0, 3, 1, 5, 3, 4, 1, 0, 4, 4, 3, 2, 3, 1, 4, 5, 4, 1, 1, 1,
       3, 3, 0, 3, 4, 3, 5, 3, 2, 3, 3, 1, 3, 1, 1, 3, 3, 1, 1, 1, 5, 5,
       4, 2, 3, 3, 0, 5, 1, 4, 1, 1, 3, 2, 3, 1, 4, 2, 3, 4, 1, 3],
      dtype=int64)
 

The cluster labels range from 0 to 5.

In [56]:
plt.figure(figsize=(10, 7))
plt.scatter(PKP['BMI'], PKP['Age'], c=cluster.labels_, cmap='rainbow')
plt.title('Clustering population by BMI and Age', fontsize=22, alpha=0.5)
plt.xlabel('BMI', fontsize=22, alpha=0.5)
plt.ylabel('Age', fontsize=22, alpha=0.5)
Out[56]:
Text(0, 0.5, 'Age')
 

Women in the red cluster are young and overweight. Patients in the violet cluster are young and slim.

Test for predisposition to diabetes

'DiabetesPedigreeFunction': the level of predisposition to diabetes, based on family history.

In [63]:
plt.figure(figsize=(17, 4), dpi= 280)  
plt.title("Dendrogram of female population in DiabetesPedigreeFunction and Insulin", fontsize=22, alpha=0.5)
dend = shc.dendrogram(shc.linkage(df[['DiabetesPedigreeFunction', 'Insulin']], method='ward'), labels=df.Outcome.values, color_threshold=100)  
plt.xticks(fontsize=12)
plt.show()
In [62]:
PGK = df[['DiabetesPedigreeFunction', 'Insulin']]
In [61]:
from sklearn.cluster import AgglomerativeClustering

# 'affinity' is omitted: Ward linkage always uses Euclidean distance
cluster = AgglomerativeClustering(n_clusters=7, linkage='ward')
cluster.fit_predict(PGK)
Out[61]:
array([3, 3, 3, 5, 1, 3, 2, 3, 4, 3, 3, 3, 3, 6, 1, 3, 1, 3, 2, 5, 1, 3,
       3, 3, 5, 5, 3, 5, 5, 3, 3, 1, 2, 3, 3, 1, 3, 3, 3, 1, 2, 3, 3, 1,
       3, 3, 3, 3, 3, 3, 2, 2, 3, 0, 0, 3, 0, 5, 3, 5, 3, 3, 3, 5, 3, 3,
       3, 3, 2, 5, 2, 5, 3, 0, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 5, 3, 2,
       5, 3, 3, 1, 2, 3, 2, 1, 3, 2, 2, 1, 3, 3, 3, 2, 3, 1, 3, 5, 3, 2,
       5, 4, 2, 3, 1, 3, 3, 3, 3, 2, 5, 3, 5, 3, 3, 5, 5, 5, 5, 3, 1, 3,
       1, 3, 2, 5, 2, 2, 3, 0, 3, 3, 2, 3, 0, 3, 3, 5, 3, 3, 1, 3, 1, 4,
       3, 3, 5, 5, 2, 5, 3, 5, 0, 3, 3, 1, 3, 3, 3, 2, 3, 5, 3, 2, 2, 5,
       3, 5, 3, 3, 3, 2, 3, 3, 3, 3, 4, 2, 5, 1, 3, 5, 3, 3, 3, 1, 3, 2,
       5, 0, 3, 3, 3, 2, 1, 3, 0, 3, 2, 3, 3, 3, 3, 5, 1, 0, 5, 5, 3, 3,
       4, 3, 3, 1, 2, 2, 3, 3, 6, 2, 3, 0, 2, 3, 2, 3, 1, 3, 3, 3, 3, 2,
       3, 1, 1, 3, 3, 6, 0, 3, 3, 3, 2, 3, 0, 3, 3, 3, 0, 1, 5, 3, 3, 3,
       3, 2, 3, 3, 3, 3, 3, 2, 3, 2, 3, 2, 3, 5, 3, 0, 3, 5, 1, 3, 3, 5,
       4, 1, 2, 2, 2, 2, 1, 1, 3, 5, 0, 1, 1, 3, 3, 5, 2, 3, 3, 5, 5, 1,
       1, 1, 3, 1, 5, 2, 3, 5, 2, 3, 5, 3, 1, 3, 3, 2, 3, 1, 1, 3, 5, 2,
       3, 2, 3, 3, 2, 0, 3, 3, 1, 3, 5, 2, 3, 3, 3, 5, 2, 3, 2, 3, 3, 3,
       3, 2, 3, 3, 1, 3, 2, 1, 0, 3, 3, 3, 0, 2, 3, 3, 2, 5, 4, 2, 2, 5,
       1, 0, 2, 2, 3, 2, 2, 3, 1, 2, 5, 2, 3, 3, 0, 2, 1, 3, 0, 2, 3, 0,
       5, 3, 3, 3, 3, 3, 2, 3, 3, 1, 3, 3, 3, 4, 3, 1, 0, 2, 1, 4, 3, 3,
       3, 5, 1, 2, 2, 3, 1, 0, 3, 1, 5, 1, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3,
       3, 2, 5, 3, 3, 3, 2, 2, 2, 2, 5, 3, 1, 3, 5, 3, 3, 2, 1, 2, 2, 3,
       2, 3, 3, 5, 2, 5, 3, 5, 3, 3, 3, 3, 3, 3, 1, 5, 2, 3, 0, 3, 2, 5,
       3, 1, 4, 0, 3, 3, 2, 3, 3, 5, 3, 3, 3, 2, 5, 1, 2, 3, 3, 2, 3, 3,
       2, 1, 2, 3, 3, 1, 3, 3, 2, 5, 1, 3, 3, 0, 2, 5, 3, 3, 3, 3, 2, 5,
       1, 3, 5, 3, 2, 3, 2, 3, 3, 3, 1, 1, 1, 1, 3, 2, 2, 1, 1, 1, 2, 3,
       3, 5, 3, 2, 5, 1, 3, 3, 3, 3, 3, 0, 2, 2, 3, 2, 3, 5, 5, 1, 3, 3,
       2, 5, 0, 2, 5, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 1, 3, 3, 5, 3, 5,
       1, 1, 3, 3, 3, 5, 3, 3, 3, 5, 3, 3, 0, 2, 0, 1, 1, 1, 0, 3, 5, 3,
       3, 3, 3, 3, 1, 3, 3, 5, 3, 2, 3, 3, 3, 3, 3, 2, 3, 1, 3, 3, 3, 2,
       2, 2, 5, 3, 3, 3, 1, 0, 5, 1, 5, 3, 5, 5, 2, 3, 5, 4, 2, 1, 3, 2,
       3, 3, 1, 5, 3, 5, 3, 3, 1, 5, 1, 3, 2, 1, 3, 3, 3, 3, 3, 0, 2, 3,
       5, 3, 3, 1, 3, 3, 1, 1, 3, 3, 5, 5, 3, 4, 5, 3, 1, 3, 1, 3, 3, 3,
       5, 3, 3, 0, 3, 1, 0, 3, 3, 0, 3, 0, 1, 3, 1, 3, 3, 1, 5, 5, 3, 3,
       1, 3, 3, 3, 2, 3, 5, 1, 3, 3, 5, 3, 1, 3, 1, 5, 5, 3, 5, 5, 3, 2,
       1, 3, 3, 2, 3, 4, 3, 5, 3, 3, 3, 3, 3, 3, 3, 1, 3, 5, 3, 3],
      dtype=int64)
In [66]:
plt.figure(figsize=(10, 7))
plt.scatter(PGK['DiabetesPedigreeFunction'], PGK['Insulin'], c=cluster.labels_, cmap='rainbow')
plt.title('Clustering population by DiabetesPedigreeFunction and Insulin', fontsize=22, alpha=0.5)
plt.xlabel('DiabetesPedigreeFunction', fontsize=22, alpha=0.5)
plt.ylabel('Insulin', fontsize=22, alpha=0.5)
Out[66]:
Text(0, 0.5, 'Insulin')


]]>