In [1]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
Wholesale customers data
The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
- FRESH: annual spending (m.u.) on fresh products (Continuous)
- MILK: annual spending (m.u.) on milk products (Continuous)
- GROCERY: annual spending (m.u.) on grocery products (Continuous)
- FROZEN: annual spending (m.u.) on frozen products (Continuous)
- DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
- DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
- CHANNEL: customer channel - Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
- REGION: customer region - Lisbon, Oporto or Other (Nominal)
In [2]:
df = pd.read_csv('c:/1/Wholesale customers data.csv')
df.head(2)
Out[2]:
We build clusters on two variables: 'Fresh' and 'Milk'.
In [3]:
df.dtypes
Out[3]:
Dendrogram method
A dendrogram helps estimate the number of clusters. It is built from the distances between the points of the dataframe.
In [4]:
df.columns
Out[4]:
In [5]:
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(8,2), dpi= 180)
plt.title("Dendrogram: 'Fresh' and 'Milk'", fontsize=15, alpha=0.5)
dend = shc.dendrogram(shc.linkage(df[['Fresh', 'Milk']], method='ward'), labels=df.Fresh.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
Imagine a horizontal line drawn halfway up the plot: four branches cross this imagined line, which suggests four clusters.
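The imagined horizontal cut can also be made programmatically with scipy's fcluster, which labels the clusters obtained by cutting the tree at a given distance. A minimal sketch on two synthetic blobs (standing in for the 'Fresh'/'Milk' columns, since the CSV is not bundled here); the threshold 15 is an assumed value chosen for this toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic 2-D blobs standing in for the two columns
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])

Z = linkage(X, method='ward')
# Cutting the tree at distance 15 plays the role of the "invisible line":
# every branch that crosses the cut becomes one cluster
labels = fcluster(Z, t=15, criterion='distance')
print(len(set(labels)))  # number of clusters below the cut
```

On the real dendrogram, the threshold would be read off the y-axis at the height of the imagined line.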
Elbow Method
In [6]:
PKS =df[['Fresh', 'Milk']]
MinMaxScaler
In [7]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms.fit(PKS)
df_transformed = mms.transform(PKS)
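MinMaxScaler rescales each column independently to the range [0, 1] via (x - min) / (max - min), so that 'Fresh' and 'Milk' contribute on the same scale. A quick sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data: min = 2, max = 10, so the range is 8
toy = np.array([[2.0], [4.0], [10.0]])
scaled = MinMaxScaler().fit_transform(toy)
# (2-2)/8 = 0.0, (4-2)/8 = 0.25, (10-2)/8 = 1.0
print(scaled.ravel())
```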
In [8]:
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_transformed)
    Sum_of_squared_distances.append(km.inertia_)
In [9]:
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
To determine the optimal number of clusters, we pick the value of k at the "elbow", i.e. the point after which the distortion (inertia) begins to decrease roughly linearly. For these data, we conclude that the optimal number of clusters is 5.
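The elbow can also be located programmatically: the bend is where the inertia curve has its sharpest kink, which the second difference of the inertia values picks out. A rough sketch on hypothetical inertia values (the shape, steep drop then flattening, is what matters, not the numbers); dedicated libraries such as kneed do this more robustly:

```python
import numpy as np

# Hypothetical inertia values: steep drop, then flattening out
inertia = np.array([100.0, 45.0, 20.0, 10.0, 8.0, 7.0, 6.5])
k_values = np.arange(1, len(inertia) + 1)

# The second difference is largest where the curve bends hardest
second_diff = np.diff(inertia, n=2)
elbow_k = k_values[np.argmax(second_diff) + 1]
print(elbow_k)
```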
Clustering customers by Fresh and Milk
In [10]:
from sklearn.cluster import AgglomerativeClustering
# 'euclidean' is the default metric for ward linkage; the 'affinity'
# keyword was removed in newer scikit-learn versions
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(PKS)
Out[10]:
In [11]:
plt.figure(figsize=(10, 7))
plt.scatter(PKS['Fresh'], PKS['Milk'], c=cluster.labels_, cmap='inferno')
plt.title('Clustering population by Fresh and Milk', fontsize=22, alpha=0.5)
plt.xlabel('Fresh', fontsize=22, alpha=0.5)
plt.ylabel('Milk', fontsize=22, alpha=0.5)
Out[11]:
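As a sanity check, KMeans with the same number of clusters typically recovers a very similar partition on well-separated data. A hedged sketch comparing the two algorithms' labelings with the adjusted Rand index (synthetic blobs stand in for the CSV; a score of 1.0 means the partitions agree up to label permutation):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
# Three well-separated synthetic 2-D blobs
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

agg_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Compare the two partitions; label numbers may differ, the grouping should not
print(adjusted_rand_score(agg_labels, km_labels))
```

On real, overlapping data the two methods can disagree more, which is one reason to check a second algorithm before trusting a cluster count.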
