# Methods of indicating the optimal number of clusters: Dendrogram and Elbow Method

In [1]:

```
import pandas as pd

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
```

### Wholesale customers data

The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

• `FRESH: annual spending (m.u.) on fresh products (Continuous)`
• `MILK: annual spending (m.u.) on milk products (Continuous)`
• `GROCERY: annual spending (m.u.) on grocery products (Continuous)`
• `FROZEN: annual spending (m.u.) on frozen products (Continuous)`
• `DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)`
• `DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)`
• `CHANNEL: customer channels - Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)`
• `REGION: customer regions - Lisbon, Oporto or Other (Nominal)`
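If the CSV is not at hand, a small synthetic frame with the same columns lets you follow along; the numbers below are invented stand-ins, not the real wholesale data:

```python
import pandas as pd

# Hypothetical stand-in for the wholesale dataset; values are made up.
df = pd.DataFrame({
    'Channel': [2, 2, 1],
    'Region': [3, 3, 3],
    'Fresh': [12669, 7057, 6353],
    'Milk': [9656, 9810, 8808],
    'Grocery': [7561, 9568, 7684],
    'Frozen': [214, 1762, 2405],
    'Detergents_Paper': [2674, 3293, 3516],
    'Delicassen': [1338, 1776, 7844],
})
print(df.shape)  # (3, 8)
```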
In [2]:
```
df = pd.read_csv('c:/1/Wholesale customers data.csv')
df.head(2)
```
Out[2]:
|   | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---------|--------|-------|------|---------|--------|------------------|------------|
| 0 | 2 | 3 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 2 | 3 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |

# We build clusters from two variables: 'Fresh' and 'Milk'

In [3]:
```
df.dtypes
```
Out[3]:
```
Channel             int64
Region              int64
Fresh               int64
Milk                int64
Grocery             int64
Frozen              int64
Detergents_Paper    int64
Delicassen          int64
dtype: object
```

## Dendrogram method

A dendrogram can be used to choose the number of clusters. It is built from the distances between the points of the dataframe.

In [4]:
```
df.columns
```
Out[4]:
```
Index(['Channel', 'Region', 'Fresh', 'Milk', 'Grocery', 'Frozen',
       'Detergents_Paper', 'Delicassen'],
      dtype='object')
```
In [5]:
```
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(8,2), dpi=180)
plt.title("Dendrogram: 'Fresh' and 'Milk'", fontsize=15, alpha=0.5)
dend = shc.dendrogram(shc.linkage(df[['Fresh', 'Milk']], method='ward'),
                      labels=df.Fresh.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
```

Imagine a horizontal line drawn at about half the height of the plot. Four branches cross this imaginary line, which suggests four clusters.
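The visual cut can also be done programmatically with `scipy.cluster.hierarchy.fcluster`. A minimal sketch on synthetic two-column data (the blobs below are assumptions standing in for 'Fresh' and 'Milk'):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for 'Fresh' and 'Milk'.
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])

Z = linkage(X, method='ward')
# Cut the tree into a fixed number of flat clusters instead of eyeballing a line;
# criterion='distance' with a height threshold would mimic the imaginary line directly.
labels = fcluster(Z, t=2, criterion='maxclust')
print(np.unique(labels))  # cluster ids 1 and 2
```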

## Elbow Method

In [6]:
```
PKS = df[['Fresh', 'Milk']]
```
##### MinMaxScaler
In [7]:
```
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
mms.fit(PKS)
df_transformed = mms.transform(PKS)
```
In [8]:
```
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_transformed)
    Sum_of_squared_distances.append(km.inertia_)
```
In [9]:
```
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
```

To determine the optimal number of clusters, we select the value of k at the "elbow", i.e. the point after which the distortion/inertia begins to decrease roughly linearly. For the given data, we conclude that the optimal number of clusters is 5.
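When the elbow is ambiguous, the silhouette score offers a complementary criterion (a common alternative, not used in this notebook). A sketch on synthetic blobs, where the "right" k is 4 by construction:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs; not the wholesale data.
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is the preferred cluster count.
best_k = max(scores, key=scores.get)
print(best_k)  # 4
```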

# Clustering population by Fresh and Milk

In [10]:
```
from sklearn.cluster import AgglomerativeClustering

# Ward linkage implies Euclidean distances, so no metric needs to be passed.
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(PKS)
```
Out[10]:
```
array([4, 1, 4, 4, 2, 4, 4, 4, 4, 1, 1, 4, 2, 2, 2, 4, 1, 4, 2, 4, 2, 4,
2, 3, 2, 2, 4, 4, 1, 0, 2, 1, 2, 2, 1, 1, 2, 4, 1, 0, 2, 2, 4, 1,
4, 1, 1, 3, 4, 1, 4, 1, 0, 1, 2, 4, 1, 1, 4, 4, 4, 3, 4, 1, 4, 1,
1, 2, 1, 4, 2, 2, 4, 2, 4, 2, 1, 4, 4, 1, 4, 1, 4, 2, 4, 3, 3, 0,
4, 2, 4, 4, 1, 4, 1, 1, 1, 1, 1, 4, 4, 1, 1, 0, 4, 2, 1, 1, 1, 1,
4, 4, 2, 4, 2, 4, 4, 4, 2, 4, 2, 1, 4, 4, 0, 0, 2, 2, 1, 0, 4, 1,
4, 4, 4, 4, 4, 1, 4, 4, 2, 2, 2, 4, 2, 2, 4, 4, 4, 2, 2, 1, 2, 1,
1, 1, 4, 2, 1, 1, 1, 4, 4, 1, 4, 4, 4, 1, 4, 4, 1, 1, 1, 1, 1, 1,
0, 4, 4, 1, 4, 0, 1, 3, 1, 4, 1, 1, 4, 1, 2, 4, 4, 1, 4, 2, 2, 1,
4, 4, 1, 1, 2, 1, 1, 1, 4, 1, 1, 1, 2, 1, 4, 1, 1, 1, 1, 2, 1, 4,
4, 4, 4, 1, 4, 4, 2, 4, 1, 4, 4, 1, 2, 1, 4, 1, 4, 2, 4, 0, 2, 2,
2, 4, 4, 1, 4, 4, 2, 4, 1, 1, 4, 2, 4, 2, 4, 4, 0, 0, 4, 4, 2, 1,
1, 1, 1, 2, 4, 2, 4, 1, 1, 0, 1, 1, 2, 4, 4, 2, 1, 4, 0, 2, 0, 0,
4, 4, 2, 0, 1, 4, 1, 1, 2, 4, 2, 4, 4, 1, 2, 1, 1, 1, 1, 1, 1, 2,
4, 1, 4, 2, 1, 4, 4, 1, 4, 1, 4, 1, 1, 4, 2, 4, 2, 2, 4, 1, 2, 4,
4, 4, 2, 4, 2, 2, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
1, 1, 4, 1, 2, 1, 1, 1, 2, 4, 1, 4, 1, 4, 4, 1, 2, 4, 0, 2, 1, 2,
4, 4, 1, 0, 1, 4, 2, 2, 2, 1, 1, 4, 1, 2, 4, 4, 1, 1, 1, 2, 4, 4,
1, 4, 4, 4, 4, 2, 2, 2, 2, 4, 2, 1, 4, 4, 4, 1, 1, 4, 4, 4, 1, 4,
1, 4, 4, 2, 2, 2, 2, 4, 4, 2, 1, 4, 1, 4, 2, 1, 2, 2, 0, 4, 4, 1],
dtype=int64)
```
In [11]:
```
plt.figure(figsize=(10, 7))
plt.scatter(PKS['Fresh'], PKS['Milk'], c=cluster.labels_, cmap='inferno')
plt.title('Clustering population by Fresh and Milk', fontsize=22, alpha=0.5)
plt.xlabel('Fresh', fontsize=22, alpha=0.5)
plt.ylabel('Milk', fontsize=22, alpha=0.5)
```
Out[11]:
`Text(0, 0.5, 'Milk')`
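On well-separated data, Ward-linkage agglomerative clustering and k-means usually produce very similar partitions. A sketch comparing the two with the adjusted Rand index (synthetic blobs, not the wholesale set):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three clearly separated synthetic blobs.
X, _ = make_blobs(n_samples=200,
                  centers=[[0, 0], [12, 0], [0, 12]],
                  cluster_std=1.0, random_state=0)

agg_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ARI of 1.0 means identical partitions up to label permutation.
print(adjusted_rand_score(agg_labels, km_labels))  # 1.0 on these clean blobs
```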