# Methods of indicating the optimal number of clusters: Dendrogram and Elbow Method

In [1]:

```
import pandas as pd

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
```

### Wholesale customers data

The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.

• `FRESH: annual spending (m.u.) on fresh products (Continuous)`
• `MILK: annual spending (m.u.) on milk products (Continuous)`
• `GROCERY: annual spending (m.u.) on grocery products (Continuous)`
• `FROZEN: annual spending (m.u.) on frozen products (Continuous)`
• `DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)`
• `DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)`
• `CHANNEL: customer channels - Horeca (Hotel/Restaurant/Cafe) or Retail channel (Nominal)`
• `REGION: customer regions - Lisbon, Oporto or Other (Nominal)`
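If the CSV is not at hand, a small synthetic frame with the same columns lets you follow along; the numbers below are invented stand-ins, not the real wholesale data:

```python
import pandas as pd

# Hypothetical stand-in for the wholesale dataset; values are made up.
df = pd.DataFrame({
    'Channel': [2, 2, 1],
    'Region': [3, 3, 3],
    'Fresh': [12669, 7057, 6353],
    'Milk': [9656, 9810, 8808],
    'Grocery': [7561, 9568, 7684],
    'Frozen': [214, 1762, 2405],
    'Detergents_Paper': [2674, 3293, 3516],
    'Delicassen': [1338, 1776, 7844],
})
print(df.shape)  # (3, 8)
```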
In [2]:
```
df = pd.read_csv('c:/1/Wholesale customers data.csv')
df.head(2)
```
Out[2]:
|   | Channel | Region | Fresh | Milk | Grocery | Frozen | Detergents_Paper | Delicassen |
|---|---------|--------|-------|------|---------|--------|------------------|------------|
| 0 | 2 | 3 | 12669 | 9656 | 7561 | 214 | 2674 | 1338 |
| 1 | 2 | 3 | 7057 | 9810 | 9568 | 1762 | 3293 | 1776 |

# We build clusters from two variables: 'Fresh' and 'Milk'

In [3]:
```
df.dtypes
```
Out[3]:
```
Channel             int64
Region              int64
Fresh               int64
Milk                int64
Grocery             int64
Frozen              int64
Detergents_Paper    int64
Delicassen          int64
dtype: object
```

## Dendrogram method

A dendrogram can be used to choose the number of clusters. It is built from the distances between the points of the dataframe.

In [4]:
```
df.columns
```
Out[4]:
```
Index(['Channel', 'Region', 'Fresh', 'Milk', 'Grocery', 'Frozen',
       'Detergents_Paper', 'Delicassen'],
      dtype='object')
```
In [5]:
```
import scipy.cluster.hierarchy as shc

plt.figure(figsize=(8,2), dpi=180)
plt.title("Dendrogram: 'Fresh' and 'Milk'", fontsize=15, alpha=0.5)
dend = shc.dendrogram(shc.linkage(df[['Fresh', 'Milk']], method='ward'),
                      labels=df.Fresh.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
```

Imagine a horizontal line drawn at about half the height of the plot. Four branches cross this imaginary line, which suggests four clusters.
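The visual cut can also be done programmatically with `scipy.cluster.hierarchy.fcluster`. A minimal sketch on synthetic two-column data (the blobs below are assumptions standing in for 'Fresh' and 'Milk'):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs standing in for 'Fresh' and 'Milk'.
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])

Z = linkage(X, method='ward')
# Cut the tree into a fixed number of flat clusters instead of eyeballing a line;
# criterion='distance' with a height threshold would mimic the imaginary line directly.
labels = fcluster(Z, t=2, criterion='maxclust')
print(np.unique(labels))  # cluster ids 1 and 2
```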

## Elbow Method

In [6]:
```
PKS = df[['Fresh', 'Milk']]
```
##### MinMaxScaler
In [7]:
```
from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
mms.fit(PKS)
df_transformed = mms.transform(PKS)
```
In [8]:
```
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_transformed)
    Sum_of_squared_distances.append(km.inertia_)
```
In [9]:
```
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
```

To determine the optimal number of clusters, we select the value of k at the "elbow", i.e. the point after which the distortion/inertia begins to decrease roughly linearly. For the given data, we conclude that the optimal number of clusters is 5.
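When the elbow is ambiguous, the silhouette score offers a complementary criterion (a common alternative, not used in this notebook). A sketch on synthetic blobs, where the "right" k is 4 by construction:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs; not the wholesale data.
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest mean silhouette is the preferred cluster count.
best_k = max(scores, key=scores.get)
print(best_k)  # 4
```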

# Clustering population by Fresh and Milk

In [10]:
```
from sklearn.cluster import AgglomerativeClustering

# Ward linkage implies Euclidean distances, so no metric needs to be passed.
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(PKS)
```
Out[10]:
```
array([4, 1, 4, 4, 2, 4, 4, 4, 4, 1, 1, 4, 2, 2, 2, 4, 1, 4, 2, 4, 2, 4,
2, 3, 2, 2, 4, 4, 1, 0, 2, 1, 2, 2, 1, 1, 2, 4, 1, 0, 2, 2, 4, 1,
4, 1, 1, 3, 4, 1, 4, 1, 0, 1, 2, 4, 1, 1, 4, 4, 4, 3, 4, 1, 4, 1,
1, 2, 1, 4, 2, 2, 4, 2, 4, 2, 1, 4, 4, 1, 4, 1, 4, 2, 4, 3, 3, 0,
4, 2, 4, 4, 1, 4, 1, 1, 1, 1, 1, 4, 4, 1, 1, 0, 4, 2, 1, 1, 1, 1,
4, 4, 2, 4, 2, 4, 4, 4, 2, 4, 2, 1, 4, 4, 0, 0, 2, 2, 1, 0, 4, 1,
4, 4, 4, 4, 4, 1, 4, 4, 2, 2, 2, 4, 2, 2, 4, 4, 4, 2, 2, 1, 2, 1,
1, 1, 4, 2, 1, 1, 1, 4, 4, 1, 4, 4, 4, 1, 4, 4, 1, 1, 1, 1, 1, 1,
0, 4, 4, 1, 4, 0, 1, 3, 1, 4, 1, 1, 4, 1, 2, 4, 4, 1, 4, 2, 2, 1,
4, 4, 1, 1, 2, 1, 1, 1, 4, 1, 1, 1, 2, 1, 4, 1, 1, 1, 1, 2, 1, 4,
4, 4, 4, 1, 4, 4, 2, 4, 1, 4, 4, 1, 2, 1, 4, 1, 4, 2, 4, 0, 2, 2,
2, 4, 4, 1, 4, 4, 2, 4, 1, 1, 4, 2, 4, 2, 4, 4, 0, 0, 4, 4, 2, 1,
1, 1, 1, 2, 4, 2, 4, 1, 1, 0, 1, 1, 2, 4, 4, 2, 1, 4, 0, 2, 0, 0,
4, 4, 2, 0, 1, 4, 1, 1, 2, 4, 2, 4, 4, 1, 2, 1, 1, 1, 1, 1, 1, 2,
4, 1, 4, 2, 1, 4, 4, 1, 4, 1, 4, 1, 1, 4, 2, 4, 2, 2, 4, 1, 2, 4,
4, 4, 2, 4, 2, 2, 4, 4, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1,
1, 1, 4, 1, 2, 1, 1, 1, 2, 4, 1, 4, 1, 4, 4, 1, 2, 4, 0, 2, 1, 2,
4, 4, 1, 0, 1, 4, 2, 2, 2, 1, 1, 4, 1, 2, 4, 4, 1, 1, 1, 2, 4, 4,
1, 4, 4, 4, 4, 2, 2, 2, 2, 4, 2, 1, 4, 4, 4, 1, 1, 4, 4, 4, 1, 4,
1, 4, 4, 2, 2, 2, 2, 4, 4, 2, 1, 4, 1, 4, 2, 1, 2, 2, 0, 4, 4, 1],
dtype=int64)
```
In [11]:
```
plt.figure(figsize=(10, 7))
plt.scatter(PKS['Fresh'], PKS['Milk'], c=cluster.labels_, cmap='inferno')
plt.title('Clustering population by Fresh and Milk', fontsize=22, alpha=0.5)
plt.xlabel('Fresh', fontsize=22, alpha=0.5)
plt.ylabel('Milk', fontsize=22, alpha=0.5)
```
Out[11]:
`Text(0, 0.5, 'Milk')`
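On well-separated data, Ward-linkage agglomerative clustering and k-means usually produce very similar partitions. A sketch comparing the two with the adjusted Rand index (synthetic blobs, not the wholesale set):

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three clearly separated synthetic blobs.
X, _ = make_blobs(n_samples=200,
                  centers=[[0, 0], [12, 0], [0, 12]],
                  cluster_std=1.0, random_state=0)

agg_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# ARI of 1.0 means identical partitions up to label permutation.
print(adjusted_rand_score(agg_labels, km_labels))  # 1.0 on these clean blobs
```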