In [1]:
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
Wholesale customers data
The dataset refers to clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories.
- FRESH: annual spending (m.u.) on fresh products (Continuous)
- MILK: annual spending (m.u.) on milk products (Continuous)
- GROCERY: annual spending (m.u.) on grocery products (Continuous)
- FROZEN: annual spending (m.u.) on frozen products (Continuous)
- DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
- DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
- CHANNEL: customer channel - Horeca (Hotel/Restaurant/Cafe) or Retail (Nominal)
- REGION: customer region - Lisbon, Oporto or Other (Nominal)
In [2]:
df = pd.read_csv('c:/1/Wholesale customers data.csv')
df.head(2)
Out[2]:
We build clusters on two variables: 'Fresh' and 'Milk'.
In [3]:
df.dtypes
Out[3]:
Dendrogram method
A dendrogram helps estimate the number of clusters. It is built from the distances between the points of the dataframe.
In [4]:
df.columns
Out[4]:
In [5]:
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(8,2), dpi= 180)
plt.title("Dendrogram: 'Fresh' and 'Milk'", fontsize=15, alpha=0.5)
dend = shc.dendrogram(shc.linkage(df[['Fresh', 'Milk']], method='ward'), labels=df.Fresh.values, color_threshold=100)
plt.xticks(fontsize=12)
plt.show()
Imagine a horizontal line drawn halfway up the plot: four branches cross this imagined line, which suggests four clusters.
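The imagined horizontal cut can also be made programmatically with scipy's fcluster, which labels the clusters obtained by cutting the tree at a given distance. A minimal sketch on two synthetic blobs (standing in for the 'Fresh'/'Milk' columns, since the CSV is not bundled here); the threshold 15 is an assumed value chosen for this toy data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic 2-D blobs standing in for the two columns
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])

Z = linkage(X, method='ward')
# Cutting the tree at distance 15 plays the role of the "invisible line":
# every branch that crosses the cut becomes one cluster
labels = fcluster(Z, t=15, criterion='distance')
print(len(set(labels)))  # number of clusters below the cut
```

On the real dendrogram, the threshold would be read off the y-axis at the height of the imagined line.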
Elbow Method
In [6]:
PKS =df[['Fresh', 'Milk']]
MinMaxScaler
In [7]:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms.fit(PKS)
df_transformed = mms.transform(PKS)
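MinMaxScaler rescales each column independently to the range [0, 1] via (x - min) / (max - min), so that 'Fresh' and 'Milk' contribute on the same scale. A quick sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-column data: min = 2, max = 10, so the range is 8
toy = np.array([[2.0], [4.0], [10.0]])
scaled = MinMaxScaler().fit_transform(toy)
# (2-2)/8 = 0.0, (4-2)/8 = 0.25, (10-2)/8 = 1.0
print(scaled.ravel())
```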
In [8]:
Sum_of_squared_distances = []
K = range(1,15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(df_transformed)
    Sum_of_squared_distances.append(km.inertia_)
In [9]:
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
To determine the optimal number of clusters, we pick the value of k at the "elbow", i.e. the point after which the distortion (inertia) begins to decrease roughly linearly. For these data, we conclude that the optimal number of clusters is 5.
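The elbow can also be located programmatically: the bend is where the inertia curve has its sharpest kink, which the second difference of the inertia values picks out. A rough sketch on hypothetical inertia values (the shape, steep drop then flattening, is what matters, not the numbers); dedicated libraries such as kneed do this more robustly:

```python
import numpy as np

# Hypothetical inertia values: steep drop, then flattening out
inertia = np.array([100.0, 45.0, 20.0, 10.0, 8.0, 7.0, 6.5])
k_values = np.arange(1, len(inertia) + 1)

# The second difference is largest where the curve bends hardest
second_diff = np.diff(inertia, n=2)
elbow_k = k_values[np.argmax(second_diff) + 1]
print(elbow_k)
```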
Clustering customers by Fresh and Milk
In [10]:
from sklearn.cluster import AgglomerativeClustering
# 'euclidean' is the default metric for ward linkage; the 'affinity'
# keyword was removed in newer scikit-learn versions
cluster = AgglomerativeClustering(n_clusters=5, linkage='ward')
cluster.fit_predict(PKS)
Out[10]:
In [11]:
plt.figure(figsize=(10, 7))
plt.scatter(PKS['Fresh'], PKS['Milk'], c=cluster.labels_, cmap='inferno')
plt.title('Clustering population by Fresh and Milk', fontsize=22, alpha=0.5)
plt.xlabel('Fresh', fontsize=22, alpha=0.5)
plt.ylabel('Milk', fontsize=22, alpha=0.5)
Out[11]:
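As a sanity check, KMeans with the same number of clusters typically recovers a very similar partition on well-separated data. A hedged sketch comparing the two algorithms' labelings with the adjusted Rand index (synthetic blobs stand in for the CSV; a score of 1.0 means the partitions agree up to label permutation):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(42)
# Three well-separated synthetic 2-D blobs
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 5, 10)])

agg_labels = AgglomerativeClustering(n_clusters=3, linkage='ward').fit_predict(X)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Compare the two partitions; label numbers may differ, the grouping should not
print(adjusted_rand_score(agg_labels, km_labels))
```

On real, overlapping data the two methods can disagree more, which is one reason to check a second algorithm before trusting a cluster count.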
