Simple cluster analysis using the K-Means Algorithm

Matheus Such
Published in Geek Culture
5 min read · Jun 29, 2021

Cluster analysis uses mathematical models to discover groups of similar customers based on the smallest variations among customers within each group.

Overview

This post aims to explain in a simple way how to cluster your customers using the K-Means algorithm and how this method can help marketing teams work better with customer campaigns. For this, I used a simple database that you can find here.

What is Cluster Analysis?

Clustering is a set of techniques used to partition data into groups, or clusters, using mathematical and machine learning models. Clusters are loosely defined as groups of data objects that are more similar to other objects in their cluster than they are to data objects in other clusters.

From a business perspective, the main goal of cluster analysis is to achieve more effective marketing strategies by targeting customers with specific offers and incentives according to their needs and preferences. One of the most common clustering methods is the K-Means model.

The K-Means Method

The k-means clustering method is an unsupervised machine learning technique used to identify clusters of data objects in a dataset. There are many different types of clustering methods, but k-means is one of the oldest and most approachable.

It uses an iterative technique to group unlabeled data into K clusters based on cluster centers (centroids). Each data point is assigned to the cluster with the nearest centroid, so that the total squared distance between the points and their respective centroids is kept as small as possible.
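The iteration described above can be sketched in plain Python. This is a minimal illustration, not the code used for this post: the function names, the random initialization and the toy points are all my own assumptions.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` into `k` groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids

# Toy two-dimensional points, made up for illustration.
points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

On this toy data the first three points end up in one cluster and the last three in the other. Which group gets label 0 depends on the random starting centroids.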

The Inertia

Inertia measures how well a dataset was clustered by K-Means. It is calculated by measuring the distance between each data point and its centroid, squaring this distance, and summing these squares across all points in all clusters.
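Written as a formula, assuming C_j denotes the set of points in cluster j and \mu_j its centroid (my notation, not from the original post), the inertia adds up the squared distances over all K clusters:

```latex
\text{Inertia} = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
```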

A good model is one with low inertia AND a low number of clusters (K). However, this is a tradeoff because as K increases, inertia decreases.

So, how do I know how many K clusters are needed?

One of the most common ways is the Elbow Method. The idea is to identify the point that balances greater homogeneity within each cluster against greater difference between clusters. This is the point on the curve farthest from a straight line drawn between points P0 and P1 (the first and last points of the inertia curve), found by following the equation below. Don’t worry, we will cover this further in an easy way.

EQUATION 1 — Formula to find the distance between a point to a line
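For reference, the standard distance from a point (x, y) to the line through P0 = (x_0, y_0) and P1 = (x_1, y_1) can be written as follows. The symbol names are my own notation and may differ from those used in EQUATION 1:

```latex
d = \frac{\left| (y_1 - y_0)\,x - (x_1 - x_0)\,y + x_1 y_0 - y_1 x_0 \right|}
         {\sqrt{(y_1 - y_0)^2 + (x_1 - x_0)^2}}
```

Here x is the number of clusters and y is the corresponding inertia.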

Data Analysis

The story behind the data…

You own a supermarket and, through membership cards, you have some basic data about your customers. You want to improve your marketing strategy by understanding your customers’ characteristics and buying behavior, and to pass those insights to your marketing team so it can plan future campaigns.

FIG 1 — The dataset

Here is the schema and an explanation of each variable in the dataset:

  • USER_ID — customer id
  • GENDER — customer gender
  • AGE — age of the customer
  • ANNUAL_INCOME — customer’s annual income
  • SPENDING_SCORE — customer’s score based on behavior and purchasing data (the higher the score, the more the customer spends)

Data Analysis

In FIG 3, we can see how the customers are distributed across each demographic variable. First, the age comparison: women are more present in the dataset, with a slight increase between ages 35 and 50. The Annual Income seems to be more evenly distributed, peaking at 70k, and the Spending Score follows the same pattern, with most customers near the average of 50 points.

FIG 3 — Age, Annual Income and Spending Score distribution by gender

Now, implementing the K-Means algorithm

The first step is to calculate the inertia, which measures how well the dataset was clustered by K-Means. The code below shows a simple function to calculate the inertia.

CODE 1 — Function to calculate the inertia
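A minimal sketch of such a function in plain Python, assuming the points, their cluster labels and the centroids are already available. The names and the toy values are illustrative and are not the original code:

```python
def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Toy data, made up for illustration.
points = [(1, 1), (1, 3), (9, 9), (11, 9)]

# Two clusters, each with its own centroid.
labels = [0, 0, 1, 1]
centroids = [(1, 2), (10, 9)]
print(inertia(points, labels, centroids))  # 4.0

# The same points forced into a single cluster around their overall mean.
print(inertia(points, [0, 0, 0, 0], [(5.5, 5.5)]))  # 134.0
```

Going from one cluster to two drops the inertia from 134.0 to 4.0 on this toy data, which is the tradeoff the Elbow Method tries to balance.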

With the inertia, we can plot the chart shown in FIG 4, which plots the inertia against the number of clusters. The green line represents the straight line drawn between points P0 and P1 that we saw in EQUATION 1.

FIG 4 — The Elbow Method

Using the inertia calculated in CODE 1, we will determine the optimal number of clusters for the proposed dataset. CODE 2 finds the point on the curve farthest from the green line.

CODE 2 — Find the optimal number of clusters
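One way this step could look in plain Python, assuming the inertias are stored in a list ordered by increasing K. The function name, the `first_k` parameter and the inertia values are hypothetical, and the list needs at least two values:

```python
import math

def optimal_number_of_clusters(inertias, first_k=1):
    """Return the K whose (K, inertia) point lies farthest from the
    straight line joining the first and last points of the curve."""
    x0, y0 = first_k, inertias[0]                       # P0
    x1, y1 = first_k + len(inertias) - 1, inertias[-1]  # P1
    denominator = math.hypot(y1 - y0, x1 - x0)
    distances = []
    for i, y in enumerate(inertias):
        x = first_k + i
        # Point-to-line distance between (x, y) and the line P0-P1.
        numerator = abs((y1 - y0) * x - (x1 - x0) * y + x1 * y0 - y1 * x0)
        distances.append(numerator / denominator)
    return first_k + distances.index(max(distances))

# Made-up inertia values for K = 1..8, for illustration only.
inertias = [1000, 600, 300, 150, 120, 100, 90, 85]
print(optimal_number_of_clusters(inertias))  # 4
```

Because the two axes are on very different scales, the result can shift if the inertia values are normalized first, so treat the answer as a starting point rather than a final verdict.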

Last but not least, we will use CODE 3 to determine the clusters and the centroid of each group.

CODE 3 — Function to calculate the clusters and the centroid

We made it…

Now that all customers are divided into groups, we will use a scatter plot to compare the data before and after clustering. Even with a simple dataset and little information about each customer, FIG 5 shows a big difference in the segmentation. Labeling each cluster by its characteristics:

  • Cluster 0: Important Customers, Middle Income and Spending Score
  • Cluster 1: Target Customers, High Income with Low Spending Score
  • Cluster 2: Less Important Customers, Low Income and Spending Score
  • Cluster 3: Alert, Low Income with High Spending Score
  • Cluster 4: Most Important Customers, High Income and Spending Score

FIG 5 — Clusters comparison

In FIG 7 below, we can see the gender distribution for each cluster. This is useful information for the marketing team to analyze when proposing specific campaigns to increase customers’ spending scores, especially in Clusters 0 and 1.

FIG 7 — Gender by Cluster

This is how you can use a simple algorithm to cluster your customers, and you can add more information for further analyses. I hope you enjoyed it. Below are some references and my GitHub page.

Matheus Such

Automotive Engineer and Data Science enthusiast.