How K-Means Clustering Algorithm Works


K-means clustering is one of the most widely recognized and utilized algorithms in the realm of unsupervised machine learning. With its roots in vector quantization and signal processing, this technique has found its application in diverse areas ranging from image segmentation to market segmentation. 

But what makes k-means clustering so prevalent in the data science community? 

Is it its intuitive approach, its efficiency, or its versatility? Dive into this article as we demystify the mechanics behind k-means clustering, taking you through its foundational principles and step-by-step process, and showcasing its profound impact on making sense of vast and complex datasets.

In today’s world, machine learning model implementations are easy to find all over the internet. That makes it all the more important for machine learning enthusiasts to get their hands dirty with the concepts behind them.

Many fascinating topics come up in supervised and unsupervised learning, and even in reinforcement learning. But my favorite is the k-means clustering algorithm.

As the name suggests, it is a clustering algorithm.


Even if you don’t know what clustering is, that’s okay.

By the end of this article, you will learn everything you need to know about k-means clustering.

After reading this article, you won’t need to brush up on k-means clustering before attending any data scientist job interview.

Excited to learn 🙂

We are too 🙂

Great! Before starting, let’s look at the topics you are going to learn in this article. Only if you read the complete one 😀

I am not kidding. It’s true. It will give you a better idea about the entire article flow. 

Let’s split k-means clustering into two parts:

  • K-means

  • Clustering

Let’s learn clustering first. Then we will use this knowledge to understand k-means clustering.

What is clustering?


A cluster is a group of similar entities that are kept together. Their similarity is decided by the features they possess and by how closely associated the entities are with each other on those features.

Let’s say we have two points on a 2-D graph. Using the Euclidean distance, we can measure how closely these two points are located.

Likewise, using various similarity measures, we can find how close/similar the data points are. 
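As a quick illustration, here is how we could compute the Euclidean distance between two points in plain Python:

```python
import math

# Two points on a 2-D graph
p = (1.0, 2.0)
q = (4.0, 6.0)

# Euclidean distance: sqrt((x1 - x2)^2 + (y1 - y2)^2)
distance = math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
print(distance)  # 5.0
```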

All similar data points form clusters or groups. Creating these clusters in a meaningful way is called clustering.

In the machine learning world, clustering is the process in which we segregate a heap of data points into clusters on the basis of their features. 

We will discuss these features in the upcoming sections.

Clustering Real Life Example

[Figure: Clustering example]

One good real-life example of clustering is the world map. If you see here, each color represents a cluster. These clusters are created based on meaningful similarities.

For now, let’s say this similarity is distance. If you take any place in a cluster, it is closer to the center of that cluster than to the centers of other clusters.

This is one of the main rules for creating clusters with any clustering algorithm.

Any point in a cluster should be closer to its own cluster’s center and far from the centers of other clusters.

In a more technical way, we can say the intra-cluster distance between points of the same cluster should be smaller than the inter-cluster distance between points of different clusters.

  • Intra-cluster distance: distance between points within the same cluster.
  • Inter-cluster distance: distance between points of different clusters.

If the above statements are not clear, please go to the how to evaluate clusters section of this article, where we provide an excellent visual example.

We hope the above statement is clear by now. If not, read it again once you have given the article a complete reading.

How is Clustering Different from Classification?

As a data science beginner, the difference between clustering and classification can be confusing. So as the initial step, let us understand the fundamental difference between classification and clustering.

[Figure: Difference between clustering and classification]

For example, let us say we have four categories:

  1. Dog 
  2. Cat
  3. Shark
  4. Goldfish

In this scenario, clustering would make 2 clusters: one for the animals that live on land and another for those that live in water.

So the entities of the first cluster would be dogs and cats. Similarly, the second cluster would hold sharks and goldfish.

But in classification, the four categories would be classified into four different classes, one for each category. So dogs would be classified under the class dog, and similarly for the rest.

[Figure: Clustering vs. classification example]

In classification, we have labels to tell us and supervise whether the classification is right or not, and that is how we can classify correctly. This makes classification a supervised learning approach.

But in clustering, despite the distinctions, we cannot classify the entities like this because we don’t have labels for them. And that is why clustering is an unsupervised learning approach.

In real life, we can expect high volumes of data without labels. Because clustering handles exactly this, the technique helps in many real-world situations. Let us look at some of them.

Clustering Applications

Below are some common clustering applications.

Recommendation Engines

Clustering is widely used in recommendation engines to group users by their likes and dislikes.

Image Segmentation 

It clubs the pixels with similar values and segments them out from the rest of the image.

Customer Segmentation

People with similar choices are clustered and studied as one category. This helps a firm in ways like promoting products to the right audience and collecting the right feedback.

Popular Clustering Algorithms

There are various clustering algorithms, and which one to use depends on the use case. Below are some popular clustering algorithms.

  • KMeans
  • DBSCAN
  • Agglomerative Clustering
  • Gaussian Mixture Models
  • Spectral Clustering

But in this article, we are focusing exclusively on the K-Means algorithm.

How K-Means Clustering Works


Let us understand the basic intuition of this unsupervised learning algorithm.

The intuition of the algorithm

Let us start by understanding what the “k” in k-means stands for. K is a free parameter that specifies the number of clusters we want to form out of the given data points.

From all the content mentioned above, what we understand about a cluster is that we intend to have only those entities in one cluster that are similar to each other.

The same holds for k-means clustering. It is a clustering algorithm that aims to put similar entities in one cluster.

Well, you may ask, how does this algorithm decide whether an entity belongs to a cluster or not?

The answer is that it calculates the distance between each data point and the centroid of its cluster, and aims to minimize the sum of all these distances (the distance of each data point from its centroid).

In short, it uses similarity measures to decide that.

One small thing that we need to understand is that the more clusters there are, the smaller the sum of the distances of all the data points from their centroids becomes.

This is because the number of data points in each cluster decreases as the number of clusters increases.

And at the point where the number of clusters equals the number of data points, the sum of distances becomes zero, because each centroid is a data point itself!
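You can check this yourself with a quick sketch. Using scikit-learn (which calls this sum of squared distances inertia_) on a handful of random points, the sum shrinks as k grows and reaches zero when k equals the number of points:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.random((10, 2))  # 10 random 2-D points

# Inertia shrinks as k grows and hits zero when every point is its own centroid
for k in (1, 2, 5, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 4))
```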

Now let us see how it works. Please refer to the image below for all the steps.

[Figure: K-means clustering example, steps 1 to 6]

Step 1

Here we have a few data points that we want to cluster, so we start by picking the number of clusters we want for this case.

Let us select 2 for this instance, and then randomly pick two points, treating them as the centroids of the clusters.

Step 2

We have successfully marked the centers of these clusters. Now we mark all the points with the respective colors on the basis of their distance from the nearest centroid.

Step 3

After marking all the data points, we now compute the centroid of each cluster again. We do this because initially we picked the centroids at random, so recomputing them removes any resulting error.

The centroid of a cluster is computed as the mean of all the data points assigned to that cluster.

Step 4

Since we have recomputed the centroids and know they are not the same as before, we iterate the process again and find the points nearest to the new centroid of each cluster.

Step 5

Now we have the result again. One may ask: when shall we stop iterating, finding the centroids and then placing the data points accordingly? Well, you keep doing it until the positions of the centroids no longer change.

Step 6

We marked the two clusters.

In this case, it was easy, so we were able to get the results in 2 iterations only.
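To make the six steps above concrete, here is a minimal from-scratch sketch in NumPy. It follows the steps literally and, for brevity, ignores edge cases such as a cluster ending up with no points (which real libraries like scikit-learn handle):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Steps 4-5: repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```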

We had also talked about the random initialization we are putting ourselves into. The problem with this is that it can land us with some really bad clusters which won’t be of any use.

How To Evaluate Clusters

[Figure: Intra-cluster distance vs. inter-cluster distance]

Let us also understand the different evaluation metrics for clustering. In classification, evaluation metrics help us understand how well the built model performs on unseen data. In the same way, we have ways to determine the quality of the clusters created.

Of the many available, we will discuss 2 criteria for evaluating clusters.

  1. Inertia
  2. Dunn Index

Inertia

If you recall, we discussed that it is very important for us to have similar entities in a cluster. That is basically what inertia measures: it calculates the sum of squared distances of all the entities in a cluster from that cluster’s centroid, summed over all clusters. The lower the inertia, the more compact the clusters.
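For instance, scikit-learn exposes this value as the inertia_ attribute of a fitted KMeans model. A minimal sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated blobs
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# inertia_ is the within-cluster sum of squared distances to the centroids
print(km.inertia_)
```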

Dunn Index

Here comes the concept of inter- and intra-cluster distance. Intra-cluster distance is what inertia handles: the distance between data points inside one cluster. Inter-cluster distance means the distance between 2 different clusters.

So the Dunn index is the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.

The higher the value of the Dunn index, the better the clusters are in terms of being separable.
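To our knowledge, scikit-learn does not ship a Dunn index function, so here is a minimal sketch of one common definition, using the closest pair of points across clusters for the inter-cluster distance and the cluster diameter for the intra-cluster distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Max intra-cluster distance: the largest diameter of any single cluster
    max_intra = max(cdist(c, c).max() for c in clusters)
    # Min inter-cluster distance: the closest pair of points in different clusters
    min_inter = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_inter / max_intra
```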

How K-Means++ Clustering Works

So to pull ourselves out of this random initialization trap, we have k-means++.

Let us also see how this thing really works.

  1. Just like k-means, here too we select a centroid randomly, but the twist is that in k-means we selected the centroids for all the clusters at once, whereas here we randomly select only the first centroid.
  2. Next, we compute the distance of every data point from that centroid.
  3. Now comes the selection of the next centroid: we favor the data points that are farthest from the centroids already chosen. Typically, a point is picked with probability proportional to its squared distance from the nearest existing centroid.
  4. Repeat the above steps until the desired number (k) of centroids has been selected, as shown in the sketch below.
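Here is a minimal NumPy sketch of this seeding procedure, where each new centroid is drawn with probability proportional to the squared distance from the nearest centroid already picked:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance of every point to its nearest chosen centroid
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        # Step 3: far-away points are more likely to become the next centroid
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```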

We will look at the full implementation later, and along with that we will see how we can decide the right number of clusters.

Key Differences Between K-Means and K-Means++

K-Means

  • Takes less time to initialize.
  • Chooses all k initial centroids at random.

K-Means++

  • Takes more time to initialize.
  • Chooses one centroid at random and the others on the basis of the squared distance from the centroids already chosen.

Methods to Identify “K” in K-Means Clustering

How to identify the best “K”?

There are several methods to find the optimal number of clusters for k-means. Two popular methods are

  1. Elbow Method
  2. Silhouette Method

Elbow Method

It calculates the sum of squared distances from the data points to their centroids (WCSS) for a range of values of k, and looks for the “elbow” point after which increasing k brings only a marginal decrease in that sum.

Silhouette Method

The silhouette value measures how similar a point is to its own cluster (cohesion) compared to other clusters (separation).
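As a quick sketch, scikit-learn’s silhouette_score makes this easy to try; the synthetic data and the range of k here are just for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Silhouette ranges from -1 to 1; higher means better-separated clusters
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```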

Here we will be looking at the elbow method. Details about it are covered in the problem statement below.

K-means Clustering Implementation in Python

The problem is to cluster people on the basis of their spending scores and income. In this section, you will get to understand the dataset.

You will also learn how the elbow method determines the right number of clusters. At the end, we will walk through the Python implementation of k-means clustering and plot the clusters.

You can download the dataset from here.

Before jumping into this, we need to understand what exactly WCSS is doing.

WCSS stands for the within-cluster sum of squares, which is just a high-fi name for the sum of squared distances of all the data points from the centroids of their clusters.

In the code segment below, we start off with 1 cluster and go up to 10. Always remember we want this sum of distances to be as small as possible, without pushing the number of clusters so high that each cluster holds only a trivial number of data points.
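Here is a sketch of that code segment. The file name and column names are assumptions based on a typical income and spending-score dataset, so adjust them to match your downloaded copy:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Hypothetical file and column names; adjust to your dataset
df = pd.read_csv("Mall_Customers.csv")
X = df[["Annual Income (k$)", "Spending Score (1-100)"]].values

wcss = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    km.fit(X)
    wcss.append(km.inertia_)  # WCSS for this number of clusters

plt.plot(range(1, 11), wcss, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("WCSS")
plt.title("Elbow Method")
plt.show()
```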

[Figure: Elbow method output]

Here we can see that the decline in the value of WCSS slows considerably after cluster 5. So this means that the optimal number of clusters is 5.
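Continuing from the elbow snippet above (which prepared X), here is a sketch that fits the final model with 5 clusters and plots them; the axis labels are assumptions matching the hypothetical columns used earlier:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# X is the income / spending-score matrix prepared in the elbow snippet
km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)

# Scatter plot of the 5 clusters and their centroids
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="viridis", s=30)
plt.scatter(km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
            c="red", marker="X", s=200, label="Centroids")
plt.xlabel("Annual Income")
plt.ylabel("Spending Score")
plt.legend()
plt.show()
```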

[Figure: K-means clustering output]

Conclusion

In this article, we provided a brief idea of k-means clustering. We also explained how clustering is different from classification and how we can evaluate clusters.

This gives the complete flow of how the k-means algorithm works.

Along the way, we also saw the random initialization trap and how we can use k-means++ to pull ourselves out of it.

Lastly, we took a look at a clustering-based problem statement, which involved choosing the right number of clusters and visualizing them.


I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
