How the Hierarchical Clustering Algorithm Works
The hierarchical clustering algorithm is an unsupervised learning algorithm, and it is one of the most popular clustering techniques in machine learning.
Expectations of getting insights from machine learning algorithms are increasing rapidly. Initially, we were limited to predicting the future by feeding in historical data.
This is easy when the expected results and the features are available in the historical data, because we can build supervised learning models that predict the future.
- For example, predicting whether an email is spam or not, using historical email data.
But real-world problems are not limited to the supervised type; we run into unsupervised problems too.
How do we build models for such problems?
This is where unsupervised learning algorithms come in.
In this article, we are going to learn one such popular unsupervised learning algorithm: the hierarchical clustering algorithm.
Before we understand what the hierarchical clustering algorithm is, its benefits, and how it works, let us first learn about unsupervised learning.
What is Unsupervised Learning
Unsupervised learning is training a machine using information that is neither classified nor labeled and allows the machine to act on that information without guidance.
In unsupervised learning, a machine’s task is to group unsorted information according to similarities, patterns, and differences without any prior training on the data. It is defined as:
“An unsupervised learning algorithm is a machine learning technique where you don’t have to supervise the model. Rather, you allow the model to work on its own to discover information; it mainly deals with unlabelled data.”
If you want to know more, we suggest you read the unsupervised learning algorithms article.
Types of Unsupervised Learning Algorithm
Unsupervised Learning algorithms are classified into two categories.
- Clustering Algorithms
- Association Rule Algorithms
Clustering Algorithms: Clustering is a technique of grouping objects into clusters. Objects with the most similarities remain in a group and have few or no similarities with the objects of another group.
Association Rule Algorithms: An association rule is an unsupervised learning method that helps find relationships between variables in a large database.
Unsupervised Learning Algorithms
Some popular unsupervised learning algorithms are:
- K-means Clustering
- Hierarchical Clustering
- Principal Component Analysis
- Apriori Algorithm
- Anomaly detection
- Independent Component Analysis
- Singular value decomposition
Before we learn about hierarchical clustering, we need to know about clustering and how it is different from classification.
What is Clustering
Clustering is an important technique when it comes to unsupervised learning. Clustering mainly deals with finding a structure or pattern in a collection of uncategorized data.
It is a technique that groups similar objects such that objects in the same group are more similar to each other than to objects in other groups. A group of similar objects is called a cluster.
How is clustering different from classification?
For a data science beginner, the difference between clustering and classification can be confusing. So, as the initial step, let us understand the fundamental difference between the two.
For example,
Let us say we have four categories:
- Dog
- Cat
- Shark
- Goldfish
In this scenario, clustering would make two clusters: one for animals that live on land and one for animals that live in water.
So the entities of the first cluster would be dogs and cats. Similarly, the second cluster would contain sharks and goldfish.
But in classification, it would classify the four categories into four different classes. One for each category.
So dogs would be classified under the class dog, and similarly, it would be for the rest.
In classification, we have labels that tell us and supervise whether the classification is right or not, and that is how we can classify the data correctly. This makes classification a supervised learning technique.
But in clustering, despite the distinctions, we cannot label the groups beforehand because we don’t have labels for them. That is why clustering is an unsupervised learning algorithm.
In real life, we can expect high volumes of data without labels. Because of this, clustering techniques have many real-world applications. Let us look at them.
Applications of Clustering
Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:
- Recommendation Engines
- Clustering similar news articles
- Medical Imaging
- Image Segmentation
- Anomaly detection
- Pattern Recognition
So far, we have developed an in-depth idea of what unsupervised learning is and its types. We have also learned what clustering is and the various applications of clustering algorithms.
Now, let us look at a detailed explanation of what hierarchical clustering is and why it is used.
What is Hierarchical Clustering
Hierarchical clustering is one of the popular clustering techniques after K-means clustering. It is also known as hierarchical cluster analysis (HCA),
which is used to group unlabelled datasets into clusters. The hierarchical clustering technique builds clusters based on the similarity between different objects in the set.
It goes through the various features of the data points and looks for the similarity between them.
This process continues until the whole dataset has been grouped, which creates a hierarchy for each of these clusters.
Hierarchical Clustering deals with the data in the form of a tree or a well-defined hierarchy.
For this reason, the algorithm is named the hierarchical clustering algorithm.
This hierarchical way of clustering can be performed in two ways.
- Agglomerative: Hierarchy created from bottom to top.
- Divisive: Hierarchy created from top to bottom.
In the next section of this article, let’s learn about these two ways in detail.
In the earlier sections of this article, we listed various algorithms that perform clustering. But how is hierarchical clustering different from the other techniques?
Let’s discuss that.
Why Hierarchical Clustering
As we already have some clustering algorithms such as K-Means Clustering, then why do we need Hierarchical Clustering?
As we have already seen in the K-means clustering algorithm article, it uses a pre-specified number of clusters. It requires advance knowledge of K, i.e., you have to define the number of clusters you want to divide your data into.
In hierarchical clustering, however, there is no need to pre-specify the number of clusters as we did in K-means; one can stop at any number of clusters.
Furthermore, hierarchical clustering has an advantage over K-means clustering: it results in an attractive tree-based representation of the observations, called a dendrogram.
Types of Hierarchical Clustering
The Hierarchical Clustering technique has two types.
- Agglomerative Hierarchical Clustering
  - Start with the points as individual clusters.
  - At each step, merge the closest pair of clusters until only one cluster (or K clusters) is left.
- Divisive Hierarchical Clustering
  - Start with one, all-inclusive cluster.
  - At each step, split a cluster until each cluster contains a single point (or there are K clusters).
Agglomerative Clustering
It is also known as AGNES (Agglomerative Nesting) and follows the bottom-up approach.
Each observation starts with its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
That means the algorithm considers each data point as a single cluster initially and then starts combining the closest pair of clusters together.
It repeats the same process until all the clusters are merged into a single cluster that contains all the data points.
How does Agglomerative Hierarchical Clustering work
Let’s take a sample of data and learn how agglomerative hierarchical clustering works step by step.
Step 1
First, make each data point a “single-point cluster,” which forms N clusters (let’s assume there are N data points).
Step 2
Take the two closest data points and make them one cluster; now there are N-1 clusters.
Step 3
Again, take the two closest clusters and make them one cluster; now there are N-2 clusters.
Step 4
Repeat ‘Step 3’ until you are left with only one cluster.
Once all the clusters are combined into one big cluster, we develop the dendrogram to divide the clusters.
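To make these steps concrete, here is a minimal sketch (not part of the original walkthrough) that runs SciPy's `linkage` function on a few made-up points and prints the merge performed at each step:

```python
# A minimal sketch of the agglomerative merge sequence using SciPy.
# The points below are made-up example data, not the article's dataset.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[1.0, 1.0],
                   [1.2, 1.1],
                   [5.0, 5.0],
                   [5.2, 4.9],
                   [9.0, 1.0]])

# Each row of the linkage matrix describes one merge:
# [cluster_i, cluster_j, distance, number of points in the new cluster]
Z = linkage(points, method='ward')
for step, (i, j, dist, size) in enumerate(Z, start=1):
    print(f"Step {step}: merge clusters {int(i)} and {int(j)} "
          f"at distance {dist:.2f} -> new cluster of size {int(size)}")
```

Each row of the linkage matrix corresponds to one repetition of "Step 3" above, and the full matrix is exactly what the dendrogram visualizes.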
For divisive hierarchical clustering, the algorithm treats all the data points as one cluster and splits it until it creates meaningful clusters.
Different ways to measure the distance between two clusters
There are several ways to measure the distance between two clusters in order to decide the rules for clustering, and they are often called linkage methods.
Some of the popular linkage methods are:
- Single Linkage
- Complete Linkage
- Average Linkage
- Centroid Linkage
- Ward’s Linkage
Single Linkage
Single linkage is also known as the minimum linkage (MIN) method.
In the single linkage method, the distance between two clusters is defined as the minimum distance between an object (point) in one cluster and an object (point) in the other cluster. This method is also known as the nearest neighbor method.
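Written as a formula, the single linkage distance between two clusters $A$ and $B$ is the smallest pairwise distance between their points:

$$d_{\text{single}}(A, B) = \min_{a \in A,\ b \in B} d(a, b)$$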
Pros and Cons of the Single Linkage method
Pros of Single Linkage
- It can handle clusters of non-elliptical (arbitrary) shapes and works well when clusters are naturally well separated.
Cons of Single Linkage
- It is sensitive to noise and outliers and tends to produce long, “chained” clusters.
Complete Linkage
The complete Linkage method is also known as the Maximum Linkage (MAX) method.
In the Complete Linkage technique, the distance between two clusters is defined as the maximum distance between an object (point) in one cluster and an object (point) in the other cluster.
And this method is also known as the furthest neighbor method.
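In formula form, the complete linkage distance is the largest pairwise distance between the two clusters:

$$d_{\text{complete}}(A, B) = \max_{a \in A,\ b \in B} d(a, b)$$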
Pros and Cons of the Complete Linkage method
Pros of Complete Linkage
- It is less susceptible to noise and outliers than single linkage.
Cons of Complete Linkage
- It tends to break large clusters and is biased towards globular (spherical) clusters.
Average Linkage
In the Average Linkage technique, the distance between two clusters is the average distance between each cluster’s point to every point in the other cluster.
This method is also known as the unweighted pair group method with arithmetic mean.
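As a formula, the average linkage distance averages all pairwise distances between the two clusters:

$$d_{\text{average}}(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$$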
Pros and Cons of the Average Linkage method
Pros of Average Linkage
- It is less affected by noise and outliers, acting as a compromise between single and complete linkage.
Cons of Average Linkage
- It is biased towards globular clusters and uses all pairwise distances, which is computationally more expensive.
Centroid Linkage
In the Centroid Linkage approach, the distance between the two sets or clusters is the distance between two mean vectors of the sets (clusters).
At each stage, we combine the two sets that have the smallest centroid distance. In simple words, it is the distance between the centroids of the two sets.
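In symbols, with $\mu_A$ and $\mu_B$ denoting the mean vectors (centroids) of the two clusters:

$$d_{\text{centroid}}(A, B) = d(\mu_A, \mu_B), \qquad \mu_A = \frac{1}{|A|} \sum_{a \in A} a$$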
Pros and Cons of the Centroid Linkage method
Pros of Centroid Linkage
- It is easy to interpret, since each cluster is represented by a single mean vector.
Cons of Centroid Linkage
- It can produce inversions (a later merge at a smaller distance than an earlier one), which makes the dendrogram harder to read.
Ward’s Linkage
Ward's linkage measures the similarity of two clusters based on the increase in squared error when the two clusters are merged. It behaves similarly to the group average method when the distance between points is the squared distance.
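A common way to write this merge cost (the increase in the total within-cluster sum of squared errors when clusters $A$ and $B$ are merged), with $\mu_A$ and $\mu_B$ the cluster centroids, is:

$$\Delta(A, B) = \frac{|A|\,|B|}{|A| + |B|}\, \lVert \mu_A - \mu_B \rVert^2$$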
Pros and Cons of Ward's Linkage method
Pros of Ward's Linkage
- It is less susceptible to noise and outliers and tends to produce compact clusters of similar size.
Cons of Ward's Linkage
- It is biased towards globular clusters and is defined for Euclidean distances.
Some of the other linkage methods are:
- Strong Linkage
- Flexible linkage
- Simple Average
The choice of linkage method depends on the type of problem; you can apply any of them, but keep in mind that different linkage methods lead to different clusters.
Hierarchical Clustering algorithms generate clusters that are organized into hierarchical structures.
These hierarchical structures can be visualized using a tree-like diagram called Dendrogram.
Now let us discuss Dendrogram.
What is Dendrogram
A Dendrogram is a diagram that represents the hierarchical relationship between objects. The Dendrogram is used to display the distance between each pair of sequentially merged objects.
These are commonly used in studying hierarchical clusters before deciding the number of clusters significant to the dataset.
The distance at which the two clusters combine is referred to as the dendrogram distance.
The primary use of a dendrogram is to work out the best way to allocate objects to clusters.
The key point to interpreting or implementing a dendrogram is to focus on the closest objects in the dataset.
For example, consider a dendrogram built over seven objects, P0 to P6. We can observe that objects P6 and P5 are very close to each other, so they are merged into one cluster, named C1; next, object P4 is close to cluster C1, so the two are combined into a cluster (C2).
Objects P1 and P2 are also close to each other, so they are merged into one cluster (C3); cluster C3 is then merged with object P0 to form a cluster (C4); object P3 is merged with cluster C2; and finally, the two remaining clusters are merged into a single cluster (C6).
By now, we have a clear idea of agglomerative hierarchical clustering and dendrograms.
Now let us implement the agglomerative clustering technique in Python.
Agglomerative Clustering Algorithm Implementation in Python
Let us have a look at how to apply hierarchical clustering in Python on the Mall_Customers dataset.
If you remember, we used the same dataset in the k-means clustering algorithm implementation too.
Please refer to the k-means article to get the dataset.
Importing the libraries and loading the data
First, we import all the necessary libraries and then load the data.
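A minimal sketch of this step is shown below. It assumes the Mall_Customers.csv file from the k-means article, with annual income and spending score used as the two clustering features (the exact column positions are an assumption based on that dataset):

```python
# Import the standard data-handling and plotting libraries.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the Mall_Customers dataset (assumed to be the same CSV used in the
# k-means article) and pick 'Annual Income (k$)' and 'Spending Score (1-100)'
# as the two features for clustering.
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values
```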
Dendrogram to find the optimal number of clusters
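A sketch of this step using SciPy is given below; `X` is the feature matrix prepared above, and Ward linkage is used as an example:

```python
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Build the linkage matrix with Ward's method and plot the dendrogram.
plt.figure(figsize=(10, 6))
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()
```

To pick the number of clusters, look for the longest vertical distance that is not crossed by any horizontal merge line and cut the tree at that level.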
Training the Hierarchical Clustering model on the dataset
Now, we train the model on our dataset using agglomerative hierarchical clustering.
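Below is a sketch of the training step with scikit-learn's AgglomerativeClustering. The choice of five clusters is an assumption taken from reading the dendrogram above (it also matches the number used in the k-means article for this dataset):

```python
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Fit agglomerative clustering with Ward linkage and 5 clusters (assumed from the dendrogram).
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(X)

# Visualize the clusters; this works because X has exactly two features.
for cluster_id in range(5):
    plt.scatter(X[y_hc == cluster_id, 0], X[y_hc == cluster_id, 1],
                s=30, label=f'Cluster {cluster_id + 1}')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```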
Advantages and Disadvantages of Agglomerative Hierarchical Clustering Algorithm
Advantages
- It is simple to implement, and the dendrogram lets us choose the number of clusters after the fact.
Disadvantages
- It is computationally expensive on large datasets, and once two clusters are merged, the merge cannot be undone.
Divisive Hierarchical Clustering
Divisive hierarchical clustering is also known as DIANA (Divisive Analysis).
It is a top-down clustering approach. It works similarly to agglomerative clustering, but in the opposite direction.
This approach starts with a single cluster containing all objects and then splits the cluster into two least similar clusters based on their characteristics. We proceed with the same process until there is one cluster for each observation.
Here, the divisive approach is considered rigid, i.e., once a split is made on a cluster, it can't be reverted.
Steps to perform Divisive Clustering
- Initially, all the objects or points in the dataset belong to one single cluster.
- Partition the single cluster into two least similar clusters.
- Continue this process to form new clusters until the desired number of clusters is reached, which in the extreme case means one cluster for each observation; a rough sketch of this process is shown below.
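Scikit-learn does not ship a DIANA implementation, so the sketch below is only an illustration of the top-down idea: it approximates each "split into two least similar clusters" step with a two-means partition of the cluster with the largest spread. It is a teaching sketch under those assumptions, not the textbook DIANA algorithm:

```python
# A rough illustrative sketch of divisive (top-down) clustering.
# Each split is approximated with a 2-means partition; this is not DIANA itself.
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters):
    # Start with all points in one cluster (store member indices).
    clusters = [np.arange(len(X))]
    while len(clusters) < n_clusters:
        # Pick the cluster with the largest spread to split next.
        spreads = [X[idx].var(axis=0).sum() if len(idx) > 1 else 0.0
                   for idx in clusters]
        idx = clusters.pop(int(np.argmax(spreads)))
        # Split it into two sub-clusters with 2-means.
        labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
        clusters.append(idx[labels == 0])
        clusters.append(idx[labels == 1])
    return clusters

# Example usage on the same feature matrix X as above:
# groups = divisive_clustering(X, n_clusters=5)
```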
Strengths and Limitations of Hierarchical Clustering Algorithm
Every algorithm has strengths and limitations. If we don't know about them, we end up using algorithms in cases where they are not suited. So let’s learn these as well.
Strengths of Hierarchical Clustering
- It is easy to understand and implement.
- We don’t have to pre-specify any particular number of clusters.
- We can obtain any desired number of clusters by cutting the dendrogram at the proper level.
- The resulting clusters may correspond to meaningful classifications.
- It is easy to decide the number of clusters by merely looking at the dendrogram.
Limitations of Hierarchical Clustering
- Hierarchical clustering does not work well on vast amounts of data.
- All the approaches for calculating the similarity between clusters have their own disadvantages.
- Once a decision is made to combine two clusters, it cannot be undone.
- Different linkage measures have problems with one or more of the following:
  - Sensitivity to noise and outliers.
  - Difficulty handling clusters of different sizes.
  - Breaking large clusters.
- The order of the data has an impact on the final results.
Conclusion
In this article, we discussed the hierarchical clustering algorithm’s in-depth intuition and its two approaches, agglomerative clustering and divisive clustering.
Hierarchical clustering is often used for descriptive rather than predictive modeling.
We mostly use the hierarchical clustering algorithm when the application requires a hierarchy. Its advantage is that we don’t have to pre-specify the number of clusters.
However, it doesn’t work very well on vast amounts of data, so it is not suitable for huge datasets, and it gives its best results only in some cases.
Frequently Asked Questions (FAQs) On Hierarchical Clustering Algorithm
1. What is Hierarchical Clustering?
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters by either successively splitting or merging them.
2. How Does Hierarchical Clustering Differ from K-Means Clustering?
Unlike K-means which requires specifying the number of clusters beforehand, hierarchical clustering doesn't need that. It builds a tree of clusters, allowing for a visual representation of the data hierarchy.
3. What are Agglomerative and Divisive Hierarchical Clustering?
Agglomerative (or bottom-up) starts with each data point as a separate cluster and merges them step by step. Divisive (or top-down) starts with all data points as one cluster and divides them.
4. How Are Clusters Merged in the Agglomerative Approach?
In the agglomerative approach, at each step, the two clusters that are closest to each other are merged.
5. Which Distance Metrics are Used in Hierarchical Clustering?
Common distance metrics include Euclidean, Manhattan, and Cosine. The choice of distance metric can influence the shape and structure of clusters.
6. What is a Dendrogram in Hierarchical Clustering?
A dendrogram is a tree-like diagram that illustrates the arrangement of clusters created by hierarchical clustering. It helps determine the number of clusters by cutting the tree at various levels.
7. How Do I Determine the Optimal Number of Clusters?
By analyzing the dendrogram! Large jumps in distances in the dendrogram can suggest an optimal number of clusters.
8. What are Linkage Methods in Hierarchical Clustering?
Linkage methods determine the distance between clusters. Common methods are single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), average linkage, and Ward's linkage (minimize variance).
9. How Scalable is Hierarchical Clustering?
Hierarchical clustering tends to be computationally intensive, especially for larger datasets. It's often more suited to smaller datasets or when the hierarchical structure is of specific interest.
10. Is It Necessary to Standardize Data Before Hierarchical Clustering?
Often, yes. Especially when the dataset's features have different units or scales. Standardizing ensures that each feature contributes equally to the distance computation.
11. Can I Use Hierarchical Clustering for Large Datasets?
While possible, it's computationally demanding. For very large datasets, one might consider sampling or using more scalable clustering algorithms.
12. How Robust is Hierarchical Clustering to Outliers?
Its sensitivity to outliers depends on the linkage method. For instance, single linkage can be sensitive to noise and outliers, whereas Ward's method or complete linkage can be more robust.