How the Hierarchical Clustering Algorithm Works

Hierarchical Clustering Algorithm

The hierarchical clustering algorithm is an unsupervised learning algorithm, and it is one of the most popular clustering techniques in machine learning.

Expectations of getting insights from machine learning algorithms are rising rapidly. Initially, we were limited to predicting the future by feeding in historical data.

This is straightforward when the expected results and the features are available in the historical data, so we can build supervised learning models that predict the future.


But real-world problems are not limited to the supervised type; we come across unsupervised problems too.

How do we build models for such problems?

This is where unsupervised learning algorithms come in.

In this article, we are going to learn one such popular unsupervised learning algorithm: the hierarchical clustering algorithm.

Before we understand what the hierarchical clustering algorithm is, its benefits, and how it works, let us first learn about unsupervised learning.

What Is Unsupervised Learning

Unsupervised learning is training a machine on information that is neither classified nor labeled, and allowing the machine to act on that information without guidance.

In unsupervised learning, a machine’s task is to group unsorted information according to similarities, patterns, and differences without any prior training on the data. It is defined as:

    “An unsupervised learning algorithm is a machine learning technique where you don’t have to supervise the model. Instead, you allow the model to work on its own to discover information. It mainly deals with unlabelled data.”

If you want to know more, we would suggest you read the unsupervised learning algorithms article.

Types of Unsupervised Learning Algorithm

Unsupervised Learning algorithms are classified into two categories.

  • Clustering Algorithms
  • Association Rule Algorithms

Clustering Algorithms: Clustering is a technique of grouping objects into clusters. Objects with the most similarities remain in a group and have few or no similarities with the objects of other groups.

Association Rule Algorithms: Association rule learning is an unsupervised learning method that helps in finding relationships between variables in a large database.

Unsupervised Learning Algorithms 

Some of the popular unsupervised learning algorithms are k-means clustering, hierarchical clustering, DBSCAN, principal component analysis (PCA), and the Apriori algorithm.

Before we learn about hierarchical clustering, we need to know about clustering and how it is different from classification.

What Is Clustering

Clustering is an important technique in unsupervised learning. It mainly deals with finding a structure or pattern in a collection of uncategorized data.

Clustering groups similar objects such that objects in the same group are more similar to each other than to objects in the other groups. A group of similar objects is called a cluster.

How is clustering different from classification?

For a data science beginner, the difference between clustering and classification can be confusing. So, as an initial step, let us understand the fundamental difference between the two.

For example,

Let us say we have four categories

  1. Dog 
  2. Cat
  3. Shark
  4. Goldfish
Clustering Vs Classification Example

In this scenario, clustering would make two clusters: one for animals that live on land and one for animals that live in water.

So the entities of the first cluster would be dogs and cats. Similarly, the second cluster would contain sharks and goldfish.

But in classification, the four categories would be classified into four different classes, one for each category.

So dogs would be classified under the class dog, and similarly for the rest.

In classification, we have labels to tell us and supervise whether the classification is right or not, and that is how we can classify correctly. This makes it a supervised learning algorithm.

But in clustering, despite the distinctions, we cannot classify the items this way because we don’t have labels for them. That is why clustering is an unsupervised learning algorithm.

In real life, we can expect high volumes of data without labels. Because of this, clustering techniques help in many real-world situations. Let us understand that.

Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:

  • Recommendation Engines
  • Clustering similar news articles
  • Medical Imaging
  • Image Segmentation
  • Anomaly detection
  • Pattern Recognition

So far, we have gotten an in-depth idea of what unsupervised learning is and its types. We have also learned what clustering is and seen various applications of clustering algorithms.

Now let's have a look at a detailed explanation of what hierarchical clustering is and why it is used.

What is Hierarchical Clustering

Hierarchical clustering is one of the most popular clustering techniques after k-means clustering. It is also known as Hierarchical Cluster Analysis (HCA).

It is used to group unlabelled datasets into clusters. The hierarchical clustering technique builds clusters based on the similarity between the different objects in the set.

It goes through the various features of the data points and looks for the similarity between them.

This process continues until the whole dataset has been grouped, which creates a hierarchy of clusters.

Hierarchical Clustering deals with the data in the form of a tree or a well-defined hierarchy.
Hierarchical Clustering Types Agglomerative and Divisive

For this reason, the algorithm is named the hierarchical clustering algorithm.

This hierarchical clustering can be performed in two ways.

  • Agglomerative: Hierarchy created from bottom to top. 
  • Divisive: Hierarchy created from top to bottom.

In the next section of this article, let's learn about these two ways in detail. For now, the above image gives you a high-level understanding.

In the earlier sections of this article, we listed various algorithms to perform clustering. But how is hierarchical clustering different from those techniques?

Let’s discuss that.

Why Hierarchical Clustering

As we already have some clustering algorithms such as K-Means Clustering, then why do we need Hierarchical Clustering? 

As we saw in the k-means clustering algorithm article, it uses a pre-specified number of clusters. It requires advance knowledge of K, i.e., the number of clusters one wants to divide the data into.

In hierarchical clustering, however, there is no need to pre-specify the number of clusters as we did in k-means clustering; one can stop at any number of clusters.

Furthermore, hierarchical clustering has an advantage over k-means clustering: it results in an attractive tree-based representation of the observations, called a Dendrogram.
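As a minimal illustration of this flexibility, here is a sketch using scipy on made-up toy points (not data from this article): the linkage tree is built once and can then be cut at different levels to obtain any number of clusters without refitting.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points (illustrative only)
points = np.array([
    [1.0, 1.0], [1.2, 1.1],   # a tight pair
    [5.0, 5.0], [5.1, 4.9],   # another pair
    [9.0, 1.0], [8.8, 1.2],   # a third pair
])

# Build the hierarchy once...
Z = linkage(points, method='ward')

# ...then cut it at any desired number of clusters without refitting
print(fcluster(Z, t=2, criterion='maxclust'))  # labels for 2 clusters
print(fcluster(Z, t=3, criterion='maxclust'))  # labels for 3 clusters
```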

Types of Hierarchical Clustering 

The Hierarchical Clustering technique has two types.

  • Agglomerative Hierarchical Clustering
    • Start with the points as individual clusters.
    • At each step, merge the closest pair of clusters until only one cluster (or K clusters) remains.
  • Divisive Hierarchical Clustering
    • Start with one, all-inclusive cluster.
    • At each step, split a cluster until each cluster contains a single point (or there are K clusters).

Agglomerative Clustering

It is also known as AGNES ( Agglomerative Nesting) and follows the bottom-up approach. 

Each observation starts with its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

That means the algorithm considers each data point as a single cluster initially and then starts combining the closest pair of clusters together. 

It repeats this process until all the clusters are merged into a single cluster that contains all the data points.

How does Agglomerative Hierarchical Clustering work 

Let’s take a sample of data and learn how the agglomerative hierarchical clustering work step by step.

Step 1

First, make each data point a single-point cluster, which forms N clusters (assuming there are N data points).

Agglomerative approach step 1

Step 2 

Take the two closest data points and merge them into one cluster; this leaves N-1 clusters.

Agglomerative approach step 2

Step 3

Again, take the two closest clusters and merge them into one cluster; this leaves N-2 clusters.

Agglomerative approach step 3

Step 4

Repeat ‘Step 3’ until you are left with only one cluster.

Agglomerative approach step 4

Once all the clusters are combined into one big cluster, we develop the dendrogram to decide how to divide the clusters.

Divisive hierarchical clustering, in contrast, treats all the data points as one cluster and splits it until it creates meaningful clusters.
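To make the step-by-step merging concrete, here is a small sketch (toy data and scipy, purely illustrative): each row of the linkage matrix records one merge, i.e., which two clusters were combined, at what distance, and how many points the new cluster contains.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five toy points, so the algorithm starts with N = 5 single-point clusters
points = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])

# Each of the N - 1 rows of the linkage matrix describes one merge:
# [index of cluster i, index of cluster j, merge distance, size of the new cluster]
Z = linkage(points, method='single')
print(Z)
```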

Different ways to measure the distance between two clusters

There are several ways to measure the distance between two clusters in order to decide the rules for clustering, and they are often called linkage methods.

Some of the popular linkage methods are:

  • Single Linkage
  • Complete Linkage
  • Average Linkage
  • Centroid Linkage
  • Ward’s Linkage

Single Linkage

Single Linkage Method

Single Linkage is also known as the Minimum Linkage (MIN) method.

In the Single Linkage method, the distance between two clusters is defined as the minimum distance between an object (point) in one cluster and an object (point) in the other cluster. This method is also known as the nearest-neighbor method.

Pros and Cons of the Single Linkage method

Pros of Single Linkage

  • Single Linkage methods can handle non-elliptical cluster shapes.
  • Single Linkage algorithms are the best for capturing clusters of different sizes.

Cons of Single Linkage

  • Single Linkage methods are sensitive to noise and outliers.
  • That means Single Linkage methods cannot group clusters properly if there is any noise between the clusters.
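A minimal numpy sketch of the idea (toy clusters, not from the article): the Single Linkage distance between two clusters is the smallest pairwise distance between their points; swapping min for max or mean gives the Complete and Average Linkage distances discussed below.

```python
import numpy as np

cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[4.0, 0.0], [6.0, 0.0]])

# All pairwise Euclidean distances between points of the two clusters
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=-1)

print(pairwise.min())   # single (minimum) linkage distance   -> 3.0
print(pairwise.max())   # complete (maximum) linkage distance -> 6.0
print(pairwise.mean())  # average linkage distance            -> 4.5
```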

Complete Linkage

Complete Linkage Method

The Complete Linkage method is also known as the Maximum Linkage (MAX) method.

In the Complete Linkage technique, the distance between two clusters is defined as the maximum distance between an object (point) in one cluster and an object (point) in the other cluster.

This method is also known as the furthest-neighbor method.

Pros and Cons of Complete Linkage method

Pros of Complete Linkage

  • Complete Linkage algorithms are less susceptible to noise and outliers.
  • That means the Complete Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Complete Linkage

  • Complete linkage methods tend to break large clusters.
  • Complete Linkage is biased towards globular clusters.

Average Linkage 

Average Linkage Method

In the Average Linkage technique, the distance between two clusters is the average distance between each cluster’s point to every point in the other cluster.

This method is also known as the unweighted pair group method with arithmetic mean.

Pros and Cons of the Average Linkage method

Pros of Average Linkage

  • The average Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Average Linkage

  • The average Linkage method is biased towards globular clusters.

Centroid Linkage 

Centroid Linkage Method

In the Centroid Linkage approach, the distance between the two sets or clusters is the distance between two mean vectors of the sets (clusters).

At each stage, we combine the two sets that have the smallest centroid distance. In simple words, it is the distance between the centroids of the two sets.

Pros and Cons of Centroid Linkage method

Pros of Centroid Linkage

  • The Centroid Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Centroid Linkage

  • Similar to Complete Linkage and Average Linkage methods, the Centroid Linkage method is also biased towards globular clusters.

Ward’s Linkage 

Wards Linkage Method

In Ward's Linkage method, the similarity of two clusters is based on the increase in squared error (variance) when the two clusters are merged. It behaves similarly to Average Linkage when the distance between points is the squared distance.

Pros and Cons of Ward’s Linkage method

Pros of Ward's Linkage

  • In many cases, Ward's Linkage is preferred as it usually produces better cluster hierarchies.
  • Ward’s method is less susceptible to noise and outliers.

Cons of Ward's Linkage

  • Ward’s linkage method is biased towards globular clusters.

Some of the other linkage methods are:

  • Strong Linkage
  • Flexible linkage
  • Simple Average

The choice of linkage method depends on you; you can apply any of them according to the type of problem, keeping in mind that different linkage methods lead to different clusters.
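In practice, the linkage method is usually just a parameter of the clustering function. The sketch below (toy data, illustrative only) shows how swapping the linkage argument in scikit-learn's AgglomerativeClustering changes the resulting cluster labels:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: two tight groups plus one far-away point
X_toy = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [25, 0]])

for method in ('single', 'complete', 'average', 'ward'):
    labels = AgglomerativeClustering(n_clusters=2, linkage=method).fit_predict(X_toy)
    print(method, labels)
```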

Below is a comparison image showing all the linkage methods. We took this reference image from the greatlearning platform blog.

Hierarchical Clustering Linkages

Hierarchical Clustering algorithms generate clusters that are organized into hierarchical structures.

These hierarchical structures can be visualized using a tree-like diagram called a dendrogram.

Now let us discuss Dendrogram.

What is Dendrogram 

A Dendrogram is a diagram that represents the hierarchical relationship between objects. The Dendrogram is used to display the distance between each pair of sequentially merged objects. 

These are commonly used in studying hierarchical clusters before deciding the number of clusters significant to the dataset.

The distance at which the two clusters combine is referred to as the dendrogram distance. 

The primary use of a dendrogram is to work out the best way to allocate objects to clusters.

Hierarchical Clustering Dendrogram

The key point to interpreting or implementing a dendrogram is to focus on the closest objects in the dataset.

Hence, from the above figure, we can observe that the objects P6 and P5 are very close to each other, so we merge them into one cluster named C1; next, the object P4 is close to cluster C1, so we combine them into a cluster (C2).

The objects P1 and P2 are close to each other, so we merge them into one cluster (C3); cluster C3 is then merged with the object P0 to form a cluster (C4); the object P3 is merged with the cluster C2; and finally, clusters C2 and C4 are merged into a single cluster (C6).

By now, we have a clear idea of agglomerative hierarchical clustering and dendrograms.

Now let us implement the agglomerative clustering technique in Python.

Agglomerative Clustering Algorithm Implementation in Python 

Let us have a look at how to apply hierarchical clustering in Python on the Mall_Customers dataset.

If you remember, we used the same dataset in the k-means clustering implementation too.

Please refer to the k-means article to get the dataset.

Importing the libraries and loading the data 

First, we import all the necessary libraries, then we load the data.
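The original post showed this step as a screenshot; below is a minimal sketch of what it typically looks like, assuming the file is named Mall_Customers.csv and that we cluster on the Annual Income and Spending Score columns (column positions 3 and 4), as in the k-means article.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
from sklearn.cluster import AgglomerativeClustering

# Load the dataset and keep Annual Income and Spending Score as features
dataset = pd.read_csv('Mall_Customers.csv')
X = dataset.iloc[:, [3, 4]].values

print(dataset.head())
```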

Input data overview

Dendrogram to find the optimal number of clusters
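A sketch of the dendrogram step (assuming the feature matrix X prepared above), using scipy's hierarchy module with Ward's linkage:

```python
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Ward's linkage on the feature matrix X from the previous step
dendrogram = sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean distance')
plt.show()
```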

Dendrogram

Training the Hierarchical Clustering model on the dataset 

Now we train the model on our dataset using agglomerative hierarchical clustering.
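Assuming the dendrogram suggests five clusters, a sketch of the training and plotting step with scikit-learn's AgglomerativeClustering looks like this. Ward linkage minimizes within-cluster variance and uses Euclidean distance by default, so no distance metric needs to be passed explicitly.

```python
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Five clusters (as suggested by the dendrogram), merged with Ward's linkage
hc = AgglomerativeClustering(n_clusters=5, linkage='ward')
y_hc = hc.fit_predict(X)

# Plot each cluster of customers
for cluster_id in range(5):
    plt.scatter(X[y_hc == cluster_id, 0], X[y_hc == cluster_id, 1],
                s=50, label=f'Cluster {cluster_id + 1}')
plt.title('Clusters of customers')
plt.xlabel('Annual Income')
plt.ylabel('Spending Score')
plt.legend()
plt.show()
```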

Hierarchical clustering result

Advantages and Disadvantages of Agglomerative Hierarchical Clustering Algorithm

Advantages

  • The agglomerative technique is easy to implement.
  • It can produce an ordering of objects, which may be informative for the display.
  • In agglomerative Clustering, there is no need to pre-specify the number of clusters.
  • The agglomerative approach creates smaller clusters first, which may reveal fine-grained similarities in the data.

Disadvantages

  • The agglomerative technique gives the best result in some cases only.
  • The algorithm can never undo what was done previously: if objects are incorrectly grouped at an earlier stage, the mistake carries through to the final result.
  • Using different distance metrics for measuring distances between clusters may produce different results, so performing multiple experiments and comparing the results is recommended to verify their veracity.

Divisive Hierarchical Clustering

Hierarchical Divisive Clustering

Divisive Hierarchical Clustering is also known as DIANA (Divisive Analysis).

It is a top-down clustering approach. It works similarly to agglomerative clustering, but in the opposite direction.

This approach starts with a single cluster containing all objects and then splits the cluster into two least similar clusters based on their characteristics. We proceed with the same process until there is one cluster for each observation. 

The divisive approach is rigid, i.e., once a split is made on a cluster, it cannot be undone.

Steps to perform Divisive Clustering 

  • Initially, all the objects or points in the dataset belong to one single cluster.
  • Partition the single cluster into two least similar clusters.
  • Continue this process to form new clusters until the desired number of clusters is reached (in the extreme case, one cluster for each observation), as sketched below.
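Neither scikit-learn nor scipy ships a divisive hierarchical clustering implementation, but the top-down idea can be sketched with a simple bisecting approach: repeatedly split the largest remaining cluster into two with 2-means until the desired number of clusters is reached. The helper below is purely illustrative and is not the DIANA algorithm itself.

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters):
    """Illustrative top-down (divisive) clustering via repeated 2-means splits."""
    labels = np.zeros(len(X), dtype=int)          # start: everything in one cluster
    while labels.max() + 1 < n_clusters:
        # Pick the largest remaining cluster and split it into two
        largest = np.bincount(labels).argmax()
        mask = labels == largest
        split = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[mask])
        labels[np.where(mask)[0][split == 1]] = labels.max() + 1
    return labels

X_toy = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
print(divisive_clustering(X_toy, 3))
```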

Strengths and Limitations of Hierarchical Clustering Algorithm

Every algorithm has strengths and limitations. If we don't know about them, we may end up applying an algorithm in cases where it is not suited. So let's learn these as well.

Strengths of Hierarchical Clustering 

  • It is easy to understand and implement.
  • We don’t have to pre-specify any particular number of clusters.
    • We can obtain any desired number of clusters by cutting the dendrogram at the proper level.
  • The resulting clusters may correspond to a meaningful classification.
  • It is easy to decide the number of clusters by merely looking at the dendrogram.

Limitations of Hierarchical Clustering

  • Hierarchical Clustering does not work well on vast amounts of data.
  • All the approaches to calculate the similarity between clusters have their own disadvantages.
  • In hierarchical Clustering, once a decision is made to combine two clusters, it can not be undone.
  • Different linkage measures have problems with one or more of the following:
    • Sensitivity to noise and outliers.
    • Difficulty handling clusters of different sizes.
    • Breaking large clusters.
    • The order of the data can have an impact on the final results.

Conclusion

In this article, we discussed the hierarchical clustering algorithm's in-depth intuition and its approaches, namely agglomerative clustering and divisive clustering.

Hierarchical clustering is often used as a form of descriptive rather than predictive modeling.

We mostly use the hierarchical clustering algorithm when the application requires a hierarchy. Its advantage is that we don’t have to pre-specify the number of clusters.

However, it doesn't work very well on vast amounts of data, so it is not suitable for large datasets, and it gives the best results only in some cases.

Frequently Asked Questions (FAQs) On Hierarchical Clustering Algorithm

1. What is Hierarchical Clustering?

Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters by either successively splitting or merging them.

2. How Does Hierarchical Clustering Differ from K-Means Clustering?

Unlike K-means which requires specifying the number of clusters beforehand, hierarchical clustering doesn't need that. It builds a tree of clusters, allowing for a visual representation of the data hierarchy.

3. What are Agglomerative and Divisive Hierarchical Clustering?

Agglomerative (or bottom-up) starts with each data point as a separate cluster and merges them step by step. Divisive (or top-down) starts with all data points as one cluster and divides them.

4. How Are Clusters Merged in the Agglomerative Approach?

In the agglomerative approach, at each step, the two clusters that are closest to each other are merged.

5. Which Distance Metrics are Used in Hierarchical Clustering?

Common distance metrics include Euclidean, Manhattan, and Cosine. The choice of distance metric can influence the shape and structure of clusters.

6. What is a Dendrogram in Hierarchical Clustering?

A dendrogram is a tree-like diagram that illustrates the arrangement of clusters created by hierarchical clustering. It helps determine the number of clusters by cutting the tree at various levels.

7. How Do I Determine the Optimal Number of Clusters?

By analyzing the dendrogram! Large jumps in distances in the dendrogram can suggest an optimal number of clusters.

8. What are Linkage Methods in Hierarchical Clustering?

Linkage methods determine the distance between clusters. Common methods are single linkage (minimum pairwise distance), complete linkage (maximum pairwise distance), average linkage, and Ward's linkage (minimize variance).

9. How Scalable is Hierarchical Clustering?

Hierarchical clustering tends to be computationally intensive, especially for larger datasets. It's often more suited to smaller datasets or when the hierarchical structure is of specific interest.

10. Is It Necessary to Standardize Data Before Hierarchical Clustering?

 Often, yes. Especially when the dataset's features have different units or scales. Standardizing ensures that each feature contributes equally to the distance computation.
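A quick sketch of that preprocessing step with scikit-learn's StandardScaler (assuming a feature matrix X):

```python
from sklearn.preprocessing import StandardScaler

# Scale each feature to zero mean and unit variance before computing distances
X_scaled = StandardScaler().fit_transform(X)
```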

11. Can I Use Hierarchical Clustering for Large Datasets?

 While possible, it's computationally demanding. For very large datasets, one might consider sampling or using more scalable clustering algorithms.

12. How Robust is Hierarchical Clustering to Outliers?

 Its sensitivity to outliers depends on the linkage method. For instance, a single linkage can be sensitive to noise and outliers, whereas Ward's method or complete linkage can be more robust.


I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.
