How the Hierarchical Clustering Algorithm Works


Hierarchical Clustering is an unsupervised learning algorithm, and it is one of the most popular clustering techniques in machine learning.

Expectations of getting insights from machine learning algorithms are rising sharply. Initially, we were limited to predicting the future by feeding the model historical data.

This is easy when the expected results and the features in the historical data are available to build the supervised learning models, which can predict the future.

For example, predicting whether an email is spam or not, using historical email data.


But real-world problems are not limited to the supervised type; we get unsupervised problems too.

How do we build models for such problems?

This is where unsupervised learning algorithms come in.

In this article, we are going to learn one such popular unsupervised learning algorithm: the hierarchical clustering algorithm.

Before we start learning, let’s look at the topics you will learn in this article (only if you read the complete article 🙂).

Before we dive into what hierarchical clustering is, its benefits, and how it works, let us first learn about unsupervised learning.

What is Unsupervised Learning

Unsupervised learning means training a machine on information that is neither classified nor labeled, and allowing the machine to act on that information without guidance.

In unsupervised learning, a machine’s task is to group unsorted information according to similarities, patterns, and differences without any prior training on labeled data. It is defined as:

    “An unsupervised learning algorithm is a machine learning technique where you don’t have to supervise the model. Instead, you allow the model to work on its own to discover information; it mainly deals with unlabelled data.”

If you want to know more, we suggest you read the unsupervised learning algorithms article.

Types of Unsupervised Learning Algorithm

Unsupervised learning algorithms are classified into two categories:

  • Clustering: Clustering is a technique of grouping objects into clusters. Objects with the most similarities remain in a group and have less or no similarities with another group’s objects.
  • Association: The association rule is an unsupervised learning method that helps find relationships between variables in a large database.

Unsupervised Learning Algorithms 

Some popular unsupervised learning algorithms are:

  • K-means Clustering
  • Hierarchical Clustering
  • Principal Component Analysis
  • Apriori Algorithm
  • Anomaly detection
  • Independent Component Analysis
  • Singular value decomposition

Before we learn about hierarchical clustering, we need to know about clustering and how it is different from classification.

What is Clustering

Clustering is an important technique in unsupervised learning. It mainly deals with finding a structure or pattern in a collection of uncategorized data.

It is a technique that groups similar objects such that objects in the same group are more similar to each other than to objects in other groups. A group of similar objects is called a Cluster.

How is clustering different from classification?

For a data science beginner, the difference between clustering and classification can be confusing. So, as a first step, let us understand the fundamental difference between them.

For example,

Let us say we have four categories

  1. Dog 
  2. Cat
  3. Shark
  4. Goldfish
Clustering Vs Classification Example

In this scenario, clustering would produce two clusters: animals that live on land and animals that live in water.

So the entities of the first cluster would be dogs and cats, and the second cluster would contain sharks and goldfish.

But classification would assign the four categories to four different classes, one for each category.

So dogs would be classified under the class dog, and similarly, it would be for the rest.

In classification, we have labels to supervise whether the classification is right or not, which is how we can classify items correctly. This makes it a supervised learning problem.

But in clustering, despite the distinctions, we cannot classify items into known classes because we don’t have labels for them. That is why clustering is an unsupervised learning technique.

In real life, we can expect high volumes of data without labels, which is why clustering techniques help in so many real-world situations. Let us look at some of them.

Applications of Clustering

Clustering has a large number of applications spread across various domains. Some of the most popular applications of clustering are:

  • Recommendation Engines
  • Clustering similar news articles
  • Medical Imaging
  • Image Segmentation
  • Anomaly detection
  • Pattern Recognition

So far, we have covered what unsupervised learning is and its types, what clustering is, and various applications of clustering algorithms.

Now let us look at a detailed explanation of what hierarchical clustering is and why it is used.

What is Hierarchical Clustering

Hierarchical clustering is one of the most popular clustering techniques after k-means clustering. It is also known as Hierarchical Cluster Analysis (HCA) and is used to group unlabelled datasets into clusters.

The hierarchical clustering technique builds clusters based on the similarity between different objects in the set. It goes through the various features of the data points and looks for the similarity between them.

This process continues until all the data points have been grouped, which creates a hierarchy of clusters.

Hierarchical Clustering deals with the data in the form of a tree or a well-defined hierarchy.

Hierarchical Clustering Types Agglomerative and Divisive

Because of this, the algorithm is named the hierarchical clustering algorithm.

This hierarchical clustering can be performed in two ways:

  • Agglomerative: Hierarchy created from bottom to top. 
  • Divisive: Hierarchy created from top to bottom.

In the next section of this article, let’s learn about these two ways in detail. For now, the above image gives you a high level of understanding. 

In the earlier sections of this article, we listed various algorithms that perform clustering. But how is hierarchical clustering different from the other techniques?

Let’s discuss that.

Why Hierarchical Clustering

As we already have some clustering algorithms such as K-Means Clustering, then why do we need Hierarchical Clustering? 

As we saw in the k-means clustering algorithm article, k-means uses a pre-specified number of clusters. It requires advance knowledge of K, i.e., you have to define the number of clusters you want to divide your data into.

In hierarchical clustering, however, there is no need to pre-specify the number of clusters as we did in k-means clustering; one can stop at any number of clusters.

Furthermore, hierarchical clustering has an advantage over k-means clustering: it results in an attractive tree-based representation of the observations, called a Dendrogram.

Types of Hierarchical Clustering 

The Hierarchical Clustering technique has two types.

  • Agglomerative Hierarchical Clustering
    • Starts with each point as an individual cluster.
    • At each step, it merges the closest pair of clusters, until only one cluster (or K clusters) is left.
  • Divisive Hierarchical Clustering
    • Starts with one, all-inclusive cluster.
    • At each step, it splits a cluster, until each cluster contains a single point (or there are K clusters).

Agglomerative Clustering

It is also known as AGNES (Agglomerative Nesting) and follows the bottom-up approach.

Each observation starts with its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

That means the algorithm considers each data point as a single cluster initially and then starts combining the closest pair of clusters together. 

It repeats this process until all the clusters are merged into a single cluster that contains all the data points.

How does Agglomerative Hierarchical Clustering work 

Let’s take a sample of data and learn how agglomerative hierarchical clustering works, step by step.

Step 1

First, make each data point a “single-point cluster,” which forms N clusters (assuming there are N data points).

Agglomerative approach step 1

Step 2 

Take the two closest data points and make them one cluster; now, there are N-1 clusters.

Agglomerative approach step 2

Step 3

Again, take the two closest clusters and make them one cluster; now, there are N-2 clusters.

Agglomerative approach step 3

Step 4

Repeat ‘Step 3’ until you are left with only one cluster.

Agglomerative approach step 4

Once all the clusters are combined into one big cluster, we develop the Dendrogram to divide the clusters.
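The four steps above can be sketched in plain NumPy. This is a minimal, unoptimized illustration using single-linkage merging; the function name and toy data are ours, not part of the article:

```python
import numpy as np

def naive_agglomerative(points, k):
    """Merge the closest pair of clusters until k clusters remain."""
    points = np.asarray(points, dtype=float)
    # Step 1: each data point starts as its own single-point cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        # Steps 2-3: find the closest pair of clusters (single linkage:
        # the minimum distance between any two member points).
        best = (0, 1, np.inf)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Merge the pair, then repeat (Step 4).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Two well-separated groups of 2-D points.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(naive_agglomerative(data, 2))  # → [[0, 1, 2], [3, 4, 5]]
```

Real implementations avoid this O(N³) pairwise search; the sketch only mirrors the steps.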

Divisive hierarchical clustering works the opposite way: it treats all the data points as one cluster and splits it until it creates meaningful clusters.

Different ways to measure the distance between two clusters

There are several ways to measure the distance between two clusters in order to decide the rules for clustering, and they are often called Linkage Methods.

Some of the popular linkage methods are:

  • Single Linkage
  • Complete Linkage
  • Average Linkage
  • Centroid Linkage
  • Ward’s Linkage

Single Linkage

Single Linkage Method

Single Linkage is also known as the Minimum Linkage (MIN) method.

In the Single Linkage method, the distance between two clusters is defined as the minimum distance between an object (point) in one cluster and an object (point) in the other cluster. This method is also known as the nearest neighbor method.
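In code, the single-linkage distance is just the minimum over all cross-cluster point pairs. A minimal NumPy sketch (the function name and toy points are ours):

```python
import numpy as np

def single_linkage_distance(cluster_a, cluster_b):
    """Minimum pairwise Euclidean distance between points of two clusters."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    # Pairwise distances via broadcasting: shape (len(a), len(b)).
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return dists.min()

# The nearest pair across the two clusters sets the distance.
print(single_linkage_distance([[0, 0], [0, 2]], [[3, 0], [9, 9]]))  # → 3.0
```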

Pros and Cons of the Single Linkage method

Pros of Single Linkage

  • Single Linkage methods can handle non-elliptical shapes.
  • Single Linkage algorithms are well suited to capturing clusters of different sizes.

Cons of Single Linkage

  • Single Linkage methods are sensitive to noise and outliers.
  • That means Single Linkage methods cannot group clusters properly if there is any noise between the clusters.

Complete Linkage

Complete Linkage Method

The Complete Linkage method is also known as the Maximum Linkage (MAX) method.

In the Complete Linkage technique, the distance between two clusters is defined as the maximum distance between an object (point) in one cluster and an object (point) in the other cluster.

And this method is also known as the furthest neighbor method.
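As a sketch, complete linkage simply swaps the minimum for the maximum over all cross-cluster pairs (the function name and toy points are ours):

```python
import numpy as np

def complete_linkage_distance(cluster_a, cluster_b):
    """Maximum pairwise Euclidean distance between points of two clusters."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return dists.max()

# The furthest pair across the two clusters sets the distance.
print(complete_linkage_distance([[0, 0], [0, 2]], [[3, 0]]))
```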

Pros and Cons of Complete Linkage method

Pros of Complete Linkage

  • Complete Linkage algorithms are less susceptible to noise and outliers.
  • That means the Complete Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Complete Linkage

  • Complete linkage methods tend to break large clusters.
  • Complete Linkage is biased towards globular clusters.

Average Linkage 

Average Linkage Method

In the Average Linkage technique, the distance between two clusters is the average distance between each cluster’s point to every point in the other cluster.

This method is also known as the unweighted pair group method with arithmetic mean (UPGMA).
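A matching sketch for average linkage takes the mean of all cross-cluster pairwise distances (the function name and toy points are ours):

```python
import numpy as np

def average_linkage_distance(cluster_a, cluster_b):
    """Average of all pairwise Euclidean distances between two clusters."""
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return dists.mean()

# The mean over every cross-cluster pair sets the distance.
print(average_linkage_distance([[0, 0], [0, 2]], [[3, 0]]))
```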

Pros and Cons of the Average Linkage method

Pros of Average Linkage

  • The average Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Average Linkage

  • The average Linkage method is biased towards globular clusters.

Centroid Linkage 

Centroid Linkage Method

In the Centroid Linkage approach, the distance between the two sets or clusters is the distance between two mean vectors of the sets (clusters).

At each stage, we combine the two sets that have the smallest centroid distance. In simple words, it is the distance between the centroids of the two sets.
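A sketch of the centroid distance: compute each cluster's mean vector and take the distance between them (the function name and toy points are ours):

```python
import numpy as np

def centroid_linkage_distance(cluster_a, cluster_b):
    """Distance between the mean vectors (centroids) of two clusters."""
    ca = np.asarray(cluster_a, dtype=float).mean(axis=0)
    cb = np.asarray(cluster_b, dtype=float).mean(axis=0)
    return np.linalg.norm(ca - cb)

# Centroids are (0, 1) and (5, 1), so the distance is 5.0.
print(centroid_linkage_distance([[0, 0], [0, 2]], [[4, 1], [6, 1]]))  # → 5.0
```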

Pros and Cons of Centroid Linkage method

Pros of Centroid Linkage

  • The Centroid Linkage method also does well in separating clusters if there is any noise between the clusters.

Cons of Centroid Linkage

  • Similar to Complete Linkage and Average Linkage methods, the Centroid Linkage method is also biased towards globular clusters.

Ward’s Linkage 

Wards Linkage Method

Ward's Linkage method measures the similarity of two clusters based on the increase in squared error when the two clusters are merged. It is similar to the group average (Average Linkage) if the distances between points are squared distances.
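The merge cost Ward's method minimizes can be sketched as the increase in total squared error (SSE) caused by a merge; the function names and toy points below are ours:

```python
import numpy as np

def sse(cluster):
    """Sum of squared distances of points to their cluster mean."""
    pts = np.asarray(cluster, dtype=float)
    return ((pts - pts.mean(axis=0)) ** 2).sum()

def ward_merge_cost(cluster_a, cluster_b):
    """Increase in total squared error if the two clusters are merged."""
    return sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)

# Merging two tight, far-apart clusters carries a large cost.
print(ward_merge_cost([[0, 0], [0, 2]], [[10, 0], [10, 2]]))  # → 100.0
```

At each step, Ward's linkage merges the pair of clusters with the smallest such cost.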

Pros and Cons of Ward’s Linkage method

Pros of Ward's Linkage

  • In many cases, Ward’s Linkage is preferred, as it usually produces better cluster hierarchies.
  • Ward’s method is less susceptible to noise and outliers.

Cons of Ward's Linkage

  • Ward’s linkage method is biased towards globular clusters.

Some of the other linkage methods are:

  • Strong Linkage
  • Flexible linkage
  • Simple Average

The choice of linkage method depends on the type of problem; you can apply any of them, and different linkage methods lead to different clusters.
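One way to see this in practice is to run SciPy's `linkage` over the same data with each method. On well-separated blobs like the synthetic stand-in below, all methods agree; on noisier, overlapping data the assignments start to diverge:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two well-separated noisy blobs (synthetic stand-in data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),
               rng.normal(5, 0.5, size=(20, 2))])

# Cluster the same data with each linkage method and compare sizes.
for method in ["single", "complete", "average", "centroid", "ward"]:
    Z = linkage(X, method=method)                     # hierarchical merge tree
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 flat clusters
    print(method, np.bincount(labels)[1:])
```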

Below is a comparison image showing all the linkage methods. We took this reference image from the greatlearning platform blog.

Hierarchical Clustering Linkages

Hierarchical Clustering algorithms generate clusters that are organized into hierarchical structures.

These hierarchical structures can be visualized using a tree-like diagram called a Dendrogram.

Now let us discuss Dendrogram.

What is a Dendrogram

A Dendrogram is a diagram that represents the hierarchical relationship between objects. The Dendrogram is used to display the distance between each pair of sequentially merged objects. 

Dendrograms are commonly used when studying hierarchical clusters, before deciding on the number of clusters significant to the dataset.

The distance at which the two clusters combine is referred to as the dendrogram distance. 

The primary use of a dendrogram is to work out the best way to allocate objects to clusters.
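Cutting the dendrogram at a chosen height is how a flat clustering is read off the tree; SciPy's `fcluster` does exactly that (the toy points are ours):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 1-D points forming two obvious groups.
X = np.array([[1.0], [1.2], [1.4], [8.0], [8.2], [8.4]])
Z = linkage(X, method="single")

# Cutting at distance 1.0 (above the small within-group merge heights,
# below the large between-group merge height) yields two clusters.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```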

Hierarchical Clustering Dendrogram

The key point to interpreting or implementing a dendrogram is to focus on the closest objects in the dataset. 

From the above figure, we can observe that the objects P6 and P5 are very close to each other, so they are merged into one cluster, C1. Next, the object P4 is close to the cluster C1, so they are combined into a cluster, C2.

Similarly, the objects P1 and P2 are close to each other, so they are merged into one cluster, C3. Cluster C3 is then merged with the object P0 to form a cluster, C4. The object P3 is merged with the cluster C2, and finally the resulting cluster and C4 are merged into a single cluster, C6.

Till now, we have a clear idea of the Agglomerative Hierarchical Clustering and Dendrograms. 

Now let us implement the agglomerative clustering technique in Python.

Agglomerative Clustering Algorithm Implementation in Python 

Let us have a look at how to apply hierarchical clustering in Python on the Mall_Customers dataset.

If you remember, we used the same dataset in the k-means clustering algorithm implementation too.

Please refer to the k-means article to get the dataset.

Importing the libraries and loading the data 

We are importing all the necessary libraries, then we will load the data.
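A sketch of this step is below. With the real file you would call `pd.read_csv("Mall_Customers.csv")`; the column names shown are the usual ones for this dataset but should be verified against your copy. To keep the snippet self-contained, a tiny inline DataFrame stands in for the CSV:

```python
import pandas as pd

# Stand-in for: dataset = pd.read_csv("Mall_Customers.csv")
# (path and column names assumed; adjust to match your file).
dataset = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4],
    "Annual Income (k$)": [15, 16, 78, 80],
    "Spending Score (1-100)": [39, 81, 17, 93],
})

# Cluster on the income and spending-score features.
X = dataset[["Annual Income (k$)", "Spending Score (1-100)"]].values
print(X.shape)  # → (4, 2)
```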

Input data overview

Dendrogram to find the optimal number of clusters
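A sketch of this step using SciPy, with synthetic stand-in data in place of the customer features (with the real dataset, `X` would come from the loading step above):

```python
import numpy as np
import scipy.cluster.hierarchy as sch
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Synthetic stand-in for the two customer features (income, spending score).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal([20, 30], 5, size=(25, 2)),
               rng.normal([80, 80], 5, size=(25, 2))])

# Build the merge tree with Ward linkage and draw the dendrogram.
# The longest vertical stretch crossed by no horizontal merge line
# suggests where to cut, i.e., the number of clusters.
Z = sch.linkage(X, method="ward")
sch.dendrogram(Z)
plt.title("Dendrogram")
plt.xlabel("Customers")
plt.ylabel("Euclidean distance")
plt.savefig("dendrogram.png")
```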


Training the Hierarchical Clustering model on the dataset 

Now, we are training our dataset using Agglomerative Hierarchical Clustering.
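A sketch of the training step with scikit-learn's `AgglomerativeClustering`, again with synthetic stand-in data (with the real dataset, `X` would come from the loading step above):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Synthetic stand-in for the customer features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([20, 30], 5, size=(25, 2)),
               rng.normal([80, 80], 5, size=(25, 2))])

# n_clusters is chosen from the dendrogram; Ward linkage merges the
# pair of clusters that least increases total within-cluster variance.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(np.bincount(labels))  # sizes of the two clusters
```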

Hierarchical clustering result

Advantages and Disadvantages of Agglomerative Hierarchical Clustering Algorithm

Advantages

  • The agglomerative technique is easy to implement.
  • It can produce an ordering of objects, which may be informative for the display.
  • In agglomerative Clustering, there is no need to pre-specify the number of clusters.
  • The Agglomerative Clustering approach creates smaller clusters, which may discover similarities in data.

Disadvantages

  • The agglomerative technique gives the best results in some cases only.
  • The algorithm can never undo what was done previously: if objects were grouped incorrectly at an earlier stage, the mistake cannot be corrected later.
  • Using different distance metrics for measuring the distances between clusters may produce different results. So performing multiple experiments and then comparing the results is recommended.

Divisive Hierarchical Clustering

Hierarchical Divisive Clustering

Divisive Hierarchical Clustering is also known as DIANA (Divisive Analysis).

It is a top-down clustering approach that works like Agglomerative Clustering but in the opposite direction.

This approach starts with a single cluster containing all objects and then splits the cluster into two least similar clusters based on their characteristics. We proceed with the same process until there is one cluster for each observation. 

The divisive approach is known as rigid: once a split is made, it can't be reverted.

Steps to perform Divisive Clustering 

  • Initially, all the objects or points in the dataset belong to one single cluster.
  • Partition the single cluster into two least similar clusters.
  • Continue this process to form new clusters until the desired number of clusters is reached (at most, one cluster for each observation).
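Since most libraries ship agglomerative rather than divisive clustering, one common way to sketch the top-down steps is to repeatedly bisect the largest cluster with 2-means. This is only an approximation of the divisive idea, not the DIANA algorithm itself, and the function name and toy data are ours:

```python
import numpy as np
from sklearn.cluster import KMeans

def divisive_clustering(X, n_clusters):
    """Top-down sketch: repeatedly bisect the largest cluster with 2-means."""
    clusters = [np.arange(len(X))]        # start: one all-inclusive cluster
    while len(clusters) < n_clusters:
        # Pick the largest cluster and split it into two.
        idx = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        members = clusters.pop(idx)
        labels = KMeans(n_clusters=2, n_init=10,
                        random_state=0).fit_predict(X[members])
        clusters.append(members[labels == 0])
        clusters.append(members[labels == 1])
    return clusters

# Two well-separated blobs; one split recovers them.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)),
               rng.normal(6, 0.5, size=(10, 2))])
parts = divisive_clustering(X, 2)
print([len(p) for p in parts])
```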

Strengths and Limitations of Hierarchical Clustering Algorithm

Every algorithm has strengths and limitations. If we don't know them, we may end up using an algorithm in cases where it is not suitable. So let's learn these as well.

Strengths of Hierarchical Clustering 

  • It is easy to understand and implement.
  • We don’t have to pre-specify any particular number of clusters.
    • We can obtain any desired number of clusters by cutting the Dendrogram at the proper level.
  • The resulting clusters may correspond to meaningful classifications.
  • It is easy to decide the number of clusters by merely looking at the Dendrogram.

Limitations of Hierarchical Clustering

  • Hierarchical Clustering does not work well on vast amounts of data.
  • All the approaches to calculate the similarity between clusters have their own disadvantages.
  • In hierarchical Clustering, once a decision is made to combine two clusters, it can not be undone.
  • Different measures have problems with one or more of the following.
    • Sensitivity to noise and outliers.
    • Difficulty when handling clusters of different sizes.
    • A tendency to break large clusters.
    • Sensitivity to the order of the data, which can impact the final results.


Conclusion

In this article, we discussed the hierarchical clustering algorithm’s in-depth intuition and approaches, such as the Agglomerative Clustering and Divisive Clustering approaches.

Hierarchical Clustering is often used for descriptive rather than predictive modeling.

Mostly, we use Hierarchical Clustering when the application requires a hierarchy. The advantage of Hierarchical Clustering is that we don’t have to pre-specify the number of clusters.

However, it doesn’t work very well on vast amounts of data or huge datasets, and it gives the best results in some cases only.



I hope you like this post. If you have any questions, or want me to write an article on a specific topic, feel free to comment below.

