How to perform hierarchical clustering in R

Hierarchical clustering in r

Hierarchical clustering in r

How to perform hierarchical clustering in R

        Over the last couple of articles, We learned different classification and regression algorithms. Now in this article, We are going to learn entirely another type of algorithm. Which falls into the unsupervised learning algorithms.

If you were not aware of unsupervised learning algorithms, all the machine learning algorithms mainly classified into two main categories.

  1. Supervised learning algorithms
  2. Unsupervised learning algorithms

All the classification and regression algorithms belong to supervised learning algorithms. The other set of algorithms which fall under unsupervised learning algorithms are clustering algorithms.

In fact, the foremost algorithms to study in unsupervised learning algorithms is clustering analysis algorithms. Today we are going to learn an algorithm to perform the cluster analysis.

We have a decent number of algorithms to perform cluster analysis; In this article, we will be learning how to perform the clustering with the Hierarchical clustering algorithm.

Before we drive further. Let’s have a look at the table of contents.

Table of contents:

How to perform hierarchical clustering in R Click To Tweet

What is clustering analysis?

Clustering the name itself has a deep meaning about the ongoing process which happens in the cluster analysis. The fundamental problem clustering address is to divide the data into meaningful groups (clusters).

When we say, meaningful groups, the meaningfulness purely depends on the intention behind the purpose of the group’s formations.

Suppose if we intend to cluster the search results for a particular keyword. We Intended to find all the search results which were meaningfully similar to the search keyword.

If we intend to cluster the search results for a particular location, then we need to group the search results belongs to one specific place.

The identified cluster elements within the same cluster should be similar to each other when compared to the other cluster elements.

Suppose we have 100 articles and we want to group them into different categories. Let’s consider the below categories.

  • Sports articles
  • Business articles
  • Entertainment articles

When we group all the 100 articles into the above 3 categories. All the articles belong to the sports category will be same, In the sense, the content in the sports articles belongs to sports category.

When you pick an article from sports category and the other article from business articles. Content-wise they will be completely different. This summarises the rule of thumb condition to form clusters.

All the elements in the same cluster should be similar and elements of the different cluster should not be similar.

Now let’s have a look at one clustering analysis example to understand it better.

Clustering analysis example

Clustering Algorithms

Clustering Example

Suppose from the above example of clustering gender based on hair length. You can notice all the female cluster elements are so close to each other, Whereas the male cluster elements are far from the female cluster elements.

Here it’s easy to say the clusters in the same cluster (female or male clusters ) elements are so close together by visualizing the plot. We use different clustering algorithms to create clusters by following the thumb rule all the elements in the same cluster has to be close together.

To calculate this closeness we use different similarity measures. These similarity measures determine whether the given point is closer by giving the similarity score.

Hierarchical clustering

    Hierarchical clustering is an alternative approach to k-means clustering for identifying groups in the dataset and does not require to pre-specify the number of clusters to generate.

It refers to a set of clustering algorithms that build tree-like clusters by successively splitting or merging them. This hierarchical structure is represented using a tree.

Hierarchical clustering methods use a distance similarity measure to combine or split clusters. The recursive process continues until there is only one cluster left or we cannot split more clusters. We can use a dendrogram to represent the hierarchy of clusters.


dendrogram is a tree-like structure frequently used to illustrate the arrangement of the clusters produced by hierarchical clustering.

Hierarchical classifications produced by either 

  • Agglomerative
  • Divisive

The agglomerative or divisive route may be represented by a two-dimensional diagram known as a dendrogram, which illustrates the fusions or divisions made at each stage of the analysis. Agglomerative clustering usually yields a higher number of clusters, with fewer leaf nodes in the cluster.

In a hierarchical classification, the data are not partitioned into a particular number of classes or clusters at a single step. Instead, the classification consists of a series of partitions, which may run from a single cluster containing all individuals to n clusters each containing a single individual.

Hierarchical clustering algorithms can be either bottom-up or top-down.

Hierarchical clustering agglomerative and divisive methods

Hierarchical clustering agglomerative and divisive methods

Agglomerative clustering

Agglomerative clustering is Bottom-up technique start by considering each data point as its own cluster and merging them together into larger groups from the bottom up into a single giant cluster.

Divisive clustering

Divisive clustering is the opposite, it starts with one cluster, which is then divided in two as a function of the similarities or distances in the data. These new clusters are then divided, and so on until each case is a cluster.

Clustering linkage comparison

In this article, we describe the bottom-up approach in the detailed manner i.e. agglomerative algorithm 

The necessary steps of an agglomerative algorithm are (diagrammed visually in the figure below):

  1. Start with each point in its own cluster.
  2. Compare each pair of data points using a distance metric. This could be any of the methods discussed above.
  3. Use a linkage criterion to merge data points (at the first stage) or clusters (in subsequent phases), where the linkage is represented by a function such as:
    • Maximum or complete linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
    • Minimum or single linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
    • Mean or average linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
    • Centroid linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p variables) and the centroid for cluster 2.
    • Ward’s minimum variance method: It minimizes the total within-cluster variance. At each step, the pair of clusters with minimum between-cluster distance are merged.
Single-link Complete-link Average-link Centroid distance

Single-link | Complete-link | Average-link | Centroid distance Image Credit::

Implementing hierarchical clustering in R programming language

Data Preparation

To perform a cluster analysis in R, generally, the data should be prepared as follows:

  1. Rows are observations (individuals) and columns are variables
  2. Any missing value in the data must be removed or estimated.
  3. The data must be standardized (i.e., scaled) to make variables comparable. Recall that, standardization consists of transforming the variables such that they have mean zero and standard deviation one.

Here, we’ll use the built-in R dataset iris, which contains 3 classes of 50 instances each, where each class refers to a type of iris plant.

To remove any missing value that might be present in the data, type this:

As we don’t want the clustering algorithm to depend to an arbitrary variable unit, we start by scaling/standardizing the data using the R function scale:

Packages need to perform hierarchical clustering

  1. hclust [in stats package]
  2. agnes [in cluster package]

We can perform agglomerative HC with hclust. First, we compute the dissimilarity values with dist and then feed these values into hclust and specify the agglomeration method to be used (i.e. “complete”, “average”, “single”, “ward.D”). We can plot the dendrogram after this.


Alternatively, we can use the agnes function. These functions behave very similarly; however, with the agnes function, we can also get the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest strong clustering structure).


This allows us to find certain hierarchical clustering methods that can identify stronger clustering structures. Here we see that Ward’s method identifies the strongest clustering structure of the four methods assessed.

Cluster dendrogram

Cluster dendrogram

Visualizing clustering in 3d view

Let’s examine, this time visually. How this algorithm proceeds using a simple dataset. 

As visual representations are limited to three dimensions, we will only use three attributes, but the computation is similar if we use more attributes. We will display these using the scatterplot3d() function of the plot3D package. 

which we will install and load after creating the attributes. We then examine the clustering solution provided by hclust() in order to assess whether it confirms the impressions we get from visual inspection.


Cluster dendrogram 3D view

Cluster dendrogram 3D view

In the left panel of the above plot, there are three groups of two points that are very close to each other. Another point is quite close to each of these groups of two. 

Consider that the groups of two constitute a group of three with the points that lie closest to them. Finally, the two groups on the left are closer to each other than they are to the group of three on the right. 

If we have a look at the dendrogram, we can see that the very same pattern is visible.

Complete Code

You can clone complete codes of dataaspirant from our GitHub account


Hierarchical methods form the backbone of cluster analysis. The need for hierarchical clustering naturally emerges in domains where it is not only required to discover similarity-based groups but also need to organize them. 

This is a valuable capability wherever the complexity of similarity patterns goes beyond the limited representation power of flat clustering models.

Hierarchical clustering tends to be presented as a form of descriptive rather than predictive modeling. 

Author Bio:

This article was contributed by Perceptive Analytics. Neeru Gupta, Chaitanya Sagar, Jyothirmayee Thondamallu and Saneesh Veetil contributed to this article.

Perceptive Analytics provides data analytics, business intelligence and reporting services to e-commerce, retail and pharmaceutical industries. Our client roster includes Fortune 500 and NYSE listed companies in the USA and India.

Follow us:


I hope you like this post. If you have any questions, then feel free to comment below.  If you want me to write on one particular topic, then do tell it to me in the comments below.

Related Courses: