Hierarchical clustering
Overview
Teaching: 60 min
Exercises: 10 min
Questions
What is hierarchical clustering and how does it differ from other clustering methods?
How do we carry out hierarchical clustering in R?
What distance matrix and linkage methods should we use?
How can we validate identified clusters?
Objectives
Understand when to use hierarchical clustering on high-dimensional data.
Perform hierarchical clustering on high-dimensional data and evaluate dendrograms.
Explore different distance matrix and linkage methods.
Use the Dunn index to validate clustering methods.
Why use hierarchical clustering on high-dimensional data?
When analysing high-dimensional data in the life sciences, it is often useful to identify groups of similar data points to understand more about the relationships within the dataset. In hierarchical clustering an algorithm groups similar data points (or observations) into groups (or clusters). This results in a set of clusters, where each cluster is distinct, and the data points within each cluster have similar characteristics. The clustering algorithm works by iteratively grouping data points so that different clusters may exist at different stages of the algorithm’s progression.
Unlike K-means clustering, hierarchical clustering does not require the number of clusters $k$ to be specified by the user before the analysis is carried out. Hierarchical clustering also provides an attractive dendrogram, a tree-like diagram showing the degree of similarity between clusters.
The dendrogram is a key feature of hierarchical clustering. This tree-shaped graph allows the similarity between data points in a dataset to be visualised and the arrangement of clusters produced by the analysis to be illustrated. Dendrograms are created using a distance (or dissimilarity) measure that quantifies how different pairs of observations are, and a clustering algorithm to fuse groups of similar data points together.
In this episode we will explore hierarchical clustering for identifying clusters in high-dimensional data. We will use agglomerative hierarchical clustering (see box) in this episode.
Agglomerative and Divisive hierarchical clustering
There are two main methods of carrying out hierarchical clustering: agglomerative clustering and divisive clustering. The former is a ‘bottom-up’ approach to clustering whereby the clustering approach begins with each data point (or observation) being regarded as being in its own separate cluster. Pairs of data points are merged as we move up the tree. Divisive clustering is a ‘top-down’ approach in which all data points start in a single cluster and an algorithm is used to split groups of data points from this main group.
The agglomerative hierarchical clustering algorithm
To start with, we measure the distance (or dissimilarity) between pairs of observations. Initially, and at the bottom of the dendrogram, each observation is considered to be in its own individual cluster. We start the clustering procedure by fusing the two observations that are most similar according to a distance matrix. Next, the next-most similar observations are fused so that the total number of clusters is the number of observations minus 2 (see panel below). Groups of observations may then be merged into a larger cluster (see next panel below, green box). This process continues until all the observations are included in a single cluster.

Figure 1a: Example data showing two clusters of observation pairs

Figure 1b: Example data showing fusing of one observation into larger cluster
A motivating example
To motivate this lesson, let’s first look at an example where hierarchical clustering is really useful, and then we can understand how to apply it in more detail. To do this, we’ll return to the large methylation dataset we worked with in the regression lessons. Let’s load the data and look at it.
library("minfi")
library("here")
library("ComplexHeatmap")
methyl <- readRDS(here("data/methylation.rds"))
# transpose this Bioconductor dataset to show features in columns
methyl_mat <- t(assay(methyl))
Looking at a heatmap of these data, we may spot some patterns – many columns appear to have similar methylation levels across all rows. However, they are all quite jumbled at the moment, so it’s hard to tell how many line up exactly.

plot of chunk heatmap-noclust
We can order these data to make the patterns clearer using hierarchical clustering. To do this, we can change the arguments we pass to Heatmap() from the ComplexHeatmap package. Heatmap() groups features based on dissimilarity (here, Euclidean distance) and orders rows and columns to show clustering of features and observations.
Heatmap(methyl_mat,
name = "Methylation level",
cluster_rows = TRUE, cluster_columns = TRUE,
row_dend_width = unit(0.2, "npc"),
column_dend_height = unit(0.2, "npc"),
show_row_names = FALSE, show_column_names = FALSE,
row_title="Individuals", column_title = "Methylation sites"
)

plot of chunk heatmap-clust
We can see that clustering the features (CpG sites) results in an overall gradient of high to low methylation levels from left to right. Maybe more interesting is the fact that the rows (corresponding to individuals) are now grouped according to methylation patterns. For example, 12 samples seem to have lower methylation levels for a small subset of CpG sites in the middle, relative to all the other samples. It’s not clear without investigating further what the cause of this is – it could be a batch effect, or a known grouping (e.g., old vs young samples). However, clustering like this can be a useful part of exploratory analysis of data to build hypotheses.
Now, let’s cover the inner workings of hierarchical clustering in more detail. There are two things to consider before carrying out clustering:
- how to define dissimilarity between observations using a distance matrix, and
- how to define dissimilarity between clusters and when to fuse separate clusters.
Creating the distance matrix
Agglomerative hierarchical clustering is performed in two steps: calculating the distance matrix (containing distances between pairs of observations) and iteratively grouping observations into clusters using this matrix.
There are different ways to specify a distance matrix for clustering:
- Specify distance as a pre-defined option using the method argument in dist(). Methods include euclidean (default), maximum and manhattan.
- Create a self-defined function which calculates distance from a matrix or from two vectors. The function should only contain one argument.
Of the pre-defined methods for calculating the distance matrix, Euclidean distance is one of the most commonly used. This method calculates the shortest straight-line distance between pairs of observations.
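For two observations $x$ and $y$ measured on $p$ features, the Euclidean distance is

$$d(x, y) = \sqrt{\sum_{j=1}^{p} (x_j - y_j)^2}.$$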
Another option is to use a correlation matrix as the input matrix to the clustering algorithm. The type of distance matrix used in hierarchical clustering can have a big effect on the resulting tree. The decision of which distance matrix to use before carrying out hierarchical clustering depends on the type of data and question to be addressed.
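As a minimal sketch of these options (toy_mat is a small made-up matrix, used purely for illustration), different distance matrices can be computed as follows:

set.seed(1)
toy_mat <- matrix(rnorm(20), nrow = 5) # 5 observations (rows), 4 features (columns)
dist(toy_mat)                          # Euclidean distance (the default)
dist(toy_mat, method = "manhattan")    # Manhattan (city-block) distance
as.dist(1 - cor(t(toy_mat)))           # correlation-based dissimilarity between observations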
Linkage methods
The second step in performing hierarchical clustering after defining the distance matrix (or another function defining similarity between data points) is determining how to fuse different clusters.
Linkage is used to define dissimilarity between groups of observations (or clusters) and is used to create the hierarchical structure in the dendrogram. Different linkage methods of creating a dendrogram are discussed below.
hclust() supports various linkage methods (e.g. complete, single, ward.D, ward.D2, average, median) and these are also supported within the Heatmap() function. The method used to perform hierarchical clustering in Heatmap() can be specified by the arguments clustering_method_rows and clustering_method_columns. Each linkage method uses a slightly different algorithm to calculate how clusters are fused together and therefore different clustering decisions are made depending on the linkage method used.
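As a brief sketch (average linkage is chosen here purely for illustration), the linkage method used by Heatmap() could be changed like this:

## sketch: specify the linkage method used by Heatmap() for rows and columns
Heatmap(methyl_mat,
    name = "Methylation level",
    clustering_method_rows = "average",
    clustering_method_columns = "average",
    show_row_names = FALSE, show_column_names = FALSE
)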
Complete linkage (the default in hclust()) works by computing all pairwise dissimilarities between data points in different clusters. For each pair of clusters, it sets their dissimilarity ($d$) to the maximum dissimilarity value observed between any of these clusters’ constituent points. The two clusters with the smallest value of $d$ are then fused.
Computing a dendrogram
Dendrograms are useful tools to visualise the grouping of points and clusters into bigger clusters.
We can create and plot dendrograms in R using hclust()
which takes
a distance matrix as input and creates the associated tree using hierarchical
clustering. Here we create some example data to carry out hierarchical
clustering.
Let’s generate 20 data points in 2D space. Each point belongs to one of three classes. Suppose we did not know which class the data points belonged to and we wanted to identify them via cluster analysis. Hierarchical clustering carried out on the data can be used to produce a dendrogram showing how the data is partitioned into clusters. But how do we interpret this dendrogram? Let’s explore this using our example data.
#First, create some example data with two variables x1 and x2
set.seed(450)
example_data <- data.frame(
x1 = rnorm(20, 8, 4.5),
x2 = rnorm(20, 6, 3.4)
)
#plot the data and name data points by row numbers
plot(example_data$x1, example_data$x2, type = "n")
text(
example_data$x1,
example_data$x2,
labels = rownames(example_data),
cex = 0.7
)

plot of chunk plotexample
## calculate distance matrix using euclidean distance
dist_m <- dist(example_data, method = "euclidean")
Challenge 1
Use hclust() to implement hierarchical clustering using the distance matrix dist_m and the complete linkage method, and plot the results as a dendrogram using plot().
Solution:
clust <- hclust(dist_m, method = "complete")
plot(clust)
plot of chunk plotclustex
This dendrogram shows similarities/differences in distances between data points. Each leaf of the dendrogram represents one of the 20 data points. These leaves fuse into branches as the height increases. Observations that are similar fuse into the same branches. The height at which any two data points fuse indicates how different these two points are. Points that fuse at the top of the tree are very different from each other compared with two points that fuse at the bottom of the tree, which are quite similar. You can see this by comparing the position of similar/dissimilar points according to the scatterplot with their position on the tree.
Identifying clusters based on the dendrogram
To do this, we can make a horizontal cut through the dendrogram at a user-defined height. The sets of observations beneath this cut can be thought of as distinct clusters. For example, a cut at height 10 produces two downstream clusters while a cut at height 4 produces six downstream clusters.
We can cut the dendrogram to determine number of clusters at different heights
using cutree()
. This function cuts a dendrogram into several
groups (or clusters) where the number of desired groups is controlled by the
user, by defining either k
(number of groups) or h
(height at which tree is
cut).
## k is a user defined parameter determining
## the desired number of clusters at which to cut the tree
cutree(clust, k = 3)
[1] 1 1 1 2 1 1 1 1 1 2 2 1 1 3 1 1 2 2 1 2
## h is a user defined parameter determining
## the numeric height at which to cut the tree
cutree(clust, h = 10)
[1] 1 1 1 2 1 1 1 1 1 2 2 1 1 3 1 1 2 2 1 2
## both give same results
four_cut <- cutree(clust, h = 4)
## we can produce the cluster each observation belongs to
## using the mutate and count functions
library(dplyr)
example_cl <- mutate(example_data, cluster = four_cut)
count(example_cl, cluster)
cluster n
1 1 2
2 2 4
3 3 1
4 4 3
5 5 4
6 6 2
7 7 3
8 8 1
#plot cluster each point belongs to on original scatterplot
library(ggplot2)
ggplot(example_cl, aes(x = x2, y = x1, color = factor(cluster))) + geom_point()

plot of chunk cutree
Note that this cut produces 8 clusters (two before the cut and another six downstream of the cut).
Challenge 2
Identify the value of k in cutree() that gives the same output as h = 5.
Solution:
plot(clust)
## create horizontal line at height = 5
abline(h = 5, lty = 2)
plot of chunk h-k-ex-plot
cutree(clust, h = 5)
[1] 1 2 2 3 4 5 2 5 1 6 6 2 5 7 4 4 6 6 5 6
cutree(clust, k = 7)
[1] 1 2 2 3 4 5 2 5 1 6 6 2 5 7 4 4 6 6 5 6
five_cut <- cutree(clust, h = 5)
library(dplyr)
example_cl <- mutate(example_data, cluster = five_cut)
count(example_cl, cluster)
cluster n
1 1 2
2 2 4
3 3 1
4 4 3
5 5 4
6 6 5
7 7 1
library(ggplot2)
ggplot(example_cl, aes(x = x2, y = x1, color = factor(cluster))) + geom_point()
plot of chunk h-k-ex-plot
Seven clusters (k = 7) gives the same results as h = 5. You can plot a horizontal line on the dendrogram at h = 5 to help identify the corresponding value of k.
Highlighting dendrogram branches
In addition to visualising cluster identity in scatter plots, it is also possible to highlight branches in dendrograms. In this example, we calculate a distance matrix between samples in the methyl_mat dataset. We then draw boxes around the clusters obtained with cutree().
## create a distance matrix using euclidean method
distmat <- dist(methyl_mat)
## hierarchical clustering using complete method
clust <- hclust(distmat)
## plot resulting dendrogram
plot(clust)
## draw border around three clusters
rect.hclust(clust, k = 3, border = 2:6)
## draw border around two clusters
rect.hclust(clust, k = 2, border = 2:6)

plot of chunk plot-clust-method
We can also colour clusters downstream of a specified cut using color_branches()
from the dendextend
package.
## cut tree at height = 50
cut <- cutree(clust, h = 50)
library("dendextend")
avg_dend_obj <- as.dendrogram(clust)
## colour branches of dendrogram depending on clusters
plot(color_branches(avg_dend_obj, h = 50))

plot of chunk plot-coloured-branches
The effect of different linkage methods
Now let us look into changing the default behaviour of hclust()
. Imagine we have two crescent-shaped point clouds as shown below.
# These two functions are to help us make crescents. Don't worry if you do not understand all this code.
# The important bit is the object "cres", which consists of two columns (x and y coordinates of two crescents).
is.insideCircle <- function(co, r=0.5, offs=c(0,0)){
sqrt((co[,1]+offs[1])^2 + (co[,2]+offs[2])^2) <= r
}
make.crescent <- function(n){
raw <- cbind(x=runif(n)-0.5, y=runif(n)-0.5)
raw[is.insideCircle(raw) & !is.insideCircle(raw, offs=c(0, -0.2)),]
}
# make x/y data in shape of two crescents
set.seed(123)
cres1 <- make.crescent(1000) # 1st crescent
cres2 <- make.crescent(1000) # 2nd crescent
cres2[,2] <- -cres2[,2] -0.1 # flip 2nd crescent upside-down and shift down
cres2[,1] <- cres2[,1] + 0.5 # shift second crescent to the right
cres <- rbind(cres1, cres2) # concatenate x/y values
plot(cres)

plot of chunk crescents
We might expect that the crescents are resolved into separate clusters. But if we run hierarchical clustering with the default arguments, we get this:
cresClass <- cutree(hclust(dist(cres)), k=2) # save partition for colouring
plot(cres, col=cresClass) # colour scatterplot by partition

plot of chunk cresClustDefault
Challenge 3
Carry out hierarchical clustering on the cres data that we generated above. Try out different linkage methods and use cutree() to split each resulting dendrogram into two clusters. Plot the results, colouring the dots according to their inferred cluster identity.
Which method(s) give you the expected clustering outcome?
Hint: Check ?hclust to see the possible values of the argument method (the linkage method used).
Solution:
#?hclust # "complete", "single", "ward.D", "ward.D2", "average", "mcquitty", "median" or "centroid"
cresClassSingle <- cutree(hclust(dist(cres), method = "single"), k = 2)
plot(cres, col = cresClassSingle, main = "single")
plot of chunk plot-clust-comp
cresClassWard.D <- cutree(hclust(dist(cres), method = "ward.D"), k = 2)
plot(cres, col = cresClassWard.D, main = "ward.D")
plot of chunk plot-clust-wardD
cresClassWard.D2 <- cutree(hclust(dist(cres), method = "ward.D2"), k = 2)
plot(cres, col = cresClassWard.D2, main = "ward.D2")
plot of chunk plot-clust-wardD2
cresClassAverage <- cutree(hclust(dist(cres), method = "average"), k = 2)
plot(cres, col = cresClassAverage, main = "average")
plot of chunk plot-clust-average
cresClassMcquitty <- cutree(hclust(dist(cres), method = "mcquitty"), k = 2)
plot(cres, col = cresClassMcquitty, main = "mcquitty")
plot of chunk plot-clust-mcq
cresClassMedian <- cutree(hclust(dist(cres), method = "median"), k = 2)
plot(cres, col = cresClassMedian, main = "median")
plot of chunk plot-clust-median
cresClassCentroid <- cutree(hclust(dist(cres), method = "centroid"), k = 2)
plot(cres, col = cresClassCentroid, main = "centroid")
plot of chunk plot-clust-centroid
The linkage methods single, ward.D, and average resolve each crescent as a separate cluster.
The help page of hclust() gives some intuition on linkage methods. It describes complete (the default) and single as opposite ends of a spectrum, with all other methods in between. When using complete linkage, the distance between two clusters is taken to be the distance between the two clusters’ most distant points. The opposite is true for single linkage, where the minimum distance between any two points, one from each of the two clusters, is used. Single linkage is described as a friends-of-friends approach - and indeed, it groups all close-together points into the same cluster (thus resolving one cluster per crescent). Complete linkage, on the other hand, recognises that some points at the tip of a crescent are much closer to points in the other crescent and so it splits both crescents.
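To make this concrete, here is a minimal sketch that computes the single- and complete-linkage distances between the two crescents directly from the cross-distance matrix (using the cres1 and cres2 objects created above):

## all pairwise distances between points in crescent 1 and points in crescent 2
cross <- as.matrix(dist(rbind(cres1, cres2)))
cross <- cross[seq_len(nrow(cres1)), nrow(cres1) + seq_len(nrow(cres2))]
min(cross) # single linkage: distance between the closest pair of points
max(cross) # complete linkage: distance between the most distant pair of points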
Using different distance methods
So far, we’ve been using Euclidean distance to define the dissimilarity or distance between observations. However, this isn’t always the best metric for how dissimilar different observations are. Let’s make an example to demonstrate. Here, we’re creating two samples each with ten observations of random noise:
set.seed(20)
cor_example <- data.frame(
sample_a = rnorm(10),
sample_b = rnorm(10)
)
rownames(cor_example) <- paste(
"Feature", 1:nrow(cor_example)
)
Now, let’s create a new sample that has exactly the same pattern across all
our features as sample_a
, just offset by 5:
cor_example$sample_c <- cor_example$sample_a + 5
You can see that this is a lot like the assay()
of our methylation object
from earlier, where columns are observations or samples, and rows are features:
head(cor_example)
sample_a sample_b sample_c
Feature 1 1.1626853 -0.02013537 6.162685
Feature 2 -0.5859245 -0.15038222 4.414076
Feature 3 1.7854650 -0.62812676 6.785465
Feature 4 -1.3325937 1.32322085 3.667406
Feature 5 -0.4465668 -1.52135057 4.553433
Feature 6 0.5696061 -0.43742787 5.569606
If we plot a heatmap of this, we can see that sample_a
and sample_b
are
grouped together because they have a small distance to each other, despite
being quite different in their pattern across the different features.
In contrast, sample_a
and sample_c
are very distant, despite having
exactly the same pattern across the different features.
Heatmap(as.matrix(cor_example))

plot of chunk heatmap-cor-example
We can see that more clearly if we do a line plot:
## create a blank plot (type = "n" means don't draw anything)
## with an x range to hold the number of features we have.
## the range of y needs to be enough to show all the values for every feature
plot(
1:nrow(cor_example),
rep(range(cor_example), 5),
type = "n"
)
## draw a red line for sample_a
lines(cor_example$sample_a, col = "firebrick")
## draw a blue line for sample_b
lines(cor_example$sample_b, col = "dodgerblue")
## draw a green line for sample_c
lines(cor_example$sample_c, col = "forestgreen")

plot of chunk lineplot-cor-example
We can see that sample_a
and sample_c
have exactly the same pattern across
all of the different features. However, due to the overall difference between
the values, they have a high distance to each other.
We can see that if we cluster and plot the data ourselves using Euclidean
distance:
clust_dist <- hclust(dist(t(cor_example)))
plot(clust_dist)

plot of chunk clust-euc-cor-example
In some cases, we might want to ensure that samples that have similar patterns,
whether that be of gene expression, or DNA methylation, have small distances
to each other. Correlation is a measure of this kind of similarity in pattern.
However, high correlations indicate similarity, while for a distance measure
we know that high distances indicate dissimilarity. Therefore, if we wanted
to cluster observations based on the correlation, or the similarity of patterns,
we can use 1 - cor(x)
as the distance metric.
The input to hclust()
must be a dist
object, so we also need to call
as.dist()
on it before passing it in.
cor_as_dist <- as.dist(1 - cor(cor_example))
clust_cor <- hclust(cor_as_dist)
plot(clust_cor)

plot of chunk clust-cor-cor-example
Now, sample_a
and sample_c
that have identical patterns across the features
are grouped together, while sample_b
is seen as distant because it has a
different pattern, even though its values are closer to sample_a
.
Using your own distance function is often useful, especially if you have missing or unusual data. It’s often possible to pass correlation-based and other custom distance functions to functions that perform hierarchical clustering, such as pheatmap() and stats::heatmap():
## pheatmap allows you to select correlation directly
library("pheatmap")
pheatmap(as.matrix(cor_example), clustering_distance_cols = "correlation")

plot of chunk heatmap-cor-cor-example
## Using the built-in stats::heatmap
heatmap(
as.matrix(cor_example),
distfun = function(x) as.dist(1 - cor(t(x)))
)

plot of chunk heatmap-cor-cor-example
Validating clusters
Now that we know how to carry out hierarchical clustering, how do we know how many clusters are optimal for the dataset?
Hierarchical clustering carried out on any dataset will produce clusters, even when there are no ‘real’ clusters in the data! We need to be able to determine whether identified clusters represent true groups in the data, or whether clusters have been identified just due to chance. In the last episode, we introduced silhouette scores as a measure of cluster compactness and bootstrapping as a way to assess cluster robustness. Such tests can be used to compare different clustering algorithms, for example, those fitted using different linkage methods.
Here, we introduce the Dunn index, which is a measure of cluster compactness and separation. The Dunn index is the ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance found within any cluster. This can be seen as a family of indices which differ depending on the method used to compute distances. The Dunn index penalises clusters that have larger intra-cluster variance and smaller inter-cluster variance. The higher the Dunn index, the better defined the clusters.
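In its simplest form, for clusters $C_1, \dots, C_k$ the Dunn index can be written as

$$D = \frac{\min_{i \neq j} d(C_i, C_j)}{\max_{l} \mathrm{diam}(C_l)},$$

where $d(C_i, C_j)$ is the distance between clusters $C_i$ and $C_j$ and $\mathrm{diam}(C_l)$ is the largest distance between any two observations within cluster $C_l$.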
Let’s calculate the Dunn index for clustering carried out on the
methyl_mat
dataset using the clValid
package.
## calculate dunn index
## (ratio of the smallest distance between obs not in the same cluster
## to the largest intra-cluster distance)
library("clValid")
## calculate euclidean distance between points
distmat <- dist(methyl_mat)
clust <- hclust(distmat, method = "complete")
plot(clust)

plot of chunk plot-clust-dunn
cut <- cutree(clust, h = 50)
## retrieve Dunn's index for given matrix and clusters
dunn(distance = distmat, cut)
[1] 0.8823501
The value of the Dunn index has no meaning in itself, but is used to compare between sets of clusters with larger values being preferred.
Challenge 4
Examine how changing the h or k arguments in cutree() affects the value of the Dunn index.
Solution:
library("clValid") distmat <- dist(methyl_mat) clust <- hclust(distmat, method = "complete") plot(clust)
plot of chunk dunn-ex
#Varying h
## Obtaining the clusters
cut_h_20 <- cutree(clust, h = 20)
cut_h_30 <- cutree(clust, h = 30)
## How many clusters?
length(table(cut_h_20))
[1] 36
length(table(cut_h_30))
[1] 14
dunn(distance = distmat, cut_h_20)
[1] 1.61789
dunn(distance = distmat, cut_h_30)
[1] 0.8181846
#Varying k
## Obtaining the clusters
cut_k_10 <- cutree(clust, k = 10)
cut_k_5 <- cutree(clust, k = 5)
## How many clusters?
length(table(cut_k_5))
[1] 5
length(table(cut_k_10))
[1] 10
dunn(distance = distmat, cut_k_5)
[1] 0.8441528
dunn(distance = distmat, cut_k_10)
[1] 0.7967132
The figures below show in a more systematic way how changing the values of k and h using cutree() affects the Dunn index.
h_seq <- 70:10
h_dunn <- sapply(h_seq, function(x) dunn(distance = distmat, cutree(clust, h = x)))
k_seq <- seq(2, 10)
k_dunn <- sapply(k_seq, function(x) dunn(distance = distmat, cutree(clust, k = x)))
plot(h_seq, h_dunn, xlab = "Height (h)", ylab = "Dunn index")
grid()

Figure 3: Dunn index
You can see that at low values of h, the Dunn index can be high. But this is not very useful - cutting the given tree at a low h value like 15 leads to almost all observations ending up in their own individual clusters. More relevant is the second maximum in the plot, around h = 55. Looking at the dendrogram, this corresponds to k = 4.
plot(k_seq, k_dunn, xlab = "Number of clusters (k)", ylab = "Dunn index")
grid()

Figure 4: Dunn index continued
For the given range of k
values explored, we obtain the highest Dunn index with k=4
.
This is in agreement with the previous plot.
There have been criticisms of the use of the Dunn index in validating clustering results, due to its high sensitivity to noise in the dataset. An alternative is to use silhouette scores (see the k-means clustering episode).
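As a short sketch, the silhouette() function from the cluster package can be used to compute silhouette widths for the clustering above (this assumes the distmat and cut objects created earlier in this section):

## silhouette widths as an alternative validation measure
library("cluster")
sil <- silhouette(cut, distmat)
## average silhouette width across all observations
## (values closer to 1 indicate better-defined clusters)
mean(sil[, "sil_width"])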
As we said before (see the previous episode), clustering is a non-trivial task. It is important to think about the nature of your data and your expectations rather than blindly using some algorithm for clustering or cluster validation.
Further reading
- Dunn, J. C. (1974) Well-separated clusters and optimal fuzzy partitions. Journal of Cybernetics 4(1):95–104.
- Halkidi, M., Batistakis, Y. & Vazirgiannis, M. (2001) On clustering validation techniques. Journal of Intelligent Information Systems 17(2/3):107-145.
- James, G., Witten, D., Hastie, T. & Tibshirani, R. (2013) An Introduction to Statistical Learning with Applications in R. Section 10.3.2 (Hierarchical Clustering).
- Understanding the concept of Hierarchical clustering Technique, Towards Data Science blog.
Key Points
Hierarchical clustering uses an algorithm to group similar data points into clusters. A dendrogram is used to plot relationships between clusters (using the hclust() function in R).
Hierarchical clustering differs from k-means clustering as it does not require the user to specify the expected number of clusters.
The distance (dissimilarity) matrix can be calculated in various ways, and different clustering algorithms (linkage methods) can affect the resulting dendrogram.
The Dunn index can be used to validate clusters using the original dataset.