Lesson Objectives

Understand how clustering methods can be used to partition genes with similar trends
Apply hierarchical clustering in R using hclust()

Further resources

Short article about clustering using transcriptome data:
- Naomi Altman & Martin Krzywinski (2017) Clustering, Nature Methods 14, 545–546
StatQuest video by Josh Starmer (watch all of them!):
- Hierarchical Clustering
Book chapter from Holmes & Huber Modern Statistics for Modern Biology:
- Clustering

Setup

Make sure you go through the previous lesson first, or run the following code:

##### setup ####

# load packages
library(tidyverse)

# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")


##### get counts for candidate genes ####

# set of candidate genes for clustering
candidate_genes <- test_result %>% 
  filter(padj < 0.01) %>%    # filter table
  pull(gene) %>%             # extract the gene column as a vector
  unique()                   # retain only unique values

# Summarise counts 
trans_cts_mean <- trans_cts %>% 
  # convert to long format
  pivot_longer(cols = wt_0_r1:mut_180_r3, names_to = "sample", values_to = "cts")  %>% 
  # join with sample info table
  full_join(sample_info, by = ("sample")) %>% 
  # filter to retain only genes of interest
  filter(gene %in% candidate_genes) %>% 
  # for each gene
  group_by(gene) %>% 
  # scale the cts column
  mutate(cts_scaled = (cts - mean(cts))/sd(cts)) %>% 
  # for each gene, strain and minute
  group_by(gene, strain, minute) %>%
  # calculate the mean (scaled) cts
  summarise(mean_cts_scaled = mean(cts_scaled),
            nrep = n()) %>% 
  ungroup()

Clustering basics

Similarly to what we explored in the PCA lesson, clustering methods can be helpful to group similar datapoints together.

There are different clustering algorithms and methods. Here we’re going to focus on hierarchical clustering, which is commonly used in exploratory data analysis. Another method that is commonly used is k-means, which we won’t cover here.

The idea with these clustering methods, is that they can help us interpret high dimensional data. In our case, we aim to understand what gene expression patterns we have across time in the two mutant strains.

Gene partitioning using hierarchical clustering

We will use hierarchical clustering to try and find some structure in our gene expression trends, and partition our genes into different clusters.

There’s two steps to this clustering procedure:

Calculate a “distance” metric between each pair of genes
Cluster the genes hierarchically using a particular agglomeration method

There are many choices that can be made at both steps. We will use a simple euclidean distance metric and complete linkage clustering method, which work well for these data (but you can experiment with other metrics/methods, that’s why it’s called “exploratory analysis”!).

In R, we can use the dist() function for the first step and hclust() for the second step.

Calculating distance between samples using `dist()`

The dist() function works best with a matrix of data. Like we did in the PCA lesson, let’s create a matrix from our table of counts.

# Create a matrix
hclust_matrix <- trans_cts %>% 
  select(-gene) %>% 
  as.matrix()

# assign rownames
rownames(hclust_matrix) <- trans_cts$gene

However, we do not want to use all of the genes in that matrix, but rather the subset of genes that we think are differentially expressed between time points. To subset a matrix, we need to use the “square bracket” notation [rows , columns]. Because the rows on our matrix are named, we can use our vector of gene names created earlier (candidate_genes) to do this:

hclust_matrix <- hclust_matrix[candidate_genes, ]

After this step, we want to scale the data (to obtain z-scores). The scale() function can be used with a matrix, where it will scale each column by its mean and standard deviation. However, we want to scale the expression of our genes, which are the rows of the matrix! So, we need to do a little gymnastics here, and first transpose our matrix, then scale, then transpose it back again.

We can do it with pipes, which makes the order of operations quite clear:

hclust_matrix <- hclust_matrix %>% 
  # transpose the matrix so genes are as columns
  t() %>% 
  # apply scalling to each column of the matrix (genes)
  scale() %>% 
  # transpose back so genes are as rows again
  t()

Finally, we can now calculate the distance between each gene (row) in our matrix:

gene_dist <- dist(hclust_matrix)

Perform hierarchical clustering using `hclust()`

With this distance matrix, we are ready to apply the hclust() function:

gene_hclust <- hclust(gene_dist, method = "complete")

# The default `plot()` function can be used to produce a simple dendrogram
plot(gene_hclust, labels = FALSE)
abline(h = 10, col = "brown", lwd = 2) # add horizontal line to illustrate cutting dendrogram

From the dendrogram above, we can see there’s substantial structure in our data. We can use the dendrogram to visualy determine how many groups we think are worth focusing on. You can imagine having an horizontal line that you slide down the dendrogram, cutting it into different groups (e.g. the brown line shown above).

We can use the cutree() function do this dendrogram “cutting”. For example, if you want to cut it into 5 groups, you would simply do:

cutree(gene_hclust, k = 5)

##   SPAC11D3.05  SPAC5H10.13c   SPAC13C5.04   SPAC2G11.12    SPAC821.09 
##             1             2             1             2             3 
## SPAC22A12.17c   SPAC10F6.15   SPAC1002.20    SPAC3H1.11 SPAC23C11.06c 
##             1             4             1             2             1

The output of cutree() is a named vector, where each gene is assigned to a cluster number (the numbers are somewhat arbitrary, but genes with the same number belong to the same cluster.

For analysis purposes, it’s convenient to convert this vector into a tibble, so we can join it with our table of gene expression counts that will be used to make a plot:

gene_cluster <- cutree(gene_hclust, k = 5) %>% 
  # turn the named vector into a tibble
  enframe() %>% 
  # rename some of the columns
  rename(gene = name, cluster = value)

head(gene_cluster)

## # A tibble: 6 x 2
##   gene          cluster
##   <chr>           <int>
## 1 SPAC11D3.05         1
## 2 SPAC5H10.13c        2
## 3 SPAC13C5.04         1
## 4 SPAC2G11.12         2
## 5 SPAC821.09          3
## 6 SPAC22A12.17c       1

Visualise gene expression trends per cluster

Finally, we are ready to visualise our gene expression trends, but separate our genes into their respective clusters.

First, let’s join our table with gene clusters to the table of gene summarised counts that we produced in the previous lesson:

trans_cts_cluster <- trans_cts_mean %>% 
  inner_join(gene_cluster, by = "gene")

head(trans_cts_cluster)

## # A tibble: 6 x 6
##   gene         strain minute mean_cts_scaled  nrep cluster
##   <chr>        <chr>   <dbl>           <dbl> <int>   <int>
## 1 SPAC1002.17c mut         0          -0.639     3       4
## 2 SPAC1002.17c mut        15          -0.584     3       4
## 3 SPAC1002.17c mut        30           1.03      3       4
## 4 SPAC1002.17c mut        60           2.12      3       4
## 5 SPAC1002.17c mut       120          -0.635     3       4
## 6 SPAC1002.17c mut       180          -1.03      3       4

And now we can make a plot, facetting according to strain and cluster:

trans_cts_cluster %>% 
  ggplot(aes(minute, mean_cts_scaled)) +
  geom_line(aes(group = gene)) +
  facet_grid(rows = vars(strain), cols = vars(cluster))

This is a much nicer picture, as we can now see how different clusters represent genes with (broadly) similar expression trends. It’s still not perfect, and you can play around by cutting the tree into even more groups and re-doing these plots.

It is worth pointing out that this result is not entirely surprising. Since we pre-filtered our genes to include those that showed marked difference between T0 and any of the other time points, it’s no surprise that our genes partition into groups having a peak (or dip) of expression at a particular time-point. However, this analysis illustrates how clustering can be used to help partition genes according to particular patterns, regardless of how you pre-selected them.

Finally, here is an advanced ggplot2 trick to add a line to each facet showing the median expression in each cluster:

trans_cts_cluster %>% 
  ggplot(aes(minute, mean_cts_scaled)) +
  geom_line(aes(group = gene), alpha = 0.3) +
  geom_line(stat = "summary", fun = "median", colour = "brown", size = 1.5, 
            aes(group = 1)) +
  facet_grid(rows = vars(strain), cols = vars(cluster))

Clustering using heatmap

Often, gene expression patterns are represented as heatmaps. These can be easily created using specialised functions that do the clustering and create the heatmap for you.

One package that provides many adanced features is the R/Bioconductor ComplexHeatmap.

It’s basic usage is:

library(ComplexHeatmap)
Heatmap(hclust_matrix, show_row_names = FALSE)

But you can make much more complex heatmaps - look at its full documentation

But note that although heatmaps are quite fun and colourful, they may not always be the best representation for your data. In this case, since we have time-data, the graphs we made ourselves above might be more adequate.

back to lesson’s homepage

Exploring gene expression patterns using clustering methods