hclust()
Make sure you go through the previous lesson first, or run the following code:
##### setup ####
# load packages
library(tidyverse)
# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")
##### get counts for candidate genes ####
# set of candidate genes for clustering
candidate_genes <- test_result %>%
filter(padj < 0.01) %>% # filter table
pull(gene) %>% # extract the gene column as a vector
unique() # retain only unique values
# Summarise counts
trans_cts_mean <- trans_cts %>%
# convert to long format
pivot_longer(cols = wt_0_r1:mut_180_r3, names_to = "sample", values_to = "cts") %>%
# join with sample info table
full_join(sample_info, by = ("sample")) %>%
# filter to retain only genes of interest
filter(gene %in% candidate_genes) %>%
# for each gene
group_by(gene) %>%
# scale the cts column
mutate(cts_scaled = (cts - mean(cts))/sd(cts)) %>%
# for each gene, strain and minute
group_by(gene, strain, minute) %>%
# calculate the mean (scaled) cts
summarise(mean_cts_scaled = mean(cts_scaled),
nrep = n()) %>%
ungroup()
Similarly to what we explored in the PCA lesson, clustering methods can be helpful to group similar datapoints together.
There are different clustering algorithms and methods. Here we’re going to focus on hierarchical clustering, which is commonly used in exploratory data analysis. Another method that is commonly used is k-means, which we won’t cover here.
The idea with these clustering methods, is that they can help us interpret high dimensional data. In our case, we aim to understand what gene expression patterns we have across time in the two mutant strains.
We will use hierarchical clustering to try and find some structure in our gene expression trends, and partition our genes into different clusters.
There’s two steps to this clustering procedure:
There are many choices that can be made at both steps. We will use a simple euclidean distance metric and complete linkage clustering method, which work well for these data (but you can experiment with other metrics/methods, that’s why it’s called “exploratory analysis”!).
In R, we can use the dist()
function for the first step and hclust()
for the second step.
dist()
The dist()
function works best with a matrix of data. Like we did in the PCA lesson, let’s create a matrix from our table of counts.
# Create a matrix
hclust_matrix <- trans_cts %>%
select(-gene) %>%
as.matrix()
# assign rownames
rownames(hclust_matrix) <- trans_cts$gene
However, we do not want to use all of the genes in that matrix, but rather the subset of genes that we think are differentially expressed between time points. To subset a matrix, we need to use the “square bracket” notation [rows , columns]
. Because the rows on our matrix are named, we can use our vector of gene names created earlier (candidate_genes
) to do this:
After this step, we want to scale the data (to obtain z-scores). The scale()
function can be used with a matrix, where it will scale each column by its mean and standard deviation. However, we want to scale the expression of our genes, which are the rows of the matrix! So, we need to do a little gymnastics here, and first transpose our matrix, then scale, then transpose it back again.
We can do it with pipes, which makes the order of operations quite clear:
hclust_matrix <- hclust_matrix %>%
# transpose the matrix so genes are as columns
t() %>%
# apply scalling to each column of the matrix (genes)
scale() %>%
# transpose back so genes are as rows again
t()
Finally, we can now calculate the distance between each gene (row) in our matrix:
hclust()
With this distance matrix, we are ready to apply the hclust()
function:
gene_hclust <- hclust(gene_dist, method = "complete")
# The default `plot()` function can be used to produce a simple dendrogram
plot(gene_hclust, labels = FALSE)
abline(h = 10, col = "brown", lwd = 2) # add horizontal line to illustrate cutting dendrogram
From the dendrogram above, we can see there’s substantial structure in our data. We can use the dendrogram to visualy determine how many groups we think are worth focusing on. You can imagine having an horizontal line that you slide down the dendrogram, cutting it into different groups (e.g. the brown line shown above).
We can use the cutree()
function do this dendrogram “cutting”. For example, if you want to cut it into 5 groups, you would simply do:
## SPAC11D3.05 SPAC5H10.13c SPAC13C5.04 SPAC2G11.12 SPAC821.09
## 1 2 1 2 3
## SPAC22A12.17c SPAC10F6.15 SPAC1002.20 SPAC3H1.11 SPAC23C11.06c
## 1 4 1 2 1
The output of cutree()
is a named vector, where each gene is assigned to a cluster number (the numbers are somewhat arbitrary, but genes with the same number belong to the same cluster.
For analysis purposes, it’s convenient to convert this vector into a tibble
, so we can join it with our table of gene expression counts that will be used to make a plot:
gene_cluster <- cutree(gene_hclust, k = 5) %>%
# turn the named vector into a tibble
enframe() %>%
# rename some of the columns
rename(gene = name, cluster = value)
head(gene_cluster)
## # A tibble: 6 x 2
## gene cluster
## <chr> <int>
## 1 SPAC11D3.05 1
## 2 SPAC5H10.13c 2
## 3 SPAC13C5.04 1
## 4 SPAC2G11.12 2
## 5 SPAC821.09 3
## 6 SPAC22A12.17c 1
Finally, we are ready to visualise our gene expression trends, but separate our genes into their respective clusters.
First, let’s join our table with gene clusters to the table of gene summarised counts that we produced in the previous lesson:
trans_cts_cluster <- trans_cts_mean %>%
inner_join(gene_cluster, by = "gene")
head(trans_cts_cluster)
## # A tibble: 6 x 6
## gene strain minute mean_cts_scaled nrep cluster
## <chr> <chr> <dbl> <dbl> <int> <int>
## 1 SPAC1002.17c mut 0 -0.639 3 4
## 2 SPAC1002.17c mut 15 -0.584 3 4
## 3 SPAC1002.17c mut 30 1.03 3 4
## 4 SPAC1002.17c mut 60 2.12 3 4
## 5 SPAC1002.17c mut 120 -0.635 3 4
## 6 SPAC1002.17c mut 180 -1.03 3 4
And now we can make a plot, facetting according to strain
and cluster
:
trans_cts_cluster %>%
ggplot(aes(minute, mean_cts_scaled)) +
geom_line(aes(group = gene)) +
facet_grid(rows = vars(strain), cols = vars(cluster))
This is a much nicer picture, as we can now see how different clusters represent genes with (broadly) similar expression trends. It’s still not perfect, and you can play around by cutting the tree into even more groups and re-doing these plots.
It is worth pointing out that this result is not entirely surprising. Since we pre-filtered our genes to include those that showed marked difference between T0 and any of the other time points, it’s no surprise that our genes partition into groups having a peak (or dip) of expression at a particular time-point. However, this analysis illustrates how clustering can be used to help partition genes according to particular patterns, regardless of how you pre-selected them.
Finally, here is an advanced ggplot2
trick to add a line to each facet showing the median expression in each cluster:
Often, gene expression patterns are represented as heatmaps. These can be easily created using specialised functions that do the clustering and create the heatmap for you.
One package that provides many adanced features is the R/Bioconductor ComplexHeatmap
.
It’s basic usage is:
But you can make much more complex heatmaps - look at its full documentation
But note that although heatmaps are quite fun and colourful, they may not always be the best representation for your data. In this case, since we have time-data, the graphs we made ourselves above might be more adequate.