In your project’s directory, create a new script called 04_gene_clustering.R
, and start with the following code:
##### setup ####
# load packages
library(tidyverse)
# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")
Before starting with our gene clustering, it’s useful to reduce the dimensionality of the data and work only with those genes that are likely to show some differences of expression.
We were given a table with the results of a differential analysis test between 0 min and each of the other time-points:
## # A tibble: 30,055 x 8
## gene baseMean log2FoldChange lfcSE stat pvalue padj comparison
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SPAC212.11 8.55 1.54 0.497 1.09 0.276 1 15
## 2 SPAC212.09c 50.8 0.399 0.273 0 1 1 15
## 3 SPAC212.04c 38.3 -0.0230 0.269 0 1 1 15
## 4 SPNCRNA.601 9.47 -0.0841 0.483 0 1 1 15
## 5 SPAC977.11 70.4 -0.819 0.201 0 1 1 15
## 6 SPAC977.13c 36.7 1.19 0.344 0.552 0.581 1 15
## 7 SPAC977.15 49.1 0.600 0.208 0 1 1 15
## 8 SPAC977.16c 83.2 0.148 0.239 0 1 1 15
## 9 SPNCRNA.607 60.4 0.0638 0.268 0 1 1 15
## 10 SPAC1F8.06 74.2 -1.58 0.298 -1.94 0.0520 1 15
## # … with 30,045 more rows
The padj
column contains p-values (adjusted for multiple testing) of each test’s comparison
. We can therefore use this table to help us focus on a subset of potentially interesting genes.
One way to visualise the differences between T0 and the other time points is with an MA plot, which shows the average expression of each gene plotted against the log-fold-change in expression between the samples:
Exercise:
Try to create the MA plot by yourself.
Why is the fold-change reported on a log-scale?
The plot above is in agreement with our PCA analysis, which showed that cells from T30 were overall transcriptomically quite distinct from T0 cells.
From the coloured clouds of points, we can also see that more genes seem to markedly increase than decrease their expression in relation to T0.
Now, let’s focus on those genes where padj < 0.01
for any of the time-points and then see how their expression changes across time.
We start by extracting the genes of interest based on our criteria:
candidate_genes <- test_result %>%
filter(padj < 0.01) %>% # filter table
pull(gene) %>% # extract the gene column as a vector
unique() # retain only unique values
For convenience we also make a long version of our table (covered in the exploratory analysis lesson):
trans_cts_long <- trans_cts %>%
# convert to long format
pivot_longer(cols = wt_0_r1:mut_180_r3, names_to = "sample", values_to = "cts") %>%
# join with sample info table
full_join(sample_info, by = ("sample"))
Now we filter our table to retain only genes of interest and summarise our counts per time point:
trans_cts_mean <- trans_cts_long %>%
# filter genes of interest
filter(gene %in% candidate_genes) %>%
# for each gene, strain and minute
group_by(gene, strain, minute) %>%
# calculate mean and number of replicates
summarise(mean_cts = mean(cts),
nrep = n()) %>%
# remove grouping from downstream analysis
ungroup()
head(trans_cts_mean)
## # A tibble: 6 x 5
## gene strain minute mean_cts nrep
## <chr> <chr> <dbl> <dbl> <int>
## 1 SPAC1002.17c mut 0 6.03 3
## 2 SPAC1002.17c mut 15 6.12 3
## 3 SPAC1002.17c mut 30 8.79 3
## 4 SPAC1002.17c mut 60 10.6 3
## 5 SPAC1002.17c mut 120 6.04 3
## 6 SPAC1002.17c mut 180 5.39 3
We want to explore patterns of expression across time in the two strains. We can plot each gene’s expression across time (although the result is very confusing!)
trans_cts_mean %>%
ggplot(aes(minute, mean_cts)) +
geom_line(aes(group = gene), alpha = 0.3) +
facet_grid(rows = vars(strain))
When we are interested in relative patterns of expression, it is very useful to transform our data to ensure that the different genes are represented on a comparable scale.
You could imagine that two genes might have changed by the same magnitude (say, doubled the expression between time points), but their base mean levels of expression might have been quite different. For example, one gene might have changed from 10 to 20 and another from 100 to 200.
If what we’re interested in is the relative change in expression, then those two genes will appear more different than they really are in our plots.
A useful data transformation in this case is to center and scale each genes’ expression by their mean and standard deviation, respectively. The values thus obtained are known as z-scores, and can be interpreted as the “number of standard deviations away from the mean”. A positive z-score means that the gene’s expression was above the average across samples, whereas a negative one means it was below average. A value of zero means the gene’s expression was exactly average.
Let’s re-do our data summary, but scale the each gene’s expression data first:
trans_cts_mean <- trans_cts_long %>%
# filter to retain only genes of interest
filter(gene %in% candidate_genes) %>%
# for each gene
group_by(gene) %>%
# scale the cts column
mutate(cts_scaled = (cts - mean(cts))/sd(cts)) %>%
# for each gene, strain and minute
group_by(gene, strain, minute) %>%
# calculate the mean (scaled) cts
summarise(mean_cts_scaled = mean(cts_scaled),
nrep = n()) %>%
ungroup()
And now we can re-do the plot, and see how this transformation changes our perception of the data:
trans_cts_mean %>%
ggplot(aes(minute, mean_cts_scaled)) +
geom_line(aes(group = gene), alpha = 0.2) +
geom_hline(yintercept = 0, colour = "brown", linetype = "dashed") +
facet_grid(rows = vars(strain))
As you see, this is a substantially clearer picture than before, we can clearly see some patterns there. It’s not perfect yet: as you can see there are different “populations” of genes with different trends. In the clustering lesson we will see how we can partition the genes into different categories and separate them out so we don’t have this terrible spaghetti plot.