Lesson Objectives

Understand how MA plots can be useful to explore the results of differential gene expression tests
Understand how scaling data can help in visualising gene expression trends

Further resources

Book chapter from Holmes & Huber Modern Statistics for Modern Biology:
- High-Throughput Count Data

Setup

In your project’s directory, create a new script called 04_gene_clustering.R, and start with the following code:

##### setup ####

# load packages
library(tidyverse)

# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")

MA plots

Before starting with our gene clustering, it’s useful to reduce the dimensionality of the data and work only with those genes that are likely to show some differences of expression.

We were given a table with the results of a differential analysis test between 0 min and each of the other time-points:

## # A tibble: 30,055 x 8
##    gene        baseMean log2FoldChange lfcSE   stat pvalue  padj comparison
##    <chr>          <dbl>          <dbl> <dbl>  <dbl>  <dbl> <dbl>      <dbl>
##  1 SPAC212.11      8.55         1.54   0.497  1.09  0.276      1         15
##  2 SPAC212.09c    50.8          0.399  0.273  0     1          1         15
##  3 SPAC212.04c    38.3         -0.0230 0.269  0     1          1         15
##  4 SPNCRNA.601     9.47        -0.0841 0.483  0     1          1         15
##  5 SPAC977.11     70.4         -0.819  0.201  0     1          1         15
##  6 SPAC977.13c    36.7          1.19   0.344  0.552 0.581      1         15
##  7 SPAC977.15     49.1          0.600  0.208  0     1          1         15
##  8 SPAC977.16c    83.2          0.148  0.239  0     1          1         15
##  9 SPNCRNA.607    60.4          0.0638 0.268  0     1          1         15
## 10 SPAC1F8.06     74.2         -1.58   0.298 -1.94  0.0520     1         15
## # … with 30,045 more rows

The padj column contains p-values (adjusted for multiple testing) of each test’s comparison. We can therefore use this table to help us focus on a subset of potentially interesting genes.

One way to visualise the differences between T0 and the other time points is with an MA plot, which shows the average expression of each gene plotted against the log-fold-change in expression between the samples:

Exercise:

Try to create the MA plot by yourself.

Why is the fold-change reported on a log-scale?

Link to full exercise

The plot above is in agreement with our PCA analysis, which showed that cells from T30 were overall transcriptomically quite distinct from T0 cells.

From the coloured clouds of points, we can also see that more genes seem to markedly increase than decrease their expression in relation to T0.

Visualising expression trends

Now, let’s focus on those genes where padj < 0.01 for any of the time-points and then see how their expression changes across time.

We start by extracting the genes of interest based on our criteria:

candidate_genes <- test_result %>% 
  filter(padj < 0.01) %>%    # filter table
  pull(gene) %>%             # extract the gene column as a vector
  unique()                   # retain only unique values

For convenience we also make a long version of our table (covered in the exploratory analysis lesson):

trans_cts_long <- trans_cts %>% 
  # convert to long format
  pivot_longer(cols = wt_0_r1:mut_180_r3, names_to = "sample", values_to = "cts")  %>% 
  # join with sample info table
  full_join(sample_info, by = ("sample"))

Now we filter our table to retain only genes of interest and summarise our counts per time point:

trans_cts_mean <- trans_cts_long %>% 
  # filter genes of interest
  filter(gene %in% candidate_genes) %>% 
  # for each gene, strain and minute
  group_by(gene, strain, minute) %>% 
  # calculate mean and number of replicates
  summarise(mean_cts = mean(cts),
            nrep = n()) %>% 
  # remove grouping from downstream analysis
  ungroup()

head(trans_cts_mean)

## # A tibble: 6 x 5
##   gene         strain minute mean_cts  nrep
##   <chr>        <chr>   <dbl>    <dbl> <int>
## 1 SPAC1002.17c mut         0     6.03     3
## 2 SPAC1002.17c mut        15     6.12     3
## 3 SPAC1002.17c mut        30     8.79     3
## 4 SPAC1002.17c mut        60    10.6      3
## 5 SPAC1002.17c mut       120     6.04     3
## 6 SPAC1002.17c mut       180     5.39     3

We want to explore patterns of expression across time in the two strains. We can plot each gene’s expression across time (although the result is very confusing!)

trans_cts_mean %>% 
  ggplot(aes(minute, mean_cts)) +
  geom_line(aes(group = gene), alpha = 0.3) +
  facet_grid(rows = vars(strain))

Scaling data to improve visualisation

When we are interested in relative patterns of expression, it is very useful to transform our data to ensure that the different genes are represented on a comparable scale.

You could imagine that two genes might have changed by the same magnitude (say, doubled the expression between time points), but their base mean levels of expression might have been quite different. For example, one gene might have changed from 10 to 20 and another from 100 to 200.

If what we’re interested in is the relative change in expression, then those two genes will appear more different than they really are in our plots.

A useful data transformation in this case is to center and scale each genes’ expression by their mean and standard deviation, respectively. The values thus obtained are known as z-scores, and can be interpreted as the “number of standard deviations away from the mean”. A positive z-score means that the gene’s expression was above the average across samples, whereas a negative one means it was below average. A value of zero means the gene’s expression was exactly average.

Let’s re-do our data summary, but scale the each gene’s expression data first:

trans_cts_mean <- trans_cts_long %>% 
  # filter to retain only genes of interest
  filter(gene %in% candidate_genes) %>% 
  # for each gene
  group_by(gene) %>% 
  # scale the cts column
  mutate(cts_scaled = (cts - mean(cts))/sd(cts)) %>% 
  # for each gene, strain and minute
  group_by(gene, strain, minute) %>%
  # calculate the mean (scaled) cts
  summarise(mean_cts_scaled = mean(cts_scaled),
            nrep = n()) %>% 
  ungroup()

And now we can re-do the plot, and see how this transformation changes our perception of the data:

trans_cts_mean %>%
  ggplot(aes(minute, mean_cts_scaled)) +
  geom_line(aes(group = gene), alpha = 0.2) + 
  geom_hline(yintercept = 0, colour = "brown", linetype = "dashed") +
  facet_grid(rows = vars(strain))

As you see, this is a substantially clearer picture than before, we can clearly see some patterns there. It’s not perfect yet: as you can see there are different “populations” of genes with different trends. In the clustering lesson we will see how we can partition the genes into different categories and separate them out so we don’t have this terrible spaghetti plot.