1 Intro

1.1 Import data

Import the four CSV files into R and store in the following objects: raw_cts, trans_cts, sample_info and test_result.
How many samples are there? Was the design balanced (i.e. do all samples have the same number of replicates)?
How many genes do you have gene expression levels for?

# 1. Import data
raw_cts <- read_csv("./data/counts_raw.csv")
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")

# 2. number of samples is in the "sample_info" table
nrow(sample_info)

# 4. this can be taken from the table of counts
nrow(trans_cts)

2 Exploratory analysis of count data

2.1 Reshape table

Convert the raw_cts table to a “long” format using the pivot_longer() function. Save it into an object called raw_cts_long.

raw_cts_long <- raw_cts %>% 
  pivot_longer(wt_0_r1:mut_180_r3, names_to = "sample", values_to = "cts")

2.2 Join tables

Take the raw_cts_long object created in the previous exercise and join it with the sample_info table.
Produce the plot below for the raw count data. (hint: you might want to try log-transforming the x-axis data using the log10() function).

# Join with sample information table
raw_cts_long <- full_join(raw_cts_long, sample_info, by = ("sample"))

# Make the plot
raw_cts_long %>%
  # add pseudo-count of 1 because log(0) = -Inf
  ggplot(aes(log10(cts + 1), colour = replicate)) + 
  geom_freqpoly(binwidth = 1) +
  facet_grid(rows = vars(strain), cols = vars(minute))

Try out other ways to visualise these data, for example as a boxplot.

# Make a boxplot
raw_cts_long %>%
  # make sure minute is specified as a factor
  ggplot(aes(factor(minute), log10(cts + 1), fill = strain)) + 
  geom_boxplot() + 
  facet_grid(cols = vars(replicate))

2.3 Scatterplot

Compare the expression between a WT cell at T0 and T30. What can you conclude from this?

# Scatterplot between T0 and T30
# the correlation is lower than between replicates at T0, for example
trans_cts %>% 
  ggplot(aes(wt_0_r1, wt_30_r1)) + geom_point() +
  geom_abline(colour = "brown")

3 PCA

3.1 Examine `prcomp()` output

After running the PCA investigate:

What type of object is it? (hint: class())
The object returned by this function is a list-type object. We haven’t encountered these before. Use the str() function to examine what is inside the object (i.e. its structure).
Look at the prcomp() help page to identify which parts of the object contain the eigenvalues, the variable loadings and the PC scores (hint: look under the section “Value” of the help page). As a reminder:
- PC scores: new coordinates of the data on the PC axis
- eigenvalues: variance explained by each PC
- variable loadings: “weight” that each original gene has on each PC axis
To access elements of this list-type object we can use the $. For example, the PC scores can be obtained with: pc_scores <- sample_pca$x. Similarly to this, extract the eigenvalues and variable loadings to objects called pc_eigenvalues and pc_loadings.
- what class is each of these elements?
How many principal components do we have?

# 1. class of the object
class(sample_pca)

# 2. structure of the object
str(sample_pca)

# 3. checking the help ?prcomp, under the section "Value" is says:
# "sdev" contains the standard deviation explained by each PC, so if we square it we get the eigenvalues (or explained variance)
# "rotation" contains the variable loadings for each PC, which define the eigenvectors
# "x" contains the PC scores, i.e. the data projected on the new PC axis
# "center" in this case contains the mean of each gene, which was subtracted from each value
# "scale" contains the value FALSE because we did not scale the data by the standard deviation

# 4. we can use the 'dollar sign' to access these elements
pc_scores <- sample_pca$x              # PC scores (a matrix)
pc_eigenvalues <- sample_pca$sdev^2    # eigenvalues (a vector) - notice we square the values
pc_loadings <- sample_pca$rotation     # variable loadings (a matrix)

# 5. here's three ways to check this
ncol(pc_scores)
length(pc_eigenvalues)
ncol(pc_loadings)

3.2 Annotating PC plot

Fix the following code (where the word FIXME appears), to recreate the plot below. Once the code is fixed, assign (<-) the result to an object called pca_plot.

What can you conclude from this result?

# get the PC scores from prcomp object
sample_pca$x %>% 
  # convert it to a tibble
  as_tibble(rownames = "sample") %>% 
  # join with "sample_info" table
  FIXME(sample_info, by = "FIXME") %>% 
  # make the plot
  ggplot(aes(x = PC1, y = PC2, 
             FIXME = factor(minute), shape = FIXME)) +
  geom_point()

pca_plot <- sample_pca$x %>% # extract the loadings from prcomp
  # convert to a tibble retaining the sample names as a new column
  as_tibble(rownames = "sample") %>% 
  # join with "sample_info" table 
  full_join(sample_info, by = "sample") %>% 
  # create the plot
  ggplot(aes(x = PC1, y = PC2, colour = factor(minute), shape = strain)) +
  geom_point()

# print the result (in this case a ggplot)
pca_plot

3.3 Visualise variable loadings

Fix the following code (where the word FIXME appears), to recreate the plot below. Once the code is fixed, assign (<-) the result to an object called loadings_plot.

ggplot(data = top_loadings) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = FIXME), 
               arrow = arrow(length = unit(0.1, "in")),
               FIXME = "brown") +
  geom_text(aes(x = FIXME, y = PC2, label = gene),
            nudge_y = 0.005, size = 3) +
  scale_x_continuous(expand = c(0.02, 0.02))

loadings_plot <- ggplot(data = top_loadings) +
  geom_segment(aes(x = 0, y = 0, xend = PC1, yend = PC2), 
               arrow = arrow(length = unit(0.1, "in")),
               colour = "brown") +
  geom_text(aes(x = PC1, y = PC2, label = gene),
            nudge_y = 0.005, size = 3) +
  scale_x_continuous(expand = c(0.02, 0.02))
loadings_plot

4 Exploring test results

4.1 MA plot

Recreate the plot below from the test_result table. (hint: notice the x-axis is log-transformed)
Why is the fold-change log transformed?

# 1. making the MA plot
test_result %>% 
  ggplot(aes(log10(baseMean), log2FoldChange)) +
  geom_point(alpha = 0.1) +
  facet_wrap(vars(comparison))

# 2. The reason fold change is log-transformed is:
# Because a fold-change (FC) is a ratio between two things FC = a/b
# if a > b, then the FC can vary from 1 to infinity!
# but if a < b, then it can only go from 0 to 1
# therefore, ratios are not symmetric around equality (a = b)
# taking the log of a ratio solves this problem!
# For example:
# 4/1 = 4     and  log2(4/1) =  2
# 1/4 = 0.25  and  log2(1/4) = -2
# Note that another common example where log-transformation should always be used is RT-qPCR data!

Bonus: try and re-create the plot below where the x-axis is on a log-scale but showing the original units and genes with an adjusted p-value below 0.01 are highlighted in red. (hint: the function ifelse() is useful here. This may be hard if you’re new with R: but look at the solution and see if you can understand the trick we’re using.)

test_result %>% 
  # add column which contains value only if padj < 0.01
  mutate(sig = ifelse(padj < 0.01, log2FoldChange, NA)) %>% 
  # make the plot
  ggplot(aes(baseMean, log2FoldChange)) +
  geom_point(alpha = 0.1) +
  geom_point(aes(y = sig), colour = "brown", size = 1) +
  scale_x_continuous(trans = "log10") +
  facet_wrap(vars(comparison))

Exercises for Exploratory analysis of RNAseq data