raw_cts
, trans_cts
, sample_info
and test_result
.# 1. Import data
raw_cts <- read_csv("./data/counts_raw.csv")
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")
test_result <- read_csv("./data/test_result.csv")
# 2. number of samples is in the "sample_info" table
nrow(sample_info)
# 4. this can be taken from the table of counts
nrow(trans_cts)
Convert the raw_cts
table to a “long” format using the pivot_longer()
function. Save it into an object called raw_cts_long
.
raw_cts_long
object created in the previous exercise and join it with the sample_info
table.log10()
function).# Join with sample information table
raw_cts_long <- full_join(raw_cts_long, sample_info, by = ("sample"))
# Make the plot
raw_cts_long %>%
# add pseudo-count of 1 because log(0) = -Inf
ggplot(aes(log10(cts + 1), colour = replicate)) +
geom_freqpoly(binwidth = 1) +
facet_grid(rows = vars(strain), cols = vars(minute))
prcomp()
outputAfter running the PCA investigate:
class()
)str()
function to examine what is inside the object (i.e. its structure).prcomp()
help page to identify which parts of the object contain the eigenvalues, the variable loadings and the PC scores (hint: look under the section “Value” of the help page). As a reminder:
$
. For example, the PC scores can be obtained with: pc_scores <- sample_pca$x
. Similarly to this, extract the eigenvalues and variable loadings to objects called pc_eigenvalues
and pc_loadings
.
# 1. class of the object
class(sample_pca)
# 2. structure of the object
str(sample_pca)
# 3. checking the help ?prcomp, under the section "Value" is says:
# "sdev" contains the standard deviation explained by each PC, so if we square it we get the eigenvalues (or explained variance)
# "rotation" contains the variable loadings for each PC, which define the eigenvectors
# "x" contains the PC scores, i.e. the data projected on the new PC axis
# "center" in this case contains the mean of each gene, which was subtracted from each value
# "scale" contains the value FALSE because we did not scale the data by the standard deviation
# 4. we can use the 'dollar sign' to access these elements
pc_scores <- sample_pca$x # PC scores (a matrix)
pc_eigenvalues <- sample_pca$sdev^2 # eigenvalues (a vector) - notice we square the values
pc_loadings <- sample_pca$rotation # variable loadings (a matrix)
# 5. here's three ways to check this
ncol(pc_scores)
length(pc_eigenvalues)
ncol(pc_loadings)
Fix the following code (where the word FIXME appears), to recreate the plot below. Once the code is fixed, assign (<-
) the result to an object called pca_plot
.
What can you conclude from this result?
# get the PC scores from prcomp object
sample_pca$x %>%
# convert it to a tibble
as_tibble(rownames = "sample") %>%
# join with "sample_info" table
FIXME(sample_info, by = "FIXME") %>%
# make the plot
ggplot(aes(x = PC1, y = PC2,
FIXME = factor(minute), shape = FIXME)) +
geom_point()
pca_plot <- sample_pca$x %>% # extract the loadings from prcomp
# convert to a tibble retaining the sample names as a new column
as_tibble(rownames = "sample") %>%
# join with "sample_info" table
full_join(sample_info, by = "sample") %>%
# create the plot
ggplot(aes(x = PC1, y = PC2, colour = factor(minute), shape = strain)) +
geom_point()
# print the result (in this case a ggplot)
pca_plot
Fix the following code (where the word FIXME appears), to recreate the plot below. Once the code is fixed, assign (<-
) the result to an object called loadings_plot
.
ggplot(data = top_loadings) +
geom_segment(aes(x = 0, y = 0, xend = PC1, yend = FIXME),
arrow = arrow(length = unit(0.1, "in")),
FIXME = "brown") +
geom_text(aes(x = FIXME, y = PC2, label = gene),
nudge_y = 0.005, size = 3) +
scale_x_continuous(expand = c(0.02, 0.02))
test_result
table. (hint: notice the x-axis is log-transformed)# 1. making the MA plot
test_result %>%
ggplot(aes(log10(baseMean), log2FoldChange)) +
geom_point(alpha = 0.1) +
facet_wrap(vars(comparison))
# 2. The reason fold change is log-transformed is:
# Because a fold-change (FC) is a ratio between two things FC = a/b
# if a > b, then the FC can vary from 1 to infinity!
# but if a < b, then it can only go from 0 to 1
# therefore, ratios are not symmetric around equality (a = b)
# taking the log of a ratio solves this problem!
# For example:
# 4/1 = 4 and log2(4/1) = 2
# 1/4 = 0.25 and log2(1/4) = -2
# Note that another common example where log-transformation should always be used is RT-qPCR data!
Bonus: try and re-create the plot below where the x-axis is on a log-scale but showing the original units and genes with an adjusted p-value below 0.01 are highlighted in red. (hint: the function ifelse()
is useful here. This may be hard if you’re new with R: but look at the solution and see if you can understand the trick we’re using.)
test_result %>%
# add column which contains value only if padj < 0.01
mutate(sig = ifelse(padj < 0.01, log2FoldChange, NA)) %>%
# make the plot
ggplot(aes(baseMean, log2FoldChange)) +
geom_point(alpha = 0.1) +
geom_point(aes(y = sig), colour = "brown", size = 1) +
scale_x_continuous(trans = "log10") +
facet_wrap(vars(comparison))