prcomp()
functionIn your project’s directory, create a new script called 03_pca_samples.R
, and start with the following code:
Having expression data for thousands of genes can be overwhelming to explore! This is a good example of a multi-dimensional dataset: we have many variables (genes) that we want to use to understand patterns of similarity between our samples (yeast cells).
There are several methods to help summarise multi-dimensional data, here we will show how to use PCA (principal component analysis).
PCA is a transformation of high-dimensional data into an orthogonal basis such that first principal component (PC, aka “axis”) is aligned with the largest source of variance, the second PC to the largest remaining source of variance and so on. This makes high-dimensional data more amenable to visual exploration, as we can examine projections to the first two (or few) PCs.
Link to the original animation
There are three basic types of information we obtain from a PCA:
prcomp()
To compute a PCA in R we can use the prcomp()
function. This function takes a matrix of data, where the columns are the variables that we want to use to transform our samples, which should be the rows of the matrix.
In our case, we want to look for similarities across our yeast cells (samples = rows) based on gene expression (variables = columns). For that reason, we need to provide a transposed version of our table to the prcomp()
function:
# Create a matrix from our table of counts
pca_matrix <- trans_cts %>%
# make the "gene" column become the rownames of the table
column_to_rownames("gene") %>%
# coerce to a matrix
as.matrix() %>%
# transpose the matrix so that rows = samples and columns = variables
t()
# Perform the PCA
sample_pca <- prcomp(pca_matrix)
Exercise:
Examine the output of the
prcomp()
function.
A matrix
is another type of object in R. Matrices are a bit similar to data.frame
, but they only contain values of a single type, in this case numeric values (whereas in a data.frame
different columns can contain different types of data).
In bioinformatics packages you will often have data stored in these matrix objects. It is therefore useful to know how to access values in them and how to convert them to a data.frame
.
You can access values in a matrix using [rows, columns]
notation:
## SPAC212.11 SPAC212.09c SPAC212.04c SPNCRNA.601 SPAC977.11
## wt_0_r1 4.947329 5.443589 5.751174 5.731843 7.046601
## wt_0_r2 4.969572 6.205209 5.058125 4.868198 6.837324
## wt_0_r3 5.657995 6.172470 5.544557 4.936256 6.788719
## wt_15_r1 5.452545 5.943911 5.675589 5.569892 6.870560
## wt_15_r2 5.444799 6.333457 5.175381 5.319763 6.092104
## wt_15_r3 6.501252 6.405294 5.101433 4.198365 6.189225
## wt_30_r1 5.078831 6.711639 5.823344 5.168548 6.908198
## wt_30_r2 5.254861 7.146486 6.313480 4.772003 6.658501
## wt_30_r3 5.324121 6.932336 6.595747 5.078932 6.932336
## wt_60_r1 4.853563 6.790664 5.618074 5.399681 6.220495
To convert this matrix into a tibble
object we can use the function as_tibble()
:
But now we’ve lost our sample names (which were the row names of the matrix)! If we look at the function’s help (?as_tibble
), we can see that there’s a way to solve this problem:
# Convert matrix to tibble - add colnames to a new column called "gene"
as_tibble(pca_matrix, rownames = "sample")
Now you know how to convert a matrix
to a data.frame
, which can be a very handy trick to have!
Often it is a good idea to standardize the variables before doing the PCA. This is often done by centering the data on the mean and then scaling it by dividing by the standard deviation. This ensures that the PCA is not too influenced by genes with higher absolute expression. By default, the prcomp()
function does the centering but not the scaling. See the ?prcomp
help to see how to change this default behaviour. In this case, because we are using the transformed data, this is not too much of an issue, but try re-running the PCA with the centered and scaled data to see how it changes it.
Note that scaling is particularly recommended if your variables are on different scales.
We talk more about standardizing gene expression data in the gene clustering lesson.
The first important question to ask after we do a PCA is how many PCs we have and how much variance they explain.
We need to extract the variance explained by each PC from our sample_pca
object. prcomp()
returns an oject of its own class. To access individual elements from this object, we use the $
notation, similarly to how you can access individual columns from a data.frame.
We’ve seen this in the previous exercise and already created an object with the data we are interested in:
This is a vector with variance explained by each of the 36 PC axis. But to make a plot with ggplot2 we need to have data in a data.frame/tibble.
Here’s one way we can do this:
# create a "tibble" manually with
# a variable indicating the PC number
# and a variable with the variances
pc_eigenvalues <- tibble(PC = factor(1:length(pc_eigenvalues)),
variance = pc_eigenvalues) %>%
# add a new column with the percent variance
mutate(pct = variance/sum(variance)*100) %>%
# add another column with the cumulative variance explained
mutate(pct_cum = cumsum(pct))
# print the result
pc_eigenvalues
## # A tibble: 36 x 4
## PC variance pct pct_cum
## <fct> <dbl> <dbl> <dbl>
## 1 1 351. 38.8 38.8
## 2 2 147. 16.3 55.2
## 3 3 76.7 8.50 63.7
## 4 4 51.2 5.67 69.3
## 5 5 48.8 5.41 74.7
## 6 6 29.5 3.26 78.0
## 7 7 27.0 2.99 81.0
## 8 8 16.8 1.86 82.8
## 9 9 14.3 1.59 84.4
## 10 10 13.5 1.49 85.9
## # … with 26 more rows
Notice that we’ve encoded the PC
variable as a factor because this is really a categorical variable (the identifier of the principal component).
This table can now be used to produce a Scree Plot, which shows the fraction of total variance explained by each principal component. We will show both the variance explained by individual PCs as well as the cumulative variance, using a type of visualisation known as a pareto chart:
pc_eigenvalues %>%
ggplot(aes(x = PC)) +
geom_col(aes(y = pct)) +
geom_line(aes(y = pct_cum, group = 1)) +
geom_point(aes(y = pct_cum)) +
labs(x = "Principal component", y = "Fraction variance explained")
You can see how successive PCs explain less and less of the total variance in the original data. Also note that 36 components are enough to virtually explain all of the variance in our dataset. This makes sense in this case since we only have 36 biological samples.
Next, we turn to visualising our samples on the new PC coordinates. Again, we had already extracted this from our prcomp
object in the previous exercise:
This object is of class matrix, so we need to first convert it to a data.frame/tibble for ggplot2.
pc_scores <- pc_scores %>%
# convert to a tibble retaining the sample names as a new column
as_tibble(rownames = "sample")
# print the result
pc_scores
## # A tibble: 36 x 37
## sample PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 wt_0_… -21.5 5.53 0.130 -8.28 10.2 -4.45 2.93 -0.121 0.299
## 2 wt_0_… -20.9 2.80 -13.3 -0.106 7.52 0.392 -0.137 -5.90 0.900
## 3 wt_0_… -20.8 1.83 -13.5 0.402 8.98 0.263 0.170 -3.55 1.13
## 4 wt_15… 12.2 27.9 11.1 -2.04 5.00 20.0 14.2 -3.50 -7.37
## 5 wt_15… 11.0 20.2 5.92 -1.18 -3.16 0.975 -6.95 -2.88 5.86
## 6 wt_15… 14.5 28.7 -4.50 26.1 -2.23 -16.8 10.5 0.393 -1.86
## 7 wt_30… 33.1 -6.93 2.18 -10.4 3.76 -4.28 8.81 4.05 5.50
## 8 wt_30… 35.7 -6.54 -14.8 0.719 -0.640 4.31 0.349 -3.51 2.83
## 9 wt_30… 33.2 -13.0 -16.9 1.83 2.43 3.47 1.76 -2.14 5.72
## 10 wt_60… -1.87 -11.9 10.3 9.11 2.64 4.40 1.37 -2.82 5.68
## # … with 26 more rows, and 27 more variables: PC10 <dbl>, PC11 <dbl>,
## # PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
## # PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>,
## # PC24 <dbl>, PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>,
## # PC30 <dbl>, PC31 <dbl>, PC32 <dbl>, PC33 <dbl>, PC34 <dbl>, PC35 <dbl>,
## # PC36 <dbl>
And now we can do a PC plot:
This is a very simple plot, but already we can see there is some structure in our data, suggesting clusters of samples that are more similar to each other.
Exercise:
Recreate the plot below. Save the plot in an object called
pca_plot
From this plot, we can see that:
Finally, to be able to interpret our PC-axis we could ask the question of which genes have the most influence on each PC axis. This information is contained in the variable loadings of the PCA, within the rotation
value of the prcomp
object.
We have already extracted this from our prcomp
object in the previous exercise:
Similarly to the PC scores, this object is a matrix, which we can convert to a data.frame/tibble for plotting and data manipulation:
## # A tibble: 6,011 x 37
## gene PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SPAC… 0.00653 0.0217 -0.0146 2.04e-2 0.00243 -0.00850 0.00546 -7.22e-3
## 2 SPAC… 0.00915 -0.0134 -0.00984 2.24e-2 -0.00994 0.0142 -0.00219 4.34e-4
## 3 SPAC… 0.0125 -0.0153 0.00109 -2.17e-2 0.00303 -0.00454 -0.0264 1.89e-2
## 4 SPNC… -0.00461 0.00571 0.00790 -2.29e-2 0.00604 0.0204 -0.00352 1.15e-2
## 5 SPAC… -0.00176 -0.00581 -0.0139 -2.42e-2 0.00252 -0.00346 0.0223 -9.61e-3
## 6 SPAC… 0.0342 -0.0227 -0.00138 -1.52e-2 0.0112 -0.00609 -0.00245 2.42e-3
## 7 SPAC… 0.0126 0.0223 -0.00249 8.72e-3 0.0388 -0.0166 0.00410 -1.82e-2
## 8 SPAC… 0.0102 0.0182 -0.00752 1.84e-2 0.0636 -0.00837 -0.0220 -1.61e-2
## 9 SPNC… 0.0221 -0.0357 0.0146 -1.93e-2 0.0104 0.00874 -0.00480 -1.01e-2
## 10 SPAC… -0.0159 -0.0270 -0.0168 8.82e-4 -0.0249 -0.0288 -0.0236 -8.97e-4
## # … with 6,001 more rows, and 28 more variables: PC9 <dbl>, PC10 <dbl>,
## # PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>,
## # PC17 <dbl>, PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>,
## # PC23 <dbl>, PC24 <dbl>, PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>,
## # PC29 <dbl>, PC30 <dbl>, PC31 <dbl>, PC32 <dbl>, PC33 <dbl>, PC34 <dbl>,
## # PC35 <dbl>, PC36 <dbl>
Because we have 6011 genes, it would be too much to visualise all of them at once. Instead, let’s try and answer the question: “What are the top 10 genes with highest loading on PC1 and PC2?”
There are different ways to do this, here we show one way, which would work for any number of PCs we are interested in:
top_genes <- pc_loadings %>%
# select only the PCs we are interested in
select(gene, PC1, PC2) %>%
# convert to a "long" format
pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>%
# for each PC
group_by(PC) %>%
# arrange by descending order of loading
arrange(desc(abs(loading))) %>%
# take the 10 top rows
slice(1:10) %>%
# pull the gene column as a vector
pull(gene) %>%
# ensure only unique genes are retained
unique()
top_genes
## [1] "SPAC23H3.15c" "SPAC22A12.17c" "SPCPB16A4.07" "SPBC24C6.09c"
## [5] "SPACUNK4.17" "SPBC16E9.16c" "SPBC660.05" "SPAC32A11.02c"
## [9] "SPBC21C3.19" "SPAC139.05" "SPAC27D7.09c" "SPAC1002.19"
## [13] "SPAC1002.17c" "SPAC22H10.13" "SPBC106.02c" "SPAC19D5.01"
## [17] "SPBC839.06" "SPBC1105.14"
Now, we can use this list of gene names to subset the eigenvalues table:
Exercise:
Recreate the plot below. Save the plot in an object called
loadings_plot
Using the patchwork
package, we put these two plots side by side to get a more complete and interpretable view of our PCA analysis:
library(patchwork)
# Adjust some aspects of each plot
pca_plot <- pca_plot +
coord_fixed(ratio = 0.4) +
labs(title = "PC scores") +
theme(legend.position = "none")
loadings_plot <- loadings_plot +
coord_fixed(ratio = 0.4) +
labs(title = "PC loadings")
# Put them together
(pca_plot | loadings_plot) + plot_annotation(tag_levels = "A")
Notice how we’ve forced our PC1 axis to be around twice as long as the PC2 axis (with coord_fixed()
. This is because the variance explained by PC1 is about twice as large as the variance explained by PC2.
autoplot()
This was all very useful to understand how PCA works and how to fully explore its results. However, it can sometimes be convenient to produce these graphs quickly.
We can use the autoplot()
function from the ggfortify
package to do this:
There’s also information about how much variation each principal component axis explains.
We can annotate this plot further by providing the autoplot()
function with our sample information data (see ?autoplot.prcomp
for more information):
Finally, you could easily add information about loadings to the plot by adding the options loadings = TRUE, loadings.label = TRUE
. But don’t do it, because we have thousands of genes, so the plot would be un-interpretable (and probably crash your computer)!
prcomp()
output with the {broom}
packageOne useful package that is part of the {tidyverse}
collection is the {broom}
package, which is used to convert the output of statistical objects into tibbles.
Although {broom}
doesn’t recognise objects from every statistical object, it does work with prcomp
objects, and we could have used it instead of tidying our data manually.
Here is how we could have used it:
PCA is an essential and very powerful technique for exploratory data analysis of multivariate data. As you see from this example, we can quickly see how our data is structured, based on transcriptome-wide signatures.
Note that lack of clustering in a PCA does not mean nothing is happening. Imagine an RNAseq experiment where only a dozen genes changed due to your treatment. Because there’s few genes, transcriptome-wide you might not capture structure in the data.
In summary, although this analysis gives some indication of how much influence a particular gene has along each PC, it is not a replacement for specialised statistical models and tools that assess specific hypothesis for differential expression between conditions.
Finally, it’s worth pointing out that PCA is a linear algorithm. There are other methods for non-linear dimensionality reduction, which can sometimes perform better than PCA (but perhaps less easily interpretable). Two common methods are called UMAP and t-SNE. These can be found in the uwot
and Rtsne
packages, respectively.