back to lesson’s homepage

Lesson Objectives

  • Gain an intuitive understanding of principal component analysis (PCA)
  • Use PCA to derive information about structure in multi-dimensional datasets such as RNAseq
  • Apply PCA using R’s prcomp() function
  • Understand how to access data from list-like objects
  • Apply several data manipulation functions to convert these objects into tidy tibbles (and produce visualisations of interest)

Further resources

Setup

In your project’s directory, create a new script called 03_pca_samples.R, and start with the following code:

##### setup ####

# load packages
library(tidyverse)

# read the data
trans_cts <- read_csv("./data/counts_transformed.csv")
sample_info <- read_csv("./data/sample_info.csv")

PCA basics

Having expression data for thousands of genes can be overwhelming to explore! This is a good example of a multi-dimensional dataset: we have many variables (genes) that we want to use to understand patterns of similarity between our samples (yeast cells).

There are several methods to help summarise multi-dimensional data, here we will show how to use PCA (principal component analysis).

PCA is a transformation of high-dimensional data into an orthogonal basis such that first principal component (PC, aka “axis”) is aligned with the largest source of variance, the second PC to the largest remaining source of variance and so on. This makes high-dimensional data more amenable to visual exploration, as we can examine projections to the first two (or few) PCs.

Link to the original animation

There are three basic types of information we obtain from a PCA:

  • PC scores - these are the coordinates of our samples on the new PC axis
  • Eigenvalues - these represent the variance explained by each PC. We can use these to calculate the proportion of variance in the original data that each axis explains.
  • Variable loadings (eigenvectors) - these reflect the “weight” that each variable has on a particular PC. These can be thought of as the correlation between the PC and the original variable.

Running PCA using prcomp()

To compute a PCA in R we can use the prcomp() function. This function takes a matrix of data, where the columns are the variables that we want to use to transform our samples, which should be the rows of the matrix.

In our case, we want to look for similarities across our yeast cells (samples = rows) based on gene expression (variables = columns). For that reason, we need to provide a transposed version of our table to the prcomp() function:

# Create a matrix from our table of counts
pca_matrix <- trans_cts %>% 
  # make the "gene" column become the rownames of the table
  column_to_rownames("gene") %>% 
  # coerce to a matrix
  as.matrix() %>% 
  # transpose the matrix so that rows = samples and columns = variables
  t()

# Perform the PCA
sample_pca <- prcomp(pca_matrix)

Exercise:

Examine the output of the prcomp() function.

Link to full exercise


A note about matrices

A matrix is another type of object in R. Matrices are a bit similar to data.frame, but they only contain values of a single type, in this case numeric values (whereas in a data.frame different columns can contain different types of data).

In bioinformatics packages you will often have data stored in these matrix objects. It is therefore useful to know how to access values in them and how to convert them to a data.frame.

You can access values in a matrix using [rows, columns] notation:

# Look at the first 10 rows and first 5 columns of the matrix
pca_matrix[1:10, 1:5]
##          SPAC212.11 SPAC212.09c SPAC212.04c SPNCRNA.601 SPAC977.11
## wt_0_r1    4.947329    5.443589    5.751174    5.731843   7.046601
## wt_0_r2    4.969572    6.205209    5.058125    4.868198   6.837324
## wt_0_r3    5.657995    6.172470    5.544557    4.936256   6.788719
## wt_15_r1   5.452545    5.943911    5.675589    5.569892   6.870560
## wt_15_r2   5.444799    6.333457    5.175381    5.319763   6.092104
## wt_15_r3   6.501252    6.405294    5.101433    4.198365   6.189225
## wt_30_r1   5.078831    6.711639    5.823344    5.168548   6.908198
## wt_30_r2   5.254861    7.146486    6.313480    4.772003   6.658501
## wt_30_r3   5.324121    6.932336    6.595747    5.078932   6.932336
## wt_60_r1   4.853563    6.790664    5.618074    5.399681   6.220495

To convert this matrix into a tibble object we can use the function as_tibble():

# Convert matrix to tibble
as_tibble(pca_matrix)

But now we’ve lost our sample names (which were the row names of the matrix)! If we look at the function’s help (?as_tibble), we can see that there’s a way to solve this problem:

# Convert matrix to tibble - add colnames to a new column called "gene"
as_tibble(pca_matrix, rownames = "sample")

Now you know how to convert a matrix to a data.frame, which can be a very handy trick to have!

Important note

Often it is a good idea to standardize the variables before doing the PCA. This is often done by centering the data on the mean and then scaling it by dividing by the standard deviation. This ensures that the PCA is not too influenced by genes with higher absolute expression. By default, the prcomp() function does the centering but not the scaling. See the ?prcomp help to see how to change this default behaviour. In this case, because we are using the transformed data, this is not too much of an issue, but try re-running the PCA with the centered and scaled data to see how it changes it.

Note that scaling is particularly recommended if your variables are on different scales.

We talk more about standardizing gene expression data in the gene clustering lesson.

Variance explained by PCs

The first important question to ask after we do a PCA is how many PCs we have and how much variance they explain.

We need to extract the variance explained by each PC from our sample_pca object. prcomp() returns an oject of its own class. To access individual elements from this object, we use the $ notation, similarly to how you can access individual columns from a data.frame.

We’ve seen this in the previous exercise and already created an object with the data we are interested in:

pc_eigenvalues <- sample_pca$sdev^2

This is a vector with variance explained by each of the 36 PC axis. But to make a plot with ggplot2 we need to have data in a data.frame/tibble.

Here’s one way we can do this:

# create a "tibble" manually with 
# a variable indicating the PC number
# and a variable with the variances
pc_eigenvalues <- tibble(PC = factor(1:length(pc_eigenvalues)), 
                         variance = pc_eigenvalues) %>% 
  # add a new column with the percent variance
  mutate(pct = variance/sum(variance)*100) %>% 
  # add another column with the cumulative variance explained
  mutate(pct_cum = cumsum(pct))

# print the result
pc_eigenvalues
## # A tibble: 36 x 4
##    PC    variance   pct pct_cum
##    <fct>    <dbl> <dbl>   <dbl>
##  1 1        351.  38.8     38.8
##  2 2        147.  16.3     55.2
##  3 3         76.7  8.50    63.7
##  4 4         51.2  5.67    69.3
##  5 5         48.8  5.41    74.7
##  6 6         29.5  3.26    78.0
##  7 7         27.0  2.99    81.0
##  8 8         16.8  1.86    82.8
##  9 9         14.3  1.59    84.4
## 10 10        13.5  1.49    85.9
## # … with 26 more rows

Notice that we’ve encoded the PC variable as a factor because this is really a categorical variable (the identifier of the principal component).

This table can now be used to produce a Scree Plot, which shows the fraction of total variance explained by each principal component. We will show both the variance explained by individual PCs as well as the cumulative variance, using a type of visualisation known as a pareto chart:

pc_eigenvalues %>% 
  ggplot(aes(x = PC)) +
  geom_col(aes(y = pct)) +
  geom_line(aes(y = pct_cum, group = 1)) + 
  geom_point(aes(y = pct_cum)) +
  labs(x = "Principal component", y = "Fraction variance explained")

You can see how successive PCs explain less and less of the total variance in the original data. Also note that 36 components are enough to virtually explain all of the variance in our dataset. This makes sense in this case since we only have 36 biological samples.

Visualising samples on PC space

Next, we turn to visualising our samples on the new PC coordinates. Again, we had already extracted this from our prcomp object in the previous exercise:

# The PC scores are stored in the "x" value of the prcomp object
pc_scores <- sample_pca$x

This object is of class matrix, so we need to first convert it to a data.frame/tibble for ggplot2.

pc_scores <- pc_scores %>% 
  # convert to a tibble retaining the sample names as a new column
  as_tibble(rownames = "sample")

# print the result
pc_scores
## # A tibble: 36 x 37
##    sample    PC1    PC2     PC3     PC4    PC5     PC6    PC7    PC8    PC9
##    <chr>   <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1 wt_0_… -21.5    5.53   0.130  -8.28  10.2    -4.45   2.93  -0.121  0.299
##  2 wt_0_… -20.9    2.80 -13.3    -0.106  7.52    0.392 -0.137 -5.90   0.900
##  3 wt_0_… -20.8    1.83 -13.5     0.402  8.98    0.263  0.170 -3.55   1.13 
##  4 wt_15…  12.2   27.9   11.1    -2.04   5.00   20.0   14.2   -3.50  -7.37 
##  5 wt_15…  11.0   20.2    5.92   -1.18  -3.16    0.975 -6.95  -2.88   5.86 
##  6 wt_15…  14.5   28.7   -4.50   26.1   -2.23  -16.8   10.5    0.393 -1.86 
##  7 wt_30…  33.1   -6.93   2.18  -10.4    3.76   -4.28   8.81   4.05   5.50 
##  8 wt_30…  35.7   -6.54 -14.8     0.719 -0.640   4.31   0.349 -3.51   2.83 
##  9 wt_30…  33.2  -13.0  -16.9     1.83   2.43    3.47   1.76  -2.14   5.72 
## 10 wt_60…  -1.87 -11.9   10.3     9.11   2.64    4.40   1.37  -2.82   5.68 
## # … with 26 more rows, and 27 more variables: PC10 <dbl>, PC11 <dbl>,
## #   PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
## #   PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>, PC23 <dbl>,
## #   PC24 <dbl>, PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>, PC29 <dbl>,
## #   PC30 <dbl>, PC31 <dbl>, PC32 <dbl>, PC33 <dbl>, PC34 <dbl>, PC35 <dbl>,
## #   PC36 <dbl>

And now we can do a PC plot:

pc_scores %>% 
  # create the plot
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point()

This is a very simple plot, but already we can see there is some structure in our data, suggesting clusters of samples that are more similar to each other.


Exercise:

Recreate the plot below. Save the plot in an object called pca_plot

Link to full exercise


From this plot, we can see that:

  • samples seem to cluster by time-point (is it related to time of exposure to stress? Or perhaps cell cycle phase, if it was not controlled for?)
  • T120 and T180 seem to be undistinguishable using PC1 and PC2 (but they might be distinguishable if we explore other PC axis!).
  • the genotype does not drive major changes in the transcriptome (but, again, it doesn’t mean that no genes differ between the genotypes, just that they don’t explain most of the variance captured by PC1 and PC2)

Exploring correlation between genes and PCs

Finally, to be able to interpret our PC-axis we could ask the question of which genes have the most influence on each PC axis. This information is contained in the variable loadings of the PCA, within the rotation value of the prcomp object.

We have already extracted this from our prcomp object in the previous exercise:

pc_loadings <- sample_pca$rotation

Similarly to the PC scores, this object is a matrix, which we can convert to a data.frame/tibble for plotting and data manipulation:

pc_loadings <- pc_loadings %>% 
  as_tibble(rownames = "gene")

# print the result
pc_loadings
## # A tibble: 6,011 x 37
##    gene       PC1      PC2      PC3      PC4      PC5      PC6      PC7      PC8
##    <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
##  1 SPAC…  0.00653  0.0217  -0.0146   2.04e-2  0.00243 -0.00850  0.00546 -7.22e-3
##  2 SPAC…  0.00915 -0.0134  -0.00984  2.24e-2 -0.00994  0.0142  -0.00219  4.34e-4
##  3 SPAC…  0.0125  -0.0153   0.00109 -2.17e-2  0.00303 -0.00454 -0.0264   1.89e-2
##  4 SPNC… -0.00461  0.00571  0.00790 -2.29e-2  0.00604  0.0204  -0.00352  1.15e-2
##  5 SPAC… -0.00176 -0.00581 -0.0139  -2.42e-2  0.00252 -0.00346  0.0223  -9.61e-3
##  6 SPAC…  0.0342  -0.0227  -0.00138 -1.52e-2  0.0112  -0.00609 -0.00245  2.42e-3
##  7 SPAC…  0.0126   0.0223  -0.00249  8.72e-3  0.0388  -0.0166   0.00410 -1.82e-2
##  8 SPAC…  0.0102   0.0182  -0.00752  1.84e-2  0.0636  -0.00837 -0.0220  -1.61e-2
##  9 SPNC…  0.0221  -0.0357   0.0146  -1.93e-2  0.0104   0.00874 -0.00480 -1.01e-2
## 10 SPAC… -0.0159  -0.0270  -0.0168   8.82e-4 -0.0249  -0.0288  -0.0236  -8.97e-4
## # … with 6,001 more rows, and 28 more variables: PC9 <dbl>, PC10 <dbl>,
## #   PC11 <dbl>, PC12 <dbl>, PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>,
## #   PC17 <dbl>, PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>,
## #   PC23 <dbl>, PC24 <dbl>, PC25 <dbl>, PC26 <dbl>, PC27 <dbl>, PC28 <dbl>,
## #   PC29 <dbl>, PC30 <dbl>, PC31 <dbl>, PC32 <dbl>, PC33 <dbl>, PC34 <dbl>,
## #   PC35 <dbl>, PC36 <dbl>

Because we have 6011 genes, it would be too much to visualise all of them at once. Instead, let’s try and answer the question: “What are the top 10 genes with highest loading on PC1 and PC2?”

There are different ways to do this, here we show one way, which would work for any number of PCs we are interested in:

top_genes <- pc_loadings %>% 
  # select only the PCs we are interested in
  select(gene, PC1, PC2) %>%
  # convert to a "long" format
  pivot_longer(matches("PC"), names_to = "PC", values_to = "loading") %>% 
  # for each PC
  group_by(PC) %>% 
  # arrange by descending order of loading
  arrange(desc(abs(loading))) %>% 
  # take the 10 top rows
  slice(1:10) %>% 
  # pull the gene column as a vector
  pull(gene) %>% 
  # ensure only unique genes are retained
  unique()

top_genes
##  [1] "SPAC23H3.15c"  "SPAC22A12.17c" "SPCPB16A4.07"  "SPBC24C6.09c" 
##  [5] "SPACUNK4.17"   "SPBC16E9.16c"  "SPBC660.05"    "SPAC32A11.02c"
##  [9] "SPBC21C3.19"   "SPAC139.05"    "SPAC27D7.09c"  "SPAC1002.19"  
## [13] "SPAC1002.17c"  "SPAC22H10.13"  "SPBC106.02c"   "SPAC19D5.01"  
## [17] "SPBC839.06"    "SPBC1105.14"

Now, we can use this list of gene names to subset the eigenvalues table:

top_loadings <- pc_loadings %>% 
  filter(gene %in% top_genes)

Exercise:

Recreate the plot below. Save the plot in an object called loadings_plot

Link to full exercise


Using the patchwork package, we put these two plots side by side to get a more complete and interpretable view of our PCA analysis:

library(patchwork)

# Adjust some aspects of each plot
pca_plot <- pca_plot + 
  coord_fixed(ratio = 0.4) + 
  labs(title = "PC scores") +
  theme(legend.position = "none")

loadings_plot <- loadings_plot + 
  coord_fixed(ratio = 0.4) + 
  labs(title = "PC loadings")

# Put them together
(pca_plot | loadings_plot) + plot_annotation(tag_levels = "A")

Notice how we’ve forced our PC1 axis to be around twice as long as the PC2 axis (with coord_fixed(). This is because the variance explained by PC1 is about twice as large as the variance explained by PC2.

Extra: Quick visualisation using autoplot()

This was all very useful to understand how PCA works and how to fully explore its results. However, it can sometimes be convenient to produce these graphs quickly.

We can use the autoplot() function from the ggfortify package to do this:

library(ggfortify)
autoplot(sample_pca)

There’s also information about how much variation each principal component axis explains.

We can annotate this plot further by providing the autoplot() function with our sample information data (see ?autoplot.prcomp for more information):

autoplot(sample_pca, data = sample_info, colour = "minute", shape = "strain")

Finally, you could easily add information about loadings to the plot by adding the options loadings = TRUE, loadings.label = TRUE. But don’t do it, because we have thousands of genes, so the plot would be un-interpretable (and probably crash your computer)!

# Example using iris dataset
autoplot(prcomp(iris[, -5]), data = iris, colour = "Species",
         loadings = TRUE, loadings.label = TRUE)

Extra: Tidying the prcomp() output with the {broom} package

One useful package that is part of the {tidyverse} collection is the {broom} package, which is used to convert the output of statistical objects into tibbles.

Although {broom} doesn’t recognise objects from every statistical object, it does work with prcomp objects, and we could have used it instead of tidying our data manually.

Here is how we could have used it:

library(broom)

# PC variances (eigen values)
tidy(sample_pca, matrix = "eigenvalues")

# variable loadings
tidy(sample_pca, matrix = "loadings")

Final thoughts

PCA is an essential and very powerful technique for exploratory data analysis of multivariate data. As you see from this example, we can quickly see how our data is structured, based on transcriptome-wide signatures.

Note that lack of clustering in a PCA does not mean nothing is happening. Imagine an RNAseq experiment where only a dozen genes changed due to your treatment. Because there’s few genes, transcriptome-wide you might not capture structure in the data.

In summary, although this analysis gives some indication of how much influence a particular gene has along each PC, it is not a replacement for specialised statistical models and tools that assess specific hypothesis for differential expression between conditions.

Finally, it’s worth pointing out that PCA is a linear algorithm. There are other methods for non-linear dimensionality reduction, which can sometimes perform better than PCA (but perhaps less easily interpretable). Two common methods are called UMAP and t-SNE. These can be found in the uwot and Rtsne packages, respectively.


back to lesson’s homepage