In your project’s directory, create a new script called 02_eda_count_data.R
(“eda” stands for “Exploratory Data Analysis”), and start with the following code:
We will start our data exploration by understanding how our expression data are distributed in each sample. We could do this, for example, using frequency polygons (similar to histograms) for our expression data found in the trans_cts
object.
To produce the plot above, we need to do a few things:
trans_cts
) to “long” format, i.e. with 3 columns: gene
, sample
, cts
, rather than one column per samplesample_info
tableThere’s two functions that allow us to convert tables from a “wide” to a “long” format and vice-versa: pivot_longer()
and pivot_wider()
.
What we want in our case is to go from a “wide” to “long” table, so we use pivot_longer()
, which needs four things:
Like so:
# "gather" the counts data
trans_cts_long <- trans_cts %>%
pivot_longer(cols = wt_0_r1:mut_180_r3,
names_to = "sample",
values_to = "cts")
trans_cts_long
## # A tibble: 216,396 x 3
## gene sample cts
## <chr> <chr> <dbl>
## 1 SPAC212.11 wt_0_r1 4.95
## 2 SPAC212.11 wt_0_r2 4.97
## 3 SPAC212.11 wt_0_r3 5.66
## 4 SPAC212.11 wt_15_r1 5.45
## 5 SPAC212.11 wt_15_r2 5.44
## 6 SPAC212.11 wt_15_r3 6.50
## 7 SPAC212.11 wt_30_r1 5.08
## 8 SPAC212.11 wt_30_r2 5.25
## 9 SPAC212.11 wt_30_r3 5.32
## 10 SPAC212.11 wt_60_r1 4.85
## # … with 216,386 more rows
If we wanted to do the reverse, we could use the pivot_wider()
function:
## # A tibble: 6,011 x 37
## gene wt_0_r1 wt_0_r2 wt_0_r3 wt_15_r1 wt_15_r2 wt_15_r3 wt_30_r1 wt_30_r2
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SPAC… 4.95 4.97 5.66 5.45 5.44 6.50 5.08 5.25
## 2 SPAC… 5.44 6.21 6.17 5.94 6.33 6.41 6.71 7.15
## 3 SPAC… 5.75 5.06 5.54 5.68 5.18 5.10 5.82 6.31
## 4 SPNC… 5.73 4.87 4.94 5.57 5.32 4.20 5.17 4.77
## 5 SPAC… 7.05 6.84 6.79 6.87 6.09 6.19 6.91 6.66
## 6 SPAC… 5.34 5.34 5.31 5.32 5.99 5.30 7.61 7.48
## 7 SPAC… 6.50 6.23 6.70 7.03 6.92 7.53 6.44 6.68
## 8 SPAC… 7.24 7.46 7.48 7.09 7.46 7.65 6.88 7.13
## 9 SPNC… 5.96 5.69 5.94 5.92 5.87 4.84 7.38 7.30
## 10 SPAC… 6.64 6.65 6.78 4.62 5.61 6.15 5.66 6.50
## # … with 6,001 more rows, and 28 more variables: wt_30_r3 <dbl>,
## # wt_60_r1 <dbl>, wt_60_r2 <dbl>, wt_60_r3 <dbl>, wt_120_r1 <dbl>,
## # wt_120_r2 <dbl>, wt_120_r3 <dbl>, wt_180_r1 <dbl>, wt_180_r2 <dbl>,
## # wt_180_r3 <dbl>, mut_0_r1 <dbl>, mut_0_r2 <dbl>, mut_0_r3 <dbl>,
## # mut_15_r1 <dbl>, mut_15_r2 <dbl>, mut_15_r3 <dbl>, mut_30_r1 <dbl>,
## # mut_30_r2 <dbl>, mut_30_r3 <dbl>, mut_60_r1 <dbl>, mut_60_r2 <dbl>,
## # mut_60_r3 <dbl>, mut_120_r1 <dbl>, mut_120_r2 <dbl>, mut_120_r3 <dbl>,
## # mut_180_r1 <dbl>, mut_180_r2 <dbl>, mut_180_r3 <dbl>
Exercise:
Convert the
raw_cts
created above to a “long” format using thepivot_longer()
function
The next thing we want to do is to add information about each sample to our gene expression table.
## # A tibble: 36 x 4
## sample strain minute replicate
## <chr> <chr> <dbl> <chr>
## 1 wt_0_r1 wt 0 r1
## 2 wt_0_r2 wt 0 r2
## 3 wt_0_r3 wt 0 r3
## 4 wt_15_r1 wt 15 r1
## 5 wt_15_r2 wt 15 r2
## 6 wt_15_r3 wt 15 r3
## 7 wt_30_r1 wt 30 r1
## 8 wt_30_r2 wt 30 r2
## 9 wt_30_r3 wt 30 r3
## 10 wt_60_r1 wt 60 r1
## # … with 26 more rows
We can do this by joining the trans_cts_long
table with the sample_info
table.
Joining tables is an important task that is often needed in multi-layered data.
There are several different kinds of joins that can be performed. Here’s a couple:
Look at the “Combine Data Sets” section of the dplyr cheatsheet to understand them better.
In our case, we know that all samples in our counts table also occur in the sample_info
table, so there’s a few different kinds of joins that would work.
For safety, let’s use full_join()
, to ensure we retain all data from both tables:
## # A tibble: 216,396 x 6
## gene sample cts strain minute replicate
## <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 SPAC212.11 wt_0_r1 4.95 wt 0 r1
## 2 SPAC212.11 wt_0_r2 4.97 wt 0 r2
## 3 SPAC212.11 wt_0_r3 5.66 wt 0 r3
## 4 SPAC212.11 wt_15_r1 5.45 wt 15 r1
## 5 SPAC212.11 wt_15_r2 5.44 wt 15 r2
## 6 SPAC212.11 wt_15_r3 6.50 wt 15 r3
## 7 SPAC212.11 wt_30_r1 5.08 wt 30 r1
## 8 SPAC212.11 wt_30_r2 5.25 wt 30 r2
## 9 SPAC212.11 wt_30_r3 5.32 wt 30 r3
## 10 SPAC212.11 wt_60_r1 4.85 wt 60 r1
## # … with 216,386 more rows
We can join tables by more than one column, which is not necessary here.
Now that we have our expression data in a “tidy” format, we can explore our gene’s expression distribution:
trans_cts_long %>%
ggplot(aes(cts, colour = replicate)) +
geom_freqpoly(binwidth = 1) +
facet_grid(strain ~ minute)
Exercise:
- Produce a similar plot for the raw count data. (hint: you might want to try log-transforming the data).
- Try out other ways to visualise these data, for example as a boxplot.
Scatter plots are useful e.g. to visually inspect similarity (correlation) between variables.
For this type of plot, the “spread” version of our counts table is useful. For example, let’s look at the correlation between two replicates of WT at 0 min:
As we can see, these are very tightly correlated, which might be expected given these are biological replicates of the same strain and treatment!
Exercise:
Compare the expression between a WT cell at T0 and T30. What can you conclude from this?
Because we have so many samples, it would be nice to get an overview of the correlation across all samples. One way to do this is with a correlation plot. The corrr
package provides with several convenience functions to calculate and visualise correlations. We won’t go into details in this lesson, but do check the package’s documentation if you are interested in knowing more. Here is a quick example:
To achieve our goal, we first have to calculate the gene expression correlation between every pair of samples. We can use the cor()
function to do this, and give it our table of gene counts:
# Calculate all correlations
trans_cts_corr <- trans_cts %>%
# remove the column "gene", which we do not want to calculate correlation on
select(-gene) %>%
# we use Spearman's correlation, a non-parametric metric based on ranks
cor(method = "spearman")
# Visualise the correlations between the first 5 samples
trans_cts_corr[1:5, 1:5]
## wt_0_r1 wt_0_r2 wt_0_r3 wt_15_r1 wt_15_r2
## wt_0_r1 1.0000000 0.9866445 0.9865771 0.9429146 0.9560555
## wt_0_r2 0.9866445 1.0000000 0.9934777 0.9344005 0.9524027
## wt_0_r3 0.9865771 0.9934777 1.0000000 0.9341774 0.9516065
## wt_15_r1 0.9429146 0.9344005 0.9341774 1.0000000 0.9749322
## wt_15_r2 0.9560555 0.9524027 0.9516065 0.9749322 1.0000000
Now, we can use the rplot()
function to help us visualise this:
So far we have looked at RNA-Seq data that has been preprocessed for us (trans_cts
) using a sophisticated method (DESeq2
).
In the following, we will also take a look at the raw normalised data (raw_cts
) and compare these to the transformed version to explore what is achieve by it.
Make sure that you have a “tidy” version of these raw normalised counts:
raw_cts_long <- raw_cts %>%
pivot_longer(-gene, names_to = "sample", values_to = "cts") %>%
full_join(sample_info, by = "sample")
The summary function is quite handy here as a first approach to exploring differences between raw and normalized counts.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 87 274 2042 704 8758155
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.198 6.915 8.273 8.369 9.532 23.062
Raw and normalized data live on VERY different scales - the raw_cts_long
has a few very large count values while most values are comparatively small. Partially, this is because the transformed counts are on a log-scale. Let’s visualise the correlation between two samples like we did above, but for these raw counts:
As you see, the range of values is very large and we can hardly see the relationship between these data. In this case, it helps to either log-transform the data beforehand (e.g. using mutate()
to modify the columns we want to), or change the axis scales directly on the plot:
# We add a "pseudocount" of 1 to the count data because otherwise log(0) = -Inf
raw_cts %>%
ggplot(aes(wt_0_r1 + 1, wt_0_r2 + 1)) +
geom_point() +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2")
From this, we can already see that the raw data seem more variable (especially for low values) than the normalised data. Let’s examine the relationship between mean and variance of each gene across all samples:
# Summarise the data and then plot
raw_cts_long %>%
# for each gene
group_by(gene) %>%
# get mean and variance
summarise(mean_cts = mean(cts),
var_cts = var(cts)) %>%
# plot one against the other
ggplot(aes(mean_cts, var_cts)) +
geom_point() +
geom_abline(colour = "brown") +
scale_x_continuous(trans = "log2") +
scale_y_continuous(trans = "log2")
## `summarise()` ungrouping output (override with `.groups` argument)
This positive relationship between mean and variance is very typical of count data. Count data are often modelled using a Poisson model; however in poisson-distributed data one expects the mean and variance to be the same. As can be seen from the plot above, although the variance increases with the mean, they are not the same (this is indicated by the x=y identity line in brown). This is why you need to be careful to apply specialised statistical methods (such as those provided by DESeq2
) that model these properties of the data (more specifically, DESeq2
uses a negative binomial model, which allows for different scaling of the variance in count data).
Compare this with the normalised data:
trans_cts_long %>%
group_by(gene) %>%
summarise(mean_cts = mean(cts),
var_cts = var(cts)) %>%
ggplot(aes(mean_cts, var_cts)) +
geom_point()
## `summarise()` ungrouping output (override with `.groups` argument)
In conclusion, these mean-variance plots verify that DESeq2
’s transformation approach is effectively achieving its main goal: to stabilize the variance, as the transformed data do no longer show a (strong) dependency between a gene’s mean and its variance.