We will need to load these 4 libraries
library(visdat) # for initial inspection of data
library(tidyverse) # for manipulation of data
library(plotly) # for building interactive plots
You should also set the working environment to the directory containing your course materials, for example:
setwd("/home/username/Documents/slcu_course/module04_practice/materials")
Please, note, here we are using read_tsv()
because original file is tab-separated, not comma-separated (‘csv’)
mydata <- read_tsv("../data/TPM-light-WT-17c-27c-RNA-seq-average-rep1-rep2_misexpressed.tsv")
## Parsed with column specification:
## cols(
## gene = col_character(),
## TPM_wt_17c_15min = col_double(),
## TPM_wt_17c_1hr = col_double(),
## TPM_wt_17c_4hr = col_double(),
## TPM_wt_27c_15min = col_double(),
## TPM_wt_27c_1hr = col_double(),
## TPM_wt_27c_4hr = col_double()
## )
Guide questions:
Hint: check the functions that we have discussed in the module02:
View()
,vis_dat()
,dim()
,glimpse()
,names()
dim(mydata) # dimensions of the table
## [1] 1035 7
names(mydata) # column names
## [1] "gene" "TPM_wt_17c_15min" "TPM_wt_17c_1hr"
## [4] "TPM_wt_17c_4hr" "TPM_wt_27c_15min" "TPM_wt_27c_1hr"
## [7] "TPM_wt_27c_4hr"
vis_dat(mydata) # visual representation of column types
Now, let’s reshape our data. We will do it in several steps and ultimately we will generate a tidy dataset, suitable for applying ggplot functions
Hint: one of two:
gather()
orspread()
should be able to help you. Check examples, documentation and select the appropriate function.
# gather the dataset - all columns except gene
data_long <- mydata %>%
gather(sample, expression, -gene)
# This would also work - all columns between TPM_wt_17c_15min and TPM_wt_27c_4hr
data_long <- mydata %>%
gather(sample, expression, TPM_wt_17c_15min:TPM_wt_27c_4hr)
# Or this - all columns that contain the word "TPM" in their name
data_long <- mydata %>%
gather(sample, expression, contains("TPM"))
head(data_long)
## # A tibble: 6 x 3
## gene sample expression
## <chr> <chr> <dbl>
## 1 AT1G01060 TPM_wt_17c_15min 377
## 2 AT1G01240 TPM_wt_17c_15min 7.70
## 3 AT1G01390 TPM_wt_17c_15min 13.4
## 4 AT1G01500 TPM_wt_17c_15min 4.96
## 5 AT1G02460 TPM_wt_17c_15min 1.90
## 6 AT1G02660 TPM_wt_17c_15min 11.4
Check dimesions in the long-format dataset, do they match your expectatios? Hint: compare with dimensions of the original dataset with dim() function.
dim(data_long)
## [1] 6210 3
dim(mydata)
## [1] 1035 7
You might have noticed that now you have a column that contains several variables cramped together: time and temperature. Let’s split it in 4 columns: units
, genotype
, time
and temperature
using separate()
function.
data_long <- separate(data_long, sample,
into = c("units", "genotype", "temperature", "time"),
sep = "_")
Does the transformed dataset looks as expected? Use View()
or head()
to make sure.
units
and genotype
columnsThe values in those two columns are always the same, so we want to remove them.
Hint: use
select()
function.
data_long <- select(data_long, gene, expression, temperature, time)
When dealing with expression data we often have lots of genes that are barely expressed. Let’s get rid of them, this should slightly reduce the size of our dataset.
Hint: use
filter()
to select rows with expression values greater than 1 TPM.
data_long <- filter(data_long, expression >= 1)
Don’t forget to check the dimensions. Can you tell how many rows have been dropped?
dim(data_long)
## [1] 6076 4
After all the transformations our dataset is ready to be plotted. Use any type of visualization you find suitable to the problem to get a basic understanding of the structure of your data.
Hint: initialize plot with `
ggplot()
and select a suitablegeom_*
. Have a look at the examples provided in the module02 materials.
# Violin plot of TPM distribution in two temperatures and three time points
ggplot(data_long, aes(time, expression, colour = time)) +
geom_violin() +
facet_grid(temperature ~ .)
The previous plot shows a highly-skewed distribution of TPM values. This is because some genes are very highly expressed. We can improve the visualisation by using a log-scale:
ggplot(data_long, aes(time, expression, colour = time)) +
geom_violin() +
facet_grid(temperature ~ .) +
scale_y_log10() + # change the y-axis to a log-scale
annotation_logticks(sides = "lr") # add some tick marks on the left and right side
Say, now we are interested in 10 specific genes and we want to visualize their expression in different temperatures over time.
Here are our genes of interest:
genes_of_interest <- c("AT1G67090", "AT5G19240", "AT1G31580", "AT3G12580",
"AT1G80920", "AT3G54660", "AT2G25110", "AT1G19530",
"AT3G23990", "AT3G30775")
Filter data_long for genes of interest:
data_long_interest <- filter(data_long, gene %in% genes_of_interest)
Build line plots for every gene of interest, facet by gene_id
. Can you put measurements for both temperatures in a single plot?
ggplot(data_long_interest, aes(time, expression, group = temperature)) +
geom_point(aes(color = temperature)) +
geom_line(aes(color = temperature)) +
facet_wrap(~ gene, nrow = 5, scales = 'free') +
ggtitle("my favourite genes")
Notice we had to use the group
aesthetic to tell ggplot which points should be connected by the lines. In this case, we want to connect points that correspond to the two temperatures. In other words our points are part of two groups, 17o and 27o.
Well done! Let’s save expression values for our genes of interest in a separate csv file.
write_csv(data_long_interest, "../data/my_favourite_genes.csv")
Now do the same with pipes (excluding exploratory stages)!
data <- read_tsv("../data/TPM-light-WT-17c-27c-RNA-seq-average-rep1-rep2_misexpressed.tsv")
## Parsed with column specification:
## cols(
## gene = col_character(),
## TPM_wt_17c_15min = col_double(),
## TPM_wt_17c_1hr = col_double(),
## TPM_wt_17c_4hr = col_double(),
## TPM_wt_27c_15min = col_double(),
## TPM_wt_27c_1hr = col_double(),
## TPM_wt_27c_4hr = col_double()
## )
genes_of_interest <- c("AT1G67090", "AT5G19240", "AT1G31580", "AT3G12580",
"AT1G80920", "AT3G54660", "AT2G25110", "AT1G19530",
"AT3G23990", "AT3G30775")
mydata %>%
gather(sample, expression, -gene) %>%
separate(sample, sep = '_',
into = c("units", "genotype", "temperature", "time")) %>%
select(c("gene", "expression", "temperature", "time")) %>%
filter(expression >= 1 & gene %in% genes_of_interest) %>%
ggplot(aes(time, expression, group = temperature)) +
geom_point(aes(color = temperature)) +
geom_line(aes(color = temperature)) +
facet_wrap(~ gene, nrow = 5, scales = 'free') +
ggtitle("my favourite genes")