back to course home

Module04 practice task - solutions

1. Set up the environment

We will need to load these 4 libraries

library(visdat) # for initial inspection of data
library(tidyverse) # for manipulation of data
library(plotly) # for building interactive plots

You should also set the working environment to the directory containing your course materials, for example:

setwd("/home/username/Documents/slcu_course/module04_practice/materials")

2. Download the data

Please, note, here we are using read_tsv() because original file is tab-separated, not comma-separated (‘csv’)

mydata <- read_tsv("../data/TPM-light-WT-17c-27c-RNA-seq-average-rep1-rep2_misexpressed.tsv")
## Parsed with column specification:
## cols(
##   gene = col_character(),
##   TPM_wt_17c_15min = col_double(),
##   TPM_wt_17c_1hr = col_double(),
##   TPM_wt_17c_4hr = col_double(),
##   TPM_wt_27c_15min = col_double(),
##   TPM_wt_27c_1hr = col_double(),
##   TPM_wt_27c_4hr = col_double()
## )

3. Check that dataset does not contain obvious problems

Guide questions:

  • how many dimensions does the data have?
  • what are the column names?
  • what kind of data do these variables contain?
  • what are variable types (numeric, character)?
  • are there any missing values?

Hint: check the functions that we have discussed in the module02: View(), vis_dat(), dim(), glimpse(), names()

dim(mydata) # dimensions of the table
## [1] 1035    7
names(mydata)  # column names
## [1] "gene"             "TPM_wt_17c_15min" "TPM_wt_17c_1hr"  
## [4] "TPM_wt_17c_4hr"   "TPM_wt_27c_15min" "TPM_wt_27c_1hr"  
## [7] "TPM_wt_27c_4hr"
vis_dat(mydata)  # visual representation of column types

4. Data manipulation

Now, let’s reshape our data. We will do it in several steps and ultimately we will generate a tidy dataset, suitable for applying ggplot functions

4.1. Reshape dataset from wide to long format

Hint: one of two: gather() or spread() should be able to help you. Check examples, documentation and select the appropriate function.

# gather the dataset - all columns except gene
data_long <- mydata %>% 
    gather(sample, expression, -gene)

# This would also work - all columns between TPM_wt_17c_15min and TPM_wt_27c_4hr
data_long <- mydata %>% 
    gather(sample, expression, TPM_wt_17c_15min:TPM_wt_27c_4hr)

# Or this - all columns that contain the word "TPM" in their name
data_long <- mydata %>% 
    gather(sample, expression, contains("TPM"))

head(data_long)
## # A tibble: 6 x 3
##   gene      sample           expression
##   <chr>     <chr>                 <dbl>
## 1 AT1G01060 TPM_wt_17c_15min     377   
## 2 AT1G01240 TPM_wt_17c_15min       7.70
## 3 AT1G01390 TPM_wt_17c_15min      13.4 
## 4 AT1G01500 TPM_wt_17c_15min       4.96
## 5 AT1G02460 TPM_wt_17c_15min       1.90
## 6 AT1G02660 TPM_wt_17c_15min      11.4

Check dimesions in the long-format dataset, do they match your expectatios? Hint: compare with dimensions of the original dataset with dim() function.

dim(data_long)
## [1] 6210    3
dim(mydata)
## [1] 1035    7

4.2. Tidy varibales

You might have noticed that now you have a column that contains several variables cramped together: time and temperature. Let’s split it in 4 columns: units, genotype, time and temperature using separate() function.

data_long <- separate(data_long, sample, 
                      into = c("units", "genotype", "temperature", "time"),
                      sep = "_")

Does the transformed dataset looks as expected? Use View() or head() to make sure.

4.3 Drop units and genotype columns

The values in those two columns are always the same, so we want to remove them.

Hint: use select() function.

data_long <- select(data_long, gene, expression, temperature, time)

4.4 Filter rows with low expression values

When dealing with expression data we often have lots of genes that are barely expressed. Let’s get rid of them, this should slightly reduce the size of our dataset.

Hint: use filter() to select rows with expression values greater than 1 TPM.

data_long <- filter(data_long, expression >= 1)

Don’t forget to check the dimensions. Can you tell how many rows have been dropped?

dim(data_long)
## [1] 6076    4

5. Build plots

5.1. Initial visualization

After all the transformations our dataset is ready to be plotted. Use any type of visualization you find suitable to the problem to get a basic understanding of the structure of your data.

Hint: initialize plot with `ggplot() and select a suitable geom_*. Have a look at the examples provided in the module02 materials.

# Violin plot of TPM distribution in two temperatures and three time points
ggplot(data_long, aes(time, expression, colour = time)) +
    geom_violin() +
    facet_grid(temperature ~ .)

The previous plot shows a highly-skewed distribution of TPM values. This is because some genes are very highly expressed. We can improve the visualisation by using a log-scale:

ggplot(data_long, aes(time, expression, colour = time)) +
  geom_violin() +
  facet_grid(temperature ~ .) +
  scale_y_log10() +                 # change the y-axis to a log-scale
  annotation_logticks(sides = "lr")  # add some tick marks on the left and right side

5.2 Plot your favourite genes.

Say, now we are interested in 10 specific genes and we want to visualize their expression in different temperatures over time.

Here are our genes of interest:

genes_of_interest <- c("AT1G67090", "AT5G19240", "AT1G31580", "AT3G12580",
                       "AT1G80920", "AT3G54660", "AT2G25110", "AT1G19530",
                       "AT3G23990", "AT3G30775")

Filter data_long for genes of interest:

data_long_interest <- filter(data_long, gene %in% genes_of_interest)

Build line plots for every gene of interest, facet by gene_id. Can you put measurements for both temperatures in a single plot?

ggplot(data_long_interest, aes(time, expression, group = temperature)) +
    geom_point(aes(color = temperature)) +
    geom_line(aes(color = temperature)) +
    facet_wrap(~ gene, nrow = 5, scales = 'free') +
    ggtitle("my favourite genes")

Notice we had to use the group aesthetic to tell ggplot which points should be connected by the lines. In this case, we want to connect points that correspond to the two temperatures. In other words our points are part of two groups, 17o and 27o.

6. Save transformed data to file

Well done! Let’s save expression values for our genes of interest in a separate csv file.

write_csv(data_long_interest, "../data/my_favourite_genes.csv")

Bonus challenge for superheroes

Now do the same with pipes (excluding exploratory stages)!

data <- read_tsv("../data/TPM-light-WT-17c-27c-RNA-seq-average-rep1-rep2_misexpressed.tsv")
## Parsed with column specification:
## cols(
##   gene = col_character(),
##   TPM_wt_17c_15min = col_double(),
##   TPM_wt_17c_1hr = col_double(),
##   TPM_wt_17c_4hr = col_double(),
##   TPM_wt_27c_15min = col_double(),
##   TPM_wt_27c_1hr = col_double(),
##   TPM_wt_27c_4hr = col_double()
## )
genes_of_interest <- c("AT1G67090", "AT5G19240", "AT1G31580", "AT3G12580",
                       "AT1G80920", "AT3G54660", "AT2G25110", "AT1G19530",
                       "AT3G23990", "AT3G30775")

mydata %>% 
    gather(sample, expression, -gene) %>%
    separate(sample, sep = '_',
             into = c("units", "genotype", "temperature", "time")) %>%
    select(c("gene", "expression", "temperature", "time")) %>%
    filter(expression >= 1 & gene %in% genes_of_interest) %>%
    ggplot(aes(time, expression, group = temperature)) +
    geom_point(aes(color = temperature)) +
    geom_line(aes(color = temperature)) +
    facet_wrap(~ gene, nrow = 5, scales = 'free') +
    ggtitle("my favourite genes")