Data Visualization with R

In this lesson we will learn how to explore tabular data, perform basic sanity checks and do visualizations with ggplot2.

Setting things up

In this lesson we are going to use some R packages containing functions we need:

  • tidyverse which contains a collection of several packages for data manipulation and visualization (including the ggplot2 package)
  • visdat to give you a quick and visual representation of your data
  • plotly to produce interactive graphics

You can install packages by using the install.packages() function. For example, if you wanted to install the plotly package you would do:

install.packages("plotly")
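When several packages are needed at once, it can be convenient to check which ones are missing before installing. The snippet below is a minimal sketch; the helper name missing_packages() is hypothetical (it is not part of any package), and the install line is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this lesson
to_install <- missing_packages(c("tidyverse", "visdat", "plotly"))
to_install

# Uncomment to install whatever is missing
# if (length(to_install) > 0) install.packages(to_install)
```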

Once the packages are installed, we can load them in the current R session with the library() function:

library(tidyverse)
library(visdat)
library(plotly)

You should also set your working directory to the folder containing this lesson’s materials. This will vary depending on your operating system.

setwd("~/Desktop/slcu_r_course/module02_data_viz_ggplot/materials")

In this example, our working directory was set to the module02_data_viz_ggplot/materials folder.
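If you are unsure where R is currently looking for files, a quick check such as the following may help. It assumes the folder layout used in this lesson, with the data file one level up in a data folder.

```r
# Print the current working directory
getwd()

# Check that the data file is reachable from here; FALSE suggests the
# working directory is not the materials folder
file.exists("../data/burghardt_et_al_2015_expt1.csv")
```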

Now our workspace is ready. The only thing missing is the actual data set that we want to explore and visualize!

For this lesson we will use a slightly simplified version of a dataset published by Burghardt et al. 2015.

Read and check the data

The simplified version of the data set can be found in the module02_data_viz_ggplot/data folder (burghardt_et_al_2015_expt1.csv).

Because our working directory is the module02_data_viz_ggplot/materials folder, we read the data like so:

# Read data and store it in expt1 object
expt1 <- read_csv("../data/burghardt_et_al_2015_expt1.csv")
## Parsed with column specification:
## cols(
##   genotype = col_character(),
##   background = col_character(),
##   temperature = col_integer(),
##   fluctuation = col_character(),
##   day.length = col_integer(),
##   vernalization = col_character(),
##   survival.bolt = col_character(),
##   bolt = col_character(),
##   days.to.bolt = col_integer(),
##   days.to.flower = col_integer(),
##   rosette.leaf.num = col_integer(),
##   cauline.leaf.num = col_integer(),
##   blade.length.mm = col_double(),
##   total.leaf.length.mm = col_double(),
##   blade.ratio = col_double()
## )

You will notice that the read_csv() function gives you a message about a “column specification”. This refers to the type of data it thinks each of your columns contains.

In this case, some columns contain “character”-type data (i.e. text) and others contain “numeric”-type data (which can be “integer” if they have no decimal points, or “double” if they do).
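If the guessed types are not what you want, read_csv() accepts a col_types argument to set them explicitly. The sketch below uses a small made-up file written to a temporary location (the values are illustrative, not taken from the real data set). Depending on your readr version, whole-number columns may be guessed as “double” rather than “integer”, so being explicit can avoid surprises.

```r
library(readr)

# Write a tiny made-up CSV file to a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("genotype,temperature,blade.length.mm",
             "Col Ama,12,12.9",
             "Col Ama,22,10.5"), tmp)

# Read it, stating the type of each column explicitly
toy <- read_csv(tmp, col_types = cols(
  genotype = col_character(),
  temperature = col_integer(),
  blade.length.mm = col_double()
))
toy
```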

Inspecting the data - quality control

To get a glimpse of the data, we can type the name of the variable we have stored it in (expt1).

expt1
## # A tibble: 957 x 15
##    genotype background temperature fluctuation day.length vernalization
##    <chr>    <chr>            <int> <chr>            <int> <chr>        
##  1 Col Ama  Col                 12 Con                 16 NV           
##  2 Col Ama  Col                 12 Con                 16 NV           
##  3 Col Ama  Col                 12 Con                 16 NV           
##  4 Col Ama  Col                 12 Con                 16 NV           
##  5 Col Ama  Col                 12 Con                 16 NV           
##  6 Col Ama  Col                 12 Con                 16 NV           
##  7 Col Ama  Col                 12 Con                 16 NV           
##  8 Col Ama  Col                 12 Con                 16 NV           
##  9 Col Ama  Col                 12 Con                  8 NV           
## 10 Col Ama  Col                 12 Con                  8 NV           
## # ... with 947 more rows, and 9 more variables: survival.bolt <chr>,
## #   bolt <chr>, days.to.bolt <int>, days.to.flower <int>,
## #   rosette.leaf.num <int>, cauline.leaf.num <int>, blade.length.mm <dbl>,
## #   total.leaf.length.mm <dbl>, blade.ratio <dbl>

This shows you the first 10 rows of the data and only the columns that fit on the screen.

Challenge: How many rows and columns does our data set contain?

There are other ways to visually inspect your data:

  • with the View() function you can have access to an interactive table, where you can sort, filter and search your data set with keywords (this does not modify your original variable):
View(expt1)
  • with the glimpse() function you can get access to the structure of your data:
glimpse(expt1)
## Observations: 957
## Variables: 15
## $ genotype             <chr> "Col Ama", "Col Ama", "Col Ama", "Col Ama...
## $ background           <chr> "Col", "Col", "Col", "Col", "Col", "Col",...
## $ temperature          <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1...
## $ fluctuation          <chr> "Con", "Con", "Con", "Con", "Con", "Con",...
## $ day.length           <int> 16, 16, 16, 16, 16, 16, 16, 16, 8, 8, 8, ...
## $ vernalization        <chr> "NV", "NV", "NV", "NV", "NV", "NV", "NV",...
## $ survival.bolt        <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ bolt                 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ days.to.bolt         <int> 28, 29, 31, 31, 32, 33, 34, 35, 69, 72, 7...
## $ days.to.flower       <int> 43, 44, 43, 42, 44, 47, 47, 49, 90, 91, 9...
## $ rosette.leaf.num     <int> 18, 15, 13, 17, 19, 14, 15, 18, 53, 49, 5...
## $ cauline.leaf.num     <int> 6, 5, 4, 5, 4, 4, 3, 5, 6, 5, 6, 9, 6, 9,...
## $ blade.length.mm      <dbl> 12.9, 10.5, 13.2, 14.6, 13.3, 14.7, 13.0,...
## $ total.leaf.length.mm <dbl> 21.1, 19.1, 23.4, 27.2, 20.4, 25.3, 23.2,...
## $ blade.ratio          <dbl> 0.6113744, 0.5497382, 0.5641026, 0.536764...

Challenge: What types of variables do we have in our data set? What could they mean? What type of variable is “bolt”? How about “temperature”?

  • dim() returns basic dimensions of your data set, i.e. numbers of rows and columns.
dim(expt1)
## [1] 957  15
  • finally, for numeric variables it is convenient to use the summary() function, which generates basic statistics for each numeric column.
summary(expt1)
##    genotype          background         temperature    fluctuation       
##  Length:957         Length:957         Min.   :12.00   Length:957        
##  Class :character   Class :character   1st Qu.:12.00   Class :character  
##  Mode  :character   Mode  :character   Median :12.00   Mode  :character  
##                                        Mean   :16.98                     
##                                        3rd Qu.:22.00                     
##                                        Max.   :22.00                     
##                                                                          
##    day.length    vernalization      survival.bolt          bolt          
##  Min.   : 8.00   Length:957         Length:957         Length:957        
##  1st Qu.: 8.00   Class :character   Class :character   Class :character  
##  Median :16.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :12.01                                                           
##  3rd Qu.:16.00                                                           
##  Max.   :16.00                                                           
##                                                                          
##   days.to.bolt    days.to.flower   rosette.leaf.num cauline.leaf.num
##  Min.   : 15.00   Min.   : 21.00   Min.   :  5.00   Min.   : 1.000  
##  1st Qu.: 38.00   1st Qu.: 46.00   1st Qu.: 24.00   1st Qu.: 5.000  
##  Median : 57.00   Median : 66.00   Median : 40.00   Median : 8.000  
##  Mean   : 66.04   Mean   : 71.59   Mean   : 39.71   Mean   : 7.208  
##  3rd Qu.: 85.00   3rd Qu.: 92.00   3rd Qu.: 53.00   3rd Qu.: 9.000  
##  Max.   :162.00   Max.   :182.00   Max.   :112.00   Max.   :17.000  
##                   NA's   :83       NA's   :95       NA's   :96      
##  blade.length.mm total.leaf.length.mm  blade.ratio    
##  Min.   : 7.10   Min.   : 9.00        Min.   :0.0000  
##  1st Qu.:18.00   1st Qu.:29.10        1st Qu.:0.5564  
##  Median :20.95   Median :34.60        Median :0.5948  
##  Mean   :21.11   Mean   :34.69        Mean   :0.5874  
##  3rd Qu.:24.30   3rd Qu.:40.27        3rd Qu.:0.6342  
##  Max.   :59.00   Max.   :66.30        Max.   :6.5556  
##  NA's   :327     NA's   :303          NA's   :304
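The NA's rows in the summary() output count missing values per column. The same counts can be obtained directly, as in this small sketch on a made-up data frame (the values are invented for illustration only):

```r
# A tiny made-up data frame with some missing values
toy <- data.frame(
  genotype = c("g1", "g1", "g2", "g2"),
  days.to.flower = c(43, 44, NA, 42),
  blade.length.mm = c(12.9, NA, NA, 14.6)
)

# Summary of a single numeric column
summary(toy$days.to.flower)

# Number of missing values in each column
colSums(is.na(toy))
```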

So far, we have already used a handful of R functions, though we have just barely started:

  • install.packages()
  • library()
  • read_csv()
  • View()
  • glimpse()
  • summary()
  • dim()

Of course, it is difficult to memorize all the function names, what they do and how you should use them. Luckily, R has very convenient built-in help. To use it, type the name of a function (or any other object you are interested in) preceded by ?

?summary

R help might seem cryptic at first, but you will get used to it. You can always scroll down to the “Examples” section and try running some of them yourself to get an idea of what the function in question is capable of.

A web search is often just as effective as (if not more than!) looking at the help pages.
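A few other base R helpers complement the ? shortcut. This is only a short sketch of calls that may be handy when you do not remember a function's exact name:

```r
# Same as ?summary
help("summary")

# Run the code from the "Examples" section of a help page
example("summary")

# List objects whose names match a pattern (here: names starting with "read")
apropos("^read")
```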

Challenge: What does the round() function do?

Challenge: Can you specifically look at the end of your data instead of its beginning? How would you do it in R? (hint: ?tail)

Visual inspection of your data

To get a “bird’s-eye” view of the data and identify its structure and potential problems, we can simply plot it in Rothko style using the vis_dat() function.

vis_dat(expt1)

Challenge: What is the most common data type in our data set? Can you spot any potential problems?

Missing values

You might have noticed that some of the cells are plotted in gray: these indicate missing values. Missing values occur in a data set when a certain observation was not collected, and they can cause problems in downstream analysis if we are not careful. There are several strategies for dealing with missing values:

  • remove rows with missing values completely (the safest option, though can result in substantial data loss);
  • ignore missing values when you can (can you?);
  • impute value based on surrounding values (the most risky).
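Before choosing a strategy, it can help to see how much is missing in each column. The visdat package also provides vis_miss(), which focuses on missingness only. The sketch below uses R's built-in airquality data set rather than the lesson data, so it runs without any files.

```r
library(visdat)

# Plot missing vs present values for each column of a built-in data set;
# the column labels show the percentage of missing values
p_miss <- vis_miss(airquality)
p_miss
```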

To stay on the safe side now, we will simply remove all the rows with missing values from our beloved expt1 data set.

expt1 <- drop_na(expt1)
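Note that drop_na() without further arguments removes a row if any of its columns is missing. The following sketch, on a made-up data frame, compares this with dropping rows based on a single column and with ignoring missing values inside a calculation:

```r
library(tidyr)

# A tiny made-up data frame with missing values in two columns
toy <- data.frame(
  id = 1:4,
  days.to.flower = c(43, NA, 47, 49),
  blade.length.mm = c(12.9, 10.5, NA, 14.6)
)

# Drop rows with a missing value in ANY column (keeps ids 1 and 4)
drop_na(toy)

# Drop rows only where days.to.flower is missing (keeps ids 1, 3 and 4)
drop_na(toy, days.to.flower)

# Keep all rows, but ignore missing values when computing a statistic
mean(toy$days.to.flower, na.rm = TRUE)
```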

Challenge: How many rows are left in the data set after we have dropped missing values?

Plots! Plots! Plots!

Now that we have learned some basics of our data set, we will go straight to plotting to get even more insights about this experiment.

For this we will be using the ggplot2 package, which follows a general scheme termed the “grammar of graphics”. This might sound scary, but just think of it as a set of simple building blocks for a plot. By combining and layering several blocks we can create our dream plot for a dream paper or for a lab meeting.

To build a graph we need several blocks:

  • data
  • aesthetics
  • geometric object (type of a plot)
  • statistical transformations
  • coordinate system
  • positional adjustments
  • faceting

Let’s focus on the first three: data, aesthetics and geometric object.

  • data - well, this is obvious, we need some data
  • geom_objects - actual objects that we put on a plot. A plot must have at least one geom_object. Examples include:
    • points (geom_point for scatter plots, dot plots)
    • lines (geom_line for trend lines, time series)
  • aesthetics - things you can see and that depend on the data. For example, the position (x and y), colour, shape, line type, size, etc… Aesthetics can be set with the aes() function. Note that each geom_ object understands only a subset of aesthetics. For details, check its help page (e.g. ?geom_line)

You can find more information about how to build graphs with ggplot2 in this very useful cheatsheet.
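As a rough template, these three blocks fit together as shown below. The sketch uses R's built-in mtcars data set (not the lesson data) purely for illustration.

```r
library(ggplot2)

# data:       mtcars (built into R)
# aesthetics: car weight on x, miles per gallon on y
# geometry:   points
p_template <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()
p_template
```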

Building a graph with ggplot2

Everyone (except Excel) likes boxplots, so we will start by plotting the days.to.flower variable measured for different genotypes.

The ggplot() function initialises a plot. At the very minimum it needs a dataset to plot:

ggplot(expt1)

But this simply produces a blank (well, grey) canvas!

We haven’t told ggplot what aesthetics (this is ggplot2 terminology) we want it to map onto this blank canvas. For a boxplot we need to tell it what our x and y variables are.

ggplot(expt1, aes(x = genotype, y = days.to.flower))

As you can see, ggplot “mapped” the values in the genotype and days.to.flower variables of our table to the x and y aesthetics of the plot.

But this is still quite an empty plot, because we haven’t told ggplot what geometries we want it to draw in the canvas. In our case, we want a boxplot, which we can add on top of the created canvas by adding (literally +) a geom_boxplot():

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot()

Challenge: can you make a violin plot instead? (hint: ?geom_violin)

Adding multiple layers

Let’s now layer a couple of geom_objects on the same plot. Say, we want to have points for the individual values together with our boxplots:

ggplot(expt1, aes(genotype, rosette.leaf.num)) +
  geom_jitter() +
  geom_boxplot()
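When points and boxplots are combined, outliers are drawn twice: once by the boxplot and once as ordinary points. One possible tweak, sketched here on the mpg data set that ships with ggplot2, is to hide the boxplot's own outlier markers and make both layers semi-transparent. The width and alpha values are arbitrary choices.

```r
library(ggplot2)

p_layers <- ggplot(mpg, aes(class, hwy)) +
  # spread points horizontally only, so the y values stay exact
  geom_jitter(width = 0.2, height = 0, alpha = 0.5) +
  # hide the boxplot's outlier points, since all points are already drawn
  geom_boxplot(outlier.shape = NA, alpha = 0.5)
p_layers
```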

Challenge: can you modify this plot so that the points appear on top of the boxplots rather than behind them?

Colours!

We can also modify the appearance of our geometry, for example its colour:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot(colour = "red")

Or perhaps the colour that fills the boxplots:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot(colour = "red", fill = "royalblue")

Or even its transparency:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot(colour = "red", fill = "royalblue", alpha = 0.5)

This is all very colourful, but rather gratuitous (what is this colour telling us about the data?!).

What if we wanted to colour our boxplots according to which fluctuation treatment the plants were exposed to? In ggplot2 language, we want to “map” the values of fluctuation onto the colour aesthetic of our plot. This should therefore go inside the aes() part of our graph:

ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
  geom_boxplot()

Wow! Can you see what ggplot did for you!? It automatically split the data of each genotype into two groups and coloured them accordingly.
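The difference between setting a colour (outside aes()) and mapping a variable to colour (inside aes()) is a common source of confusion. The sketch below contrasts the two on the mpg data set that ships with ggplot2, and includes a frequent mistake for comparison.

```r
library(ggplot2)

# Setting: a fixed colour given outside aes(); every box is red
p_set <- ggplot(mpg, aes(class, hwy)) +
  geom_boxplot(colour = "red")

# Mapping: colour depends on the values of a variable (drv)
p_map <- ggplot(mpg, aes(class, hwy, colour = drv)) +
  geom_boxplot()

# Common mistake: a constant inside aes() is treated as data, so the
# boxes get a default palette colour and a legend entry labelled "red"
p_oops <- ggplot(mpg, aes(class, hwy, colour = "red")) +
  geom_boxplot()

p_set
p_map
p_oops
```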

Now, let’s say we wanted to visualise the individual data points (not coloured) behind our boxplots (coloured by fluctuation):

ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
  geom_jitter() +
  geom_boxplot(alpha = 0.5)

As it is, the points are coloured as well, because the colour aesthetic is mapped to all geometries of the graph. This happens because we defined it within the ggplot() function, which affects every geom_object that comes afterwards.

But we can also define aesthetics inside each geometry, for example:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_jitter() +
  geom_boxplot(aes(fill = fluctuation), alpha = 0.5)

Challenge: Say we are particularly interested in the relationship between the number of rosette leaves and blade length in mm per genotype.

Visualize this relationship with a scatter plot (geom_point()) between blade.length.mm and rosette.leaf.num and colour the points by genotype.

What happens if you colour the points by days.to.bolt?

Faceting

Often, our data has several grouping variables, and colours alone are not enough to fully represent the differences in the dataset.

For example, the scatterplot produced in the previous exercise is pretty, but very crowded. What if we wanted to isolate each genotype in individual plots?

This is easy to accomplish with ggplot2 by adding a “facet” layer to our plot. There are two types of facets:

  • facet_grid() - arranges sub-plots in rows and/or columns
  • facet_wrap() - arranges sub-plots in a ribbon that “wraps” around after a fixed number of plots

Let’s start with facet_grid() and see it in action:

ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) +
    geom_point() +
    facet_grid(genotype ~ temperature)

In the code above, we use facet_grid() to define variables that partition our data by rows and columns, using the notation (rows ~ columns).
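Besides the formula notation, recent versions of ggplot2 (3.0.0 and later) also accept named rows and cols arguments with vars(), which some people find easier to read. A small sketch on the built-in mpg data set:

```r
library(ggplot2)

# Equivalent to facet_grid(drv ~ cyl): drv defines rows, cyl defines columns
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(rows = vars(drv), cols = vars(cyl))
p_grid
```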

Challenge: In the previous graph, colouring by genotype is redundant with the faceting. Can you think of a more useful way to colour the points?

It is possible to use facet_grid() with a single variable:

# Facet by rows
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
    geom_point() +
    facet_grid(genotype ~ .)

# Facet by columns
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
    geom_point() +
    facet_grid(. ~ genotype)

When we are only partitioning by one variable, often facet_wrap() produces a better display. For example:

ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
    geom_point() +
    facet_wrap( ~ genotype)
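facet_wrap() has a few arguments that control the layout, such as ncol (number of columns) and scales (whether panels share axes). The sketch below uses the built-in mpg data set; the particular values are arbitrary.

```r
library(ggplot2)

p_wrap <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # four panels per row; each panel gets its own y-axis range
  facet_wrap(~ class, ncol = 4, scales = "free_y")
p_wrap
```

Free scales make patterns within each panel easier to see, but they make comparisons between panels harder, so use them deliberately.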

Challenge: Can you modify the previous graph to facet the data by the fluctuation treatment (as rows) and day.length (as columns) and colour the points by genotype?

In conclusion, by effectively combining facets, colours and other aesthetics you can represent many dimensions of your data in a single graph!


Challenge: Can you produce a graph similar to Fig. 2B-C of Burghardt et al. 2015?

Hint: facet the plot by day.length and temperature and fill the boxplots by fluctuation.

Interactivity!

But even this is not the limit. We can easily turn our plots into interactive ones using the plotly package.

First we store our plot in a variable and then pass it to the special ggplotly() function.

# Store plot in a variable called p1
p1 <- ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) + 
  geom_point() +
  facet_wrap(~genotype)

# Render an interactive plot using ggplotly function
ggplotly(p1)
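An interactive plot can also be saved as an HTML file to share with others. The sketch below assumes the htmlwidgets package is available (it is normally installed together with plotly) and uses the built-in mtcars data set. The output file name is an arbitrary choice, and the file is written to a temporary folder.

```r
library(ggplot2)
library(plotly)

p <- ggplot(mtcars, aes(wt, mpg, colour = factor(cyl))) +
  geom_point()

# Show only the x and y values when hovering over a point
ip <- ggplotly(p, tooltip = c("x", "y"))

# Save as an HTML page; selfcontained = FALSE writes the supporting
# files into a folder next to the page
out_file <- file.path(tempdir(), "interactive_plot.html")
htmlwidgets::saveWidget(ip, out_file, selfcontained = FALSE)
```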

Themes

Every element of a ggplot is modifiable. This is outside the scope of this module, but here are a few examples and references.

Themes modify the overall appearance of the plot. Some come with ggplot2 and many others can be obtained from other packages such as ggthemes (which also has some additional geom objects).

# Example of built-in ggplot2 themes
ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme_bw() +
  labs(title = "Black and white theme")

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme_classic() +
  labs(title = "Classic theme")

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Minimal theme")

The theme() function is used to modify individual elements of the plot. The possibilities are so vast that the easiest way is to do a web-search for your intended purpose.

For example, a web-search for “vertical labels x axis ggplot2” returns as one of the first hits this solution:

ggplot(expt1, aes(genotype, days.to.flower)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Or searching for “altering plot colours ggplot2” returns this page, which somewhere gives an interesting solution:

ggplot(expt1, aes(genotype, days.to.flower, fill = fluctuation)) +
  geom_boxplot() +
  scale_fill_brewer(palette="Dark2")
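If you prefer to pick the colours yourself, scale_fill_manual() takes a vector of colours, optionally named by the values of the mapped variable. The sketch below uses the built-in mpg data set, whose drv variable has the values "4", "f" and "r"; the colour choices are arbitrary.

```r
library(ggplot2)

p_manual <- ggplot(mpg, aes(class, hwy, fill = drv)) +
  geom_boxplot() +
  # one colour per value of drv
  scale_fill_manual(values = c("4" = "grey70", "f" = "steelblue", "r" = "tomato"))
p_manual
```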

Homework

Based on the principles outlined in this module, try and build a graph of your own dataset using ggplot2.

If you encounter any difficulties, we will discuss them in the next module!

Extras

Some other packages that add functionality to ggplot2:

  • gridExtra or patchwork to combine several plots together
  • ggthemes to add extra themes and geometries
  • ggridges to produce “ridge” plots
  • GGally for automatically plotting relationships between data
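As a small taste of one of these, here is a sketch of combining two plots with patchwork, assuming that package is installed. It uses the built-in mtcars data set.

```r
library(ggplot2)
library(patchwork)

p_left <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point()
p_right <- ggplot(mtcars, aes(factor(cyl), mpg)) +
  geom_boxplot()

# With patchwork loaded, + places plots side by side and / stacks them
combined <- p_left + p_right
combined
```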