In this lesson we will learn how to explore tabular data, perform basic sanity checks and create visualizations with ggplot2.
In this lesson we are going to use some R packages containing functions we need:
tidyverse, which contains a collection of several packages for data manipulation and visualization (including the ggplot2 package)
visdat, to give you a quick and visual representation of your data
plotly, to produce interactive graphics
You can install packages by using the install.packages() function. For example, if you wanted to install the plotly package you would do:
install.packages("plotly")
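If you are not sure which of these packages are already installed, a small check can save some time. The following is only a sketch: the object names wanted, is_installed and to_install are made up for this illustration, and requireNamespace() is the base R function that reports whether a package can be loaded.

```r
# Packages used in this lesson
wanted <- c("tidyverse", "visdat", "plotly")

# requireNamespace() returns TRUE if a package is installed and loadable
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need installing (may be empty)
to_install <- wanted[!is_installed]
to_install

# Uncomment to install whatever is missing:
# install.packages(to_install)
```

The install line is left commented out so that nothing is downloaded unless you decide to.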
Once the packages are installed, we can load them in the current R session with the library() function:
library(tidyverse)
library(visdat)
library(plotly)
You should also set your working directory to the folder containing this lesson’s materials. This will vary depending on your operating system.
setwd("~/Desktop/slcu_r_course/module02_data_viz_ggplot/materials")
In this example, our working directory was set to the module02_data_viz_ggplot/materials folder.
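A quick way to confirm that the working directory is what you expect is to ask R directly. This is a minimal sketch using base R functions only; the "../data" path assumes the folder layout used in this lesson.

```r
# Print the folder R is currently working in
getwd()

# List the files R can see in that folder
list.files()

# Check that the data folder used later in this lesson is reachable
# (TRUE only if the working directory was set as described above)
dir.exists("../data")
```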
Now our workspace is ready. The only thing missing is the actual data set that we want to explore and visualize!
For this lesson we will use a slightly simplified version of a dataset published by Burghardt et al. 2015.
The simplified version of the data set can be found in the module02_data_viz_ggplot/data folder (burghardt_et_al_2015_expt1.csv).
Because our working directory is the module02_data_viz_ggplot/materials folder, we read the data like so:
# Read data and store it in expt1 object
expt1 <- read_csv("../data/burghardt_et_al_2015_expt1.csv")
## Parsed with column specification:
## cols(
## genotype = col_character(),
## background = col_character(),
## temperature = col_integer(),
## fluctuation = col_character(),
## day.length = col_integer(),
## vernalization = col_character(),
## survival.bolt = col_character(),
## bolt = col_character(),
## days.to.bolt = col_integer(),
## days.to.flower = col_integer(),
## rosette.leaf.num = col_integer(),
## cauline.leaf.num = col_integer(),
## blade.length.mm = col_double(),
## total.leaf.length.mm = col_double(),
## blade.ratio = col_double()
## )
You will notice the read_csv() function gives you a message referring to a “column specification”. This refers to the type of data it thinks each of your columns contains.
In this case, some columns contain “character”-type data (i.e. text) and others contain “numeric”-type data (which can be “integer” if they hold whole numbers, or “double” if they have decimal places).
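The guessed types are not always the ones you want, and the guesses may differ between versions of readr (for instance, newer versions may report whole-number columns as double rather than integer). A minimal sketch of declaring the types yourself with the col_types argument follows. The toy file and the object names tmp and toy are invented for this illustration, and only three of the real columns are mimicked.

```r
library(readr)

# Write a tiny CSV file to a temporary location (toy data, not the real set)
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("genotype,temperature,blade.ratio",
    "Col Ama,12,0.61",
    "Col Ama,22,0.55"),
  tmp
)

# Declare the type of each column instead of letting read_csv() guess
toy <- read_csv(
  tmp,
  col_types = cols(
    genotype    = col_character(),
    temperature = col_integer(),
    blade.ratio = col_double()
  )
)

# Check the type of each column
sapply(toy, class)
```

Declaring the types also silences the column specification message, because nothing is left to guess.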
To get a glimpse of the data, we can type the name of the variable we have stored it in (expt1).
expt1
## # A tibble: 957 x 15
## genotype background temperature fluctuation day.length vernalization
## <chr> <chr> <int> <chr> <int> <chr>
## 1 Col Ama Col 12 Con 16 NV
## 2 Col Ama Col 12 Con 16 NV
## 3 Col Ama Col 12 Con 16 NV
## 4 Col Ama Col 12 Con 16 NV
## 5 Col Ama Col 12 Con 16 NV
## 6 Col Ama Col 12 Con 16 NV
## 7 Col Ama Col 12 Con 16 NV
## 8 Col Ama Col 12 Con 16 NV
## 9 Col Ama Col 12 Con 8 NV
## 10 Col Ama Col 12 Con 8 NV
## # ... with 947 more rows, and 9 more variables: survival.bolt <chr>,
## # bolt <chr>, days.to.bolt <int>, days.to.flower <int>,
## # rosette.leaf.num <int>, cauline.leaf.num <int>, blade.length.mm <dbl>,
## # total.leaf.length.mm <dbl>, blade.ratio <dbl>
This shows you the first 10 rows of the data and only the columns that fit on the screen.
Challenge: How many rows and columns does our data set contain?
There are other ways to visually inspect your data:
With the View() function you can access an interactive table, where you can sort, filter and search your data set with keywords (this does not modify your original variable):
View(expt1)
With the glimpse() function you can see the structure of your data:
glimpse(expt1)
## Observations: 957
## Variables: 15
## $ genotype <chr> "Col Ama", "Col Ama", "Col Ama", "Col Ama...
## $ background <chr> "Col", "Col", "Col", "Col", "Col", "Col",...
## $ temperature <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1...
## $ fluctuation <chr> "Con", "Con", "Con", "Con", "Con", "Con",...
## $ day.length <int> 16, 16, 16, 16, 16, 16, 16, 16, 8, 8, 8, ...
## $ vernalization <chr> "NV", "NV", "NV", "NV", "NV", "NV", "NV",...
## $ survival.bolt <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ bolt <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "...
## $ days.to.bolt <int> 28, 29, 31, 31, 32, 33, 34, 35, 69, 72, 7...
## $ days.to.flower <int> 43, 44, 43, 42, 44, 47, 47, 49, 90, 91, 9...
## $ rosette.leaf.num <int> 18, 15, 13, 17, 19, 14, 15, 18, 53, 49, 5...
## $ cauline.leaf.num <int> 6, 5, 4, 5, 4, 4, 3, 5, 6, 5, 6, 9, 6, 9,...
## $ blade.length.mm <dbl> 12.9, 10.5, 13.2, 14.6, 13.3, 14.7, 13.0,...
## $ total.leaf.length.mm <dbl> 21.1, 19.1, 23.4, 27.2, 20.4, 25.3, 23.2,...
## $ blade.ratio <dbl> 0.6113744, 0.5497382, 0.5641026, 0.536764...
Challenge: What types of variables do we have in our data set? What could they mean? What type of variable is “bolt”? How about “temperature”?
The dim() function returns the basic dimensions of your data set, i.e. the number of rows and columns:
dim(expt1)
## [1] 957 15
The summary() function generates basic statistics for each numeric column:
summary(expt1)
## genotype background temperature fluctuation
## Length:957 Length:957 Min. :12.00 Length:957
## Class :character Class :character 1st Qu.:12.00 Class :character
## Mode :character Mode :character Median :12.00 Mode :character
## Mean :16.98
## 3rd Qu.:22.00
## Max. :22.00
##
## day.length vernalization survival.bolt bolt
## Min. : 8.00 Length:957 Length:957 Length:957
## 1st Qu.: 8.00 Class :character Class :character Class :character
## Median :16.00 Mode :character Mode :character Mode :character
## Mean :12.01
## 3rd Qu.:16.00
## Max. :16.00
##
## days.to.bolt days.to.flower rosette.leaf.num cauline.leaf.num
## Min. : 15.00 Min. : 21.00 Min. : 5.00 Min. : 1.000
## 1st Qu.: 38.00 1st Qu.: 46.00 1st Qu.: 24.00 1st Qu.: 5.000
## Median : 57.00 Median : 66.00 Median : 40.00 Median : 8.000
## Mean : 66.04 Mean : 71.59 Mean : 39.71 Mean : 7.208
## 3rd Qu.: 85.00 3rd Qu.: 92.00 3rd Qu.: 53.00 3rd Qu.: 9.000
## Max. :162.00 Max. :182.00 Max. :112.00 Max. :17.000
## NA's :83 NA's :95 NA's :96
## blade.length.mm total.leaf.length.mm blade.ratio
## Min. : 7.10 Min. : 9.00 Min. :0.0000
## 1st Qu.:18.00 1st Qu.:29.10 1st Qu.:0.5564
## Median :20.95 Median :34.60 Median :0.5948
## Mean :21.11 Mean :34.69 Mean :0.5874
## 3rd Qu.:24.30 3rd Qu.:40.27 3rd Qu.:0.6342
## Max. :59.00 Max. :66.30 Max. :6.5556
## NA's :327 NA's :303 NA's :304
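The NA's rows in the summary() output count the missing values in each numeric column. If that count is all you need, it can also be computed directly. The sketch below uses a made-up data frame (toy) so that it runs on its own; with the real data you would pass expt1 instead.

```r
# Toy data with some missing values (NA); the values are invented
toy <- data.frame(
  genotype        = c("Col Ama", "Col Ama", "Ler"),
  days.to.bolt    = c(28, NA, 31),
  blade.length.mm = c(12.9, NA, NA)
)

# is.na() marks each missing cell as TRUE; colSums() counts them per column
na_counts <- colSums(is.na(toy))
na_counts
```

Unlike summary(), this also reports missing values in character columns.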
So far, we have already used a handful of R functions, though we have just barely started:
install.packages()
library()
read_csv()
View()
glimpse()
summary()
dim()
Of course, it is difficult to memorize all the function names, what they do and how you should use them. Luckily, R has very convenient built-in help. To use it, type the name of a function or any other object you are interested in, preceded by ?:
?summary
R help might seem cryptic at first, but you will get used to it. You can always scroll down to the ‘examples’ section and try running some of them yourself to get an idea of what the function in question is capable of.
A web search is often just as effective (if not more so!) as looking at the help pages.
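Besides the ? shortcut, base R offers a few related helpers. The sketch below shows three of them; the functions chosen (summary, read.csv and mean) are just convenient examples.

```r
# Open the help page for summary(); equivalent to typing ?summary
help("summary")

# Show only the arguments a function accepts, without the full help page
args(read.csv)

# Run the code from the 'Examples' section of a help page
example("mean")
```

example() is a quick way to follow the advice about trying the examples without copying and pasting them.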
Challenge: What does the round() function do?
Challenge: Can you specifically look at the end of your data instead of its beginning? How would you do it in R? (hint: ?tail)
To get a “bird's-eye” view of the data and identify its structure and potential problems, we can simply plot it in Rothko style using the vis_dat() function.
vis_dat(expt1)
Challenge: What is the most common data type in our data set? Can you spot any potential problems?
You might have noticed that some of the cells are plotted in grey; these indicate missing values. Missing values occur when a certain observation was not collected, and they can cause problems in the downstream analysis if we are not careful. There are several strategies for dealing with missing values.
To stay on the safe side for now, we will simply remove all the rows with missing values from our beloved expt1 data set.
expt1 <- drop_na(expt1)
Challenge: How many rows are left in the data set after we have dropped missing values?
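Called with only a data set, drop_na() removes every row that has a missing value in any column, which can discard many observations. It also accepts column names, so that only missing values in those columns count. The sketch below illustrates the difference on a made-up data frame (toy); the object names are hypothetical.

```r
library(tidyr)

# Toy data: row 2 lacks days.to.bolt, row 3 lacks blade.length.mm
toy <- data.frame(
  plant           = 1:4,
  days.to.bolt    = c(28, NA, 31, 35),
  blade.length.mm = c(12.9, 10.5, NA, 14.6)
)

# Drop every row that has a missing value in any column
complete_rows <- drop_na(toy)
nrow(complete_rows)

# Drop rows only when days.to.bolt is missing
bolted_rows <- drop_na(toy, days.to.bolt)
nrow(bolted_rows)
```

Here the first call keeps 2 rows and the second keeps 3, because the plant with a missing blade length is retained when only days.to.bolt is checked.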
Now that we have learned some basics of our data set, we will go straight to plotting to get even more insights about this experiment.
For this we will be using the ggplot2 package, which follows a general scheme termed the “grammar of graphics”. “Grammar of graphics” might sound scary, but just think of it as a set of simple building blocks for a plot. By combining and layering several blocks we can create our dream plot for a dream paper or for a lab meeting.
To build a graph we need several blocks:
Let’s focus on the first three: data, aesthetics and geometric object.
Geometric objects are added with a geom_object. Examples include:
geom_point (for scatter plots, dot plots)
geom_line (for trend lines, time series)
Aesthetics are mapped to variables with the aes() function. Note that different geom_ objects understand only a subset of aesthetics. For details, check their respective help (e.g. ?geom_line).
You can find more information about how to build graphs with ggplot2 in this very useful cheatsheet.
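To make the three building blocks concrete, here is a minimal sketch on a made-up data frame (toy, with hypothetical columns day and height). It is not part of the lesson's data; it only shows where the data, the aesthetics and the geometric objects go.

```r
library(ggplot2)

# Data: a small invented table
toy <- data.frame(
  day    = c(1, 2, 3, 4),
  height = c(2.0, 4.1, 3.2, 6.0)
)

# Data and aesthetics go into ggplot(); each geom_ adds one layer
p <- ggplot(data = toy, mapping = aes(x = day, y = height)) +
  geom_line() +   # first geometric object: a line through the values
  geom_point()    # second geometric object: a point for each value

p
```

Both layers use the same x and y aesthetics because these were defined once in ggplot().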
Everyone (except Excel) likes boxplots, so we will start by plotting the days.to.flower variable measured for different genotypes.
The ggplot() function initialises a plot. At the very minimum it needs a dataset to plot:
ggplot(expt1)
But this simply produces a blank (well, grey) canvas!
We haven’t told ggplot what aesthetics (this is ggplot2 terminology) we want it to map onto this blank canvas. For a boxplot we need to tell it what our x and y variables are.
ggplot(expt1, aes(x = genotype, y = days.to.flower))
As you can see, ggplot “mapped” the values in the genotype and days.to.flower variables of our table to the x and y aesthetics of the plot.
But this is still quite an empty plot, because we haven’t told ggplot what geometries we want it to draw on the canvas. In our case, we want a boxplot, which we can add on top of the created canvas by adding (literally +) a geom_boxplot():
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot()
Challenge: Can you make a violin plot instead? (hint: ?geom_violin)
Let’s now layer a couple of geom_objects on the same plot. Say we want to have points for the individual values together with our boxplots:
ggplot(expt1, aes(genotype, rosette.leaf.num)) +
geom_jitter() +
geom_boxplot()
Challenge: can you modify this plot so that the points appear on top of the boxplots rather than behind them?
We can also modify the appearance of our geometry, for example its colour:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot(colour = "red")
Or perhaps the colour that fills the boxplots:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot(colour = "red", fill = "royalblue")
Or even its transparency:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot(colour = "red", fill = "royalblue", alpha = 0.5)
This is all very colourful, but rather gratuitous (what is this colour telling us about the data?!).
What if we wanted to colour our boxplots according to which fluctuation treatment the plants were exposed to? In ggplot2 language, we want to “map” the values of fluctuation onto the colour aesthetic of our plot. This should therefore go inside the aes() part of our graph:
ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
geom_boxplot()
Wow! Can you see what ggplot did for you!? It automatically split the data of each genotype into two groups and coloured them accordingly.
Now, let’s say we wanted to visualise the individual data points (not coloured) behind our boxplots (coloured by fluctuation):
ggplot(expt1, aes(genotype, days.to.flower, colour = fluctuation)) +
geom_jitter() +
geom_boxplot(alpha = 0.5)
As it is, the colour aesthetic is mapped to all geometries of the graph. This is because we defined it within the ggplot() function, which affects every geom_object that comes afterwards.
But we can also define aesthetics inside each geometry, for example:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_jitter() +
geom_boxplot(aes(fill = fluctuation), alpha = 0.5)
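The difference between mapping a variable to an aesthetic (inside aes()) and setting an aesthetic to a fixed value (outside aes()) is easy to mix up. The sketch below contrasts the two on a made-up data frame; the column names and the treatment labels are invented for this illustration.

```r
library(ggplot2)

# Toy data with an invented two-level treatment
toy <- data.frame(
  treatment = c("Con", "Con", "Var", "Var"),
  value     = c(1, 2, 3, 4)
)

# Mapping: colour depends on a variable, so it goes inside aes()
p_mapped <- ggplot(toy, aes(treatment, value)) +
  geom_point(aes(colour = treatment))

# Setting: colour is one fixed value, so it goes outside aes()
p_fixed <- ggplot(toy, aes(treatment, value)) +
  geom_point(colour = "red")

p_mapped
p_fixed
```

In the first plot ggplot2 picks one colour per treatment and draws a legend; in the second every point is red and no legend is needed.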
Challenge: Say we are particularly interested in the relationship between the number of rosette leaves and blade length in mm per genotype.
Visualize this relationship with a scatter plot (geom_point()) between blade.length.mm and rosette.leaf.num and colour the points by genotype.
What happens if you colour the points by days.to.bolt?
Often, our data has several grouping variables, and colours alone are not enough to fully represent the differences in the dataset.
For example, the scatterplot produced in the previous exercise is pretty, but very crowded. What if we wanted to isolate each genotype in individual plots?
This is easy to accomplish with ggplot2 by adding a “facet” layer to our plot. There are two types of facets:
facet_grid() - arranges sub-plots in rows and/or columns
facet_wrap() - arranges sub-plots in a ribbon that “wraps” around after a fixed number of plots
Let’s start with facet_grid() and see it in action:
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = genotype)) +
geom_point() +
facet_grid(genotype ~ temperature)
In the code above, we use facet_grid() to define variables that partition our data by rows and columns, using the notation (rows ~ columns).
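Recent versions of ggplot2 also accept the row and column variables through the rows and cols arguments wrapped in vars(), which some people find easier to read than the formula. The sketch below builds the same faceting both ways on a made-up data frame; all object names and values are invented.

```r
library(ggplot2)

# Toy data: every combination of 2 genotypes, 2 temperatures and 3 x values
toy <- expand.grid(
  genotype    = c("A", "B"),
  temperature = c(12, 22),
  x           = 1:3
)
toy$y <- toy$x * 2

base_plot <- ggplot(toy, aes(x, y)) +
  geom_point()

# Formula notation: rows ~ columns
p_formula <- base_plot + facet_grid(genotype ~ temperature)

# Equivalent notation with explicit rows and cols arguments
p_vars <- base_plot + facet_grid(rows = vars(genotype), cols = vars(temperature))

p_formula
p_vars
```

Both plots have one row of panels per genotype and one column per temperature.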
Challenge: In the previous graph, colouring by genotype is redundant with the faceting. Can you think of a more useful way to colour the points?
It is possible to use facet_grid() with a single variable:
# Facet by rows
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_grid(genotype ~ .)
# Facet by columns
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_grid(. ~ genotype)
When we are only partitioning by one variable, facet_wrap() often produces a better display. For example:
ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_wrap( ~ genotype)
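facet_wrap() has a few arguments that control the layout. Two commonly used ones are ncol, which fixes the number of columns before the ribbon wraps, and scales, which lets each panel have its own axis range. The sketch below uses a made-up data frame whose groups differ a lot in magnitude; the names and values are invented.

```r
library(ggplot2)

# Toy data: three groups with very different y ranges
toy <- data.frame(
  genotype = rep(c("A", "B", "C"), each = 3),
  x        = rep(1:3, times = 3),
  y        = c(1, 2, 3, 10, 20, 30, 100, 200, 300)
)

# Wrap after 2 columns and give every panel its own y axis
p_wrap <- ggplot(toy, aes(x, y)) +
  geom_point() +
  facet_wrap(~ genotype, ncol = 2, scales = "free_y")

p_wrap
```

Free scales make small groups readable, but they also make panels harder to compare directly, so they are a trade-off rather than a default choice.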
Challenge: Can you modify the previous graph to facet the data by the fluctuation treatment (as rows) and day.length (as columns) and colour the points by genotype?
In conclusion, by effectively combining facets, colours and other aesthetics you can represent many dimensions of your data in a single graph!
Challenge: Can you produce a graph similar to this one? [Figure: target plot]
Hint: facet the plot by day.length and temperature and fill the boxplots by fluctuation.
But even this is not the limit. We can easily turn our plots into interactive ones using the plotly package.
First we store our plot in a variable and then pass it to the special ggplotly() function.
# Store plot in a variable called p1
p1 <- ggplot(expt1, aes(blade.length.mm, rosette.leaf.num, colour = fluctuation)) +
geom_point() +
facet_wrap(~genotype)
# Render an interactive plot using ggplotly function
ggplotly(p1)
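By default the tooltip of an interactive plot lists the mapped aesthetics. One way to control it is the tooltip argument of ggplotly(), combined with an extra text aesthetic that holds a label for each point. The sketch below is a minimal illustration on a made-up data frame; the column plant and the object names are hypothetical.

```r
library(ggplot2)
library(plotly)

# Toy data with an invented label for each point
toy <- data.frame(
  x     = c(1, 2, 3),
  y     = c(2, 4, 3),
  plant = c("plant 1", "plant 2", "plant 3")
)

# The text aesthetic is not drawn by geom_point(); it only carries the label
p_toy <- ggplot(toy, aes(x, y, text = plant)) +
  geom_point()

# Show only the label and the y value when hovering over a point
ip <- ggplotly(p_toy, tooltip = c("text", "y"))
ip
```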
Every element of a ggplot is modifiable. This is outside the scope of this module, but here are a few examples and references.
Themes modify the overall appearance of the plot. Some come with ggplot2 and many others can be obtained from other packages such as ggthemes (which also has some additional geom objects).
# Example of built-in ggplot2 themes
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme_bw() +
labs(title = "Black and white theme")
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme_classic() +
labs(title = "Classic theme")
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme_minimal() +
labs(title = "Minimal theme")
The theme() function is used to modify individual elements of the plot. The possibilities are so vast that the easiest way is to do a web search for your intended purpose.
For example, a web-search for “vertical labels x axis ggplot2” returns as one of the first hits this solution:
ggplot(expt1, aes(genotype, days.to.flower)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
Or searching for “altering plot colours ggplot2” returns this page, which somewhere gives an interesting solution:
ggplot(expt1, aes(genotype, days.to.flower, fill = fluctuation)) +
geom_boxplot() +
scale_fill_brewer(palette="Dark2")
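One detail worth knowing when combining a built-in theme with theme() tweaks is that the order matters: a complete theme such as theme_bw() replaces any theme settings added before it. The sketch below shows both orders on a made-up data frame; the object names and values are invented.

```r
library(ggplot2)

# Toy data with long-ish labels on the x axis
toy <- data.frame(
  genotype = c("genotype A", "genotype B", "genotype C"),
  value    = c(3, 5, 4)
)

base_plot <- ggplot(toy, aes(genotype, value)) +
  geom_point()

# Complete theme first, then the tweak: the x labels are rotated
p_rotated <- base_plot +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Tweak first, then the complete theme: theme_bw() undoes the rotation
p_reset <- base_plot +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  theme_bw()

p_rotated
p_reset
```

A simple habit that avoids the problem is to add the complete theme right after the geoms and any theme() adjustments at the very end.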
Based on the principles outlined in this module, try to build a graph of your own dataset using ggplot2.
If you encounter any difficulties, we will discuss them in the next module!
Some other packages that add functionality to ggplot2: