Lesson objectives

This lesson complements the Data Carpentry lessons by providing an earlier introduction to ggplot2. The lesson is to be taught after Starting With Data but before the Data Manipulation lesson.

  • Identify the basic parts needed to build a graph with ggplot2.
    • defining data, aesthetics and geometries for a basic graph.
  • Distinguish when to use or not to use aes() to change a graph’s aesthetics (e.g. colours, shapes).
  • Apply overlaying of multiple geometries on the same graph and define aesthetics separately for each.
  • Adjust and customise scales and labels in the graph.


Make sure you have loaded the surveys data:

library(tidyverse)  # provides read_csv(), drop_na() and ggplot2
surveys <- read_csv("data/portal_data_joined.csv")

For this lesson we’re going to remove missing values from our table. This is not always desirable (you might be throwing away good data!), but it will help us make this lesson clearer.

We will use the function drop_na(), which removes rows with any missing data:

# remove rows with missing data in any of the columns
surveys_nomiss <- drop_na(surveys)
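
To see what drop_na() does, here is a minimal sketch on a small made-up table. The toy tibble and its values are hypothetical and are not part of the surveys data. It also shows that drop_na() can be restricted to chosen columns, which throws away less data when only some variables matter.

```r
library(tidyr)
library(tibble)

# hypothetical toy table with missing values in two different columns
toy <- tibble(
  species = c("A", "B", "C", "D"),
  weight  = c(10, NA, 30, 40),
  sex     = c("F", "M", NA, "M")
)

# keep only the rows that are complete in every column
toy_nomiss <- drop_na(toy)
nrow(toy_nomiss)

# only check the `weight` column for missing values
toy_weight <- drop_na(toy, weight)
nrow(toy_weight)
```

Comparing the two row counts with nrow(toy) gives a quick sense of how much data each choice removes.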

Building a ggplot2 graph

To build a ggplot2 graph you need 3 basic pieces of information:

  • A data.frame with data to be plotted
  • The variables (columns of data.frame) that will be mapped to different aesthetics of the graph (e.g. axes, colours, shapes, etc.)
  • The geometry that will be drawn on the graph (e.g. points, lines, boxplots, violinplots, etc.)

This translates into the following basic syntax:

ggplot(data = <data.frame>, 
       mapping = aes(x = <column of data.frame>, y = <column of data.frame>)) +
   geom_<type of geometry>()

For our first visualisation, let’s make a scatterplot showing the relationship between weight and hindfoot_length. Let’s do it step-by-step to see how ggplot2 works.

Start by giving data to ggplot:

ggplot(data = surveys_nomiss)

OK, that “worked” (as in, we didn’t get an error). But because we didn’t give ggplot() any variables to be mapped to aesthetic components of the graph, we just got an empty square.

To map columns to aesthetics, we use the aes() function:

ggplot(data = surveys_nomiss, 
       mapping = aes(x = weight, y = hindfoot_length))

That’s better, now we have some axes. Notice how ggplot() defines the axes based on the range of the data given. But it’s still not a very interesting graph, because we didn’t tell ggplot() what we want to draw on it.

This is done by adding (literally +) geometries to our graph:

ggplot(data = surveys_nomiss, 
       mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point()


  • Modify the graph above by plotting a density hexagon plot (geom_hex())
  • From this graph, we can see that there are different groups of observations. Either by yourself or with the person next to you discuss why this might be and how you would like to change the graph to investigate it.

Changing how geometries look

We can change how geometries look in several ways, for example their transparency, colour, shape, etc.

To know which aesthetic components can be changed in a particular geometry, look at its help (e.g. ?geom_point) and look under the “Aesthetics” section of the help page.

For example, because the points in the above graph are quite densely packed, we can change the transparency of the points in our scatterplot using alpha (alpha varies between 0 and 1, with 0 being fully transparent and 1 being fully opaque):

ggplot(data = surveys_nomiss, 
       mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.1)

With this transparency we can see which areas of the graph are more densely occupied with points.


Try changing the size, shape and colour of the points (hint: web search “ggplot2 point shapes” to see how to make a triangle)

Changing aesthetics based on data

In the above exercise we changed the colour of the points by defining it ourselves. However, it would be better if we coloured the points based on a variable of interest.

For example, it’s likely that the clustering of points in our scatterplot is due to differences between genera.

To colour the points by genus, we map the genus column to the colour aesthetic inside the aes() function:

ggplot(data = surveys_nomiss, 
       mapping = aes(x = weight, y = hindfoot_length, colour = genus)) +
  geom_point()

This illustrates an important distinction between aesthetics defined inside or outside of aes():

  • if you want the aesthetic to change based on the data it goes inside aes()
  • if you want to manually specify how the geometry should look, it goes outside aes()
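
A common slip is to put a fixed colour inside aes(). The sketch below, which uses a hypothetical three-row toy table rather than the surveys data, compares the two placements. It uses layer_data() from ggplot2 to inspect the colour values that were actually computed for the points.

```r
library(ggplot2)

# hypothetical toy data, for illustration only
toy <- data.frame(weight = c(10, 20, 30), hindfoot_length = c(15, 25, 35))

# outside aes(): "blue" is taken literally, so every point is drawn in blue
p_outside <- ggplot(toy, aes(x = weight, y = hindfoot_length)) +
  geom_point(colour = "blue")

# inside aes(): "blue" is treated as data (a category with a single value),
# so ggplot2 picks a colour from its default palette and adds a legend
p_inside <- ggplot(toy, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(colour = "blue"))

# layer_data() returns the values ggplot2 computed for a layer
unique(layer_data(p_outside)$colour)
unique(layer_data(p_inside)$colour)
```

If your points come out in an unexpected colour together with an odd one-entry legend, this is usually the reason.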


Make a boxplot that shows the distribution of weight (y-axis) for each genus (x-axis). (hint: geom_boxplot())

Bonus: Colour the inside of the boxplots by sex

Multiple geometries

Often, we may want to overlay several geometries on top of each other. For example, add a violin plot together with a boxplot so that we get both representations of the data in a single graph.

Let’s start by making a violin plot:

# scale the violins by "width" rather than "area", which is the default
ggplot(surveys_nomiss, aes(x = genus, y = weight)) +
  geom_violin(scale = "width")

To layer a boxplot on top of it we “add” (with +) another geometry to the graph:

# Make boxplots thinner so the shape of the violins is visible
ggplot(surveys_nomiss, aes(x = genus, y = weight)) +
  geom_violin(scale = "width") +
  geom_boxplot(width = 0.2)

The order in which you add the geometries defines the order in which they are “drawn” on the graph. For example, try swapping their order and see what happens (bonus: try adding transparency to the violin plot).

Notice how we’ve shortened our code by omitting the names of the options data = and mapping = inside ggplot(). Because the data is always the first thing given to ggplot() and the mapping is always identified by the function aes(), this is often written in the more compact form as we just did.
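
As a quick check that the long and compact forms describe the same graph, the sketch below builds both on a hypothetical toy table and compares the data computed for the layer. The object names p_long and p_short are made up for this illustration.

```r
library(ggplot2)

# hypothetical toy data, for illustration only
toy <- data.frame(genus = c("A", "A", "B", "B"), weight = c(10, 12, 30, 35))

# fully named arguments
p_long <- ggplot(data = toy, mapping = aes(x = genus, y = weight)) +
  geom_point()

# compact form: data and mapping are matched by position
p_short <- ggplot(toy, aes(x = genus, y = weight)) +
  geom_point()

# compare what ggplot2 computed for the first layer of each graph
same_layer <- isTRUE(all.equal(layer_data(p_long), layer_data(p_short)))
same_layer
```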

Controlling aesthetics in individual geometries

Let’s say that, in the graph above, we wanted to colour the violins by genus, but keep the boxplots without colour.

As we’ve learned, because we want to colour our geometries based on data, this goes inside the aes() part of the graph:

# use the `fill` aesthetic, which colours the **inside** of the geometry
ggplot(surveys_nomiss, aes(x = genus, y = weight, fill = genus)) +
  geom_violin(scale = "width") +
  geom_boxplot(width = 0.2)

OK, this is not what we wanted. Both geometries (boxplots and violins) got coloured.

It turns out that we can control aesthetics individually in each geometry, using the aes() function. Like this:

ggplot(surveys_nomiss, aes(x = genus, y = weight)) +
  geom_violin(aes(fill = genus), scale = "width") +
  geom_boxplot(width = 0.2)


Modify the graph above by colouring the inside of the boxplots by genus and the inside of the violins in grey.


You can split your plot into multiple panels by using faceting. There are two types of facet functions:

  • facet_wrap() arranges a one-dimensional sequence of panels to fit on one page.
  • facet_grid() allows you to form a matrix of rows and columns of panels.

Both functions let you specify the faceting variables with vars(). In general: facet_wrap(facets = vars(facet_variable)) or facet_grid(rows = vars(row_variable), cols = vars(col_variable)).

For example, if we want to visualise the scatterplot above split by genus:

ggplot(surveys_nomiss, aes(hindfoot_length, weight)) +
  geom_point() +
  facet_wrap(facets = vars(genus))

If instead we want a matrix of facets to display genus and sex, then we use facet_grid():

ggplot(surveys_nomiss, aes(hindfoot_length, weight)) +
  geom_point() +
  facet_grid(rows = vars(sex), cols = vars(genus))

Finally, with facet_grid(), you can organise the panels just by rows or just by columns. Try running this code yourself:

# One column, facet by rows
ggplot(data = surveys_nomiss, 
       mapping = aes(x = hindfoot_length, y = weight)) +
  geom_point() +
  facet_grid(rows = vars(genus))

# One row, facet by column
ggplot(data = surveys_nomiss, 
       mapping = aes(x = hindfoot_length, y = weight)) +
  geom_point() +
  facet_grid(cols = vars(genus))

Modifying scales

Often you want to change how the scales of your plot are defined. In ggplot2 scales can refer to the x and y aesthetics, but also to other aesthetics such as colour, shape, fill, etc.

We modify scales using the scale family of functions. These functions follow the naming convention scale_<aesthetic>_<type>, where:

  • <aesthetic> refers to the aesthetic for that scale function (e.g. x, y, colour, fill, shape, etc.)
  • <type> refers to the type of scale (e.g. discrete, continuous, manual)

Let’s see some examples.

Change a continuous axis scale

Taking the graph from the previous exercise we modify the y-axis scale to emphasise the lower weights of our animals:

ggplot(surveys_nomiss, aes(x = genus, y = weight)) +
  geom_violin(scale = "width", fill = "grey") +
  geom_boxplot(width = 0.2, aes(fill = genus)) +
  scale_y_continuous(limits = c(0, 100))
## Warning: Removed 2618 rows containing non-finite values (stat_ydensity).
## Warning: Removed 2618 rows containing non-finite values (stat_boxplot).
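
The warnings appear because setting limits on a scale discards the values outside them before the violins and boxplots are computed, so the summaries themselves change. If you only want to zoom in, coord_cartesian() is an alternative that keeps all the data. The sketch below illustrates the difference on a hypothetical toy table with one made-up genus.

```r
library(ggplot2)

# hypothetical toy data: one genus, with two heavy animals above 100
toy <- data.frame(genus = "A", weight = c(10, 20, 30, 200, 300))

# scale limits: values outside 0-100 are dropped before the boxplot is computed
p_scale <- ggplot(toy, aes(x = genus, y = weight)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 100))

# coord_cartesian(): only the view is zoomed, all five values are still used
p_zoom <- ggplot(toy, aes(x = genus, y = weight)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(0, 100))

# median drawn in each case (the first call repeats the "Removed rows" warning)
median_scale <- layer_data(p_scale)$middle
median_zoom <- layer_data(p_zoom)$middle
median_scale
median_zoom
```

Which behaviour is appropriate depends on whether the values outside the limits should count towards the summaries shown.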

Change a discrete axis scale

Our x-axis is discrete (data are categorical). Let’s, for example, limit which categories are shown and in which order:

ggplot(surveys_nomiss, aes(x = genus, y = weight)) +
  geom_violin(scale = "width", fill = "grey") +
  geom_boxplot(width = 0.2, aes(fill = genus)) +
  scale_x_discrete(limits = c("Sigmodon", "Dipodomys", "Baiomys"))
## Warning: Removed 15885 rows containing non-finite values (stat_ydensity).
## Warning: Removed 15885 rows containing missing values (stat_boxplot).

Change categorical colour/fill scales

Taking the previous plot, let’s change the fill scale to define custom colours “manually”.

ggplot(surveys_nomiss, aes(x = genus, y = weight)) +
  geom_violin(scale = "width", fill = "grey") +
  geom_boxplot(width = 0.2, aes(fill = genus)) +
  scale_x_discrete(limits = c("Sigmodon", "Dipodomys", "Baiomys")) +
  scale_fill_manual(values = c("Sigmodon" = "purple", 
                               "Dipodomys" = "brown", 
                               "Baiomys" = "steelblue"))
## Warning: Removed 15885 rows containing non-finite values (stat_ydensity).
## Warning: Removed 15885 rows containing missing values (stat_boxplot).

For colour/fill scales there’s a very convenient variant of the scale function (“brewer”) that has some pre-defined palettes:

ggplot(surveys_nomiss, aes(x = genus, y = weight)) +
  geom_violin(scale = "width", fill = "grey") +
  geom_boxplot(width = 0.2, aes(fill = genus)) +
  scale_fill_brewer(palette = "Set1")

You can see all the available palettes here. Note that some palettes only have a limited number of colours and ggplot will give a warning if it has fewer colours available than categories in the data.
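
To check how many colours a palette offers before using it, one option is the brewer.pal.info table from the RColorBrewer package. This sketch assumes RColorBrewer is installed; the variable name set1_max is made up for the illustration.

```r
library(RColorBrewer)

# brewer.pal.info is a data frame with one row per palette
brewer.pal.info["Set1", ]

# maximum number of colours the "Set1" palette provides
set1_max <- brewer.pal.info["Set1", "maxcolors"]
set1_max
```

Comparing this number with the number of categories in your data, for example length(unique(surveys_nomiss$genus)), tells you in advance whether the palette is large enough. Running display.brewer.all() from the same package draws every palette in the plotting window.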

Change continuous fill/colour scales

Let’s get back to our scatterplot and colour the points by year of collection:

ggplot(surveys_nomiss, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(colour = year))

Because year is a continuous variable, ggplot created a gradient colour scale.

We can change the default:

ggplot(surveys_nomiss, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(colour = year)) +
  scale_colour_gradient(low = "grey", high = "brown")

For continuous colour scales we can use the viridis palette, which was designed to be colour-blind friendly and perceptually uniform:

ggplot(surveys_nomiss, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(colour = year)) +
  scale_colour_viridis_c()

Key points

  • To build a ggplot2 graph you need to provide data, aesthetics and geometries.
  • If you want to change an aesthetic of your graph based on data, include it inside aes().
  • If you want to manually change an aesthetic regardless of data then it goes outside aes().
  • You can overlay multiple geometries in the same graph, and control their aesthetics individually.
  • You can adjust the scales of your graph using the scale_* family of functions.

Although we did not cover it here, make sure to choose a visualisation that is suitable for your data and question. See the data-to-viz website for great examples and advice!