back to course home

Advanced data manipulation with R

In this module we are going to see how to do more advanced data manipulation using dplyr, which is also a part of the tidyverse package family.

We will learn how to:

  • Combine (join) data sets
  • Gather columns into rows (from wide to long format)
  • Spread rows into columns (from long to wide format)
  • Order rows by values of a column

First let’s load the tidyverse package:

library(tidyverse)
library(forcats)

We will use example datasets that come with R (no need to read it from anywhere).

Combining datasets (*_join())

Often when analysing data we have many tables containing different types of related information. So, to answer a particular question it is a common task to combine multiple tables by a certain column or columns. The dplyr package contains a family of *_join functions that allow us to this.

All variations of the *_join() functions need three things:

  • the name of the first (left) data set
  • the name of the second (right) data set
  • the name of the column to use in order to combine both data sets - more than one column can be used for joining.

Let’s first have a look at the full_join() function, it will return all rows and all columns from both datasets. We will use 2 built-in datasets to illustrate its work:

  • band_members
band_members
## # A tibble: 3 x 2
##   name  band   
##   <chr> <chr>  
## 1 Mick  Stones 
## 2 John  Beatles
## 3 Paul  Beatles
  • band_instruments
band_instruments
## # A tibble: 3 x 2
##   name  plays 
##   <chr> <chr> 
## 1 John  guitar
## 2 Paul  bass  
## 3 Keith guitar

Say, we want to combine data form this two tables into one, so that we have information about name, band and instrument in one place:

band_members %>% full_join(band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

The alternative syntax, without pipes would be:

full_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

You can see that people with no information about their band or instrument have ’NA’s now in the respective column. Here we did not have to specify the column we wanted to join by because R was very smart and picked the only possible option - joining by the name column. Though in more complex scenarious, when there are more than 1 columns with the same name between 2 datasets - you should specify which one you would like to join by (or both).

Let’s look at other variations of the *_join() functions:

  • To keep only data from the band_members data set (left), we can use left_join():
left_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass
  • To keep only data for samples from the band_instruments data set (right), we can use right_join():
right_join(band_instruments, band_members)
## Joining, by = "name"
## # A tibble: 3 x 3
##   name  plays  band   
##   <chr> <chr>  <chr>  
## 1 Mick  <NA>   Stones 
## 2 John  guitar Beatles
## 3 Paul  bass   Beatles
  • To keep only data that is common between two data sets, we can use inner_join():
inner_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 2 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass

So, the result does not have NA’s any more

  • To keep only data that is in the left table but not on the right table, we can use anti_join():
anti_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 1 x 2
##   name  band  
##   <chr> <chr> 
## 1 Mick  Stones

… and the other way around:

anti_join(band_instruments, band_members)
## Joining, by = "name"
## # A tibble: 1 x 2
##   name  plays 
##   <chr> <chr> 
## 1 Keith guitar

These functions are very useful to compare big datasets:

  • inner_join() is equivalent to the intersect of the samples in two data sets.
  • full_join() is equivalent to the union of the samples in the two data sets.

Reshapping data (gather() and spread())

Often, we might have data in what is sometimes called a “wide” format.

For example, in the iris dataset, we have classic data of petal and sepal lengths and widths in three Iris species. In this table, each organ’s measurement has its own column:

# For convenience convert the iris dataset to a tibble (so it doesn't print out of the screen!)
iris <- as_tibble(iris)
iris
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1         5.10        3.50         1.40       0.200 setosa 
##  2         4.90        3.00         1.40       0.200 setosa 
##  3         4.70        3.20         1.30       0.200 setosa 
##  4         4.60        3.10         1.50       0.200 setosa 
##  5         5.00        3.60         1.40       0.200 setosa 
##  6         5.40        3.90         1.70       0.400 setosa 
##  7         4.60        3.40         1.40       0.300 setosa 
##  8         5.00        3.40         1.50       0.200 setosa 
##  9         4.40        2.90         1.40       0.200 setosa 
## 10         4.90        3.10         1.50       0.100 setosa 
## # ... with 140 more rows

This format is convenient if we wanted to plot the correlation between these measurements, for example:

ggplot(iris, aes(Petal.Length, Sepal.Length, colour = Species)) + 
  geom_point()

But what if we wanted to plot the distribution (as boxplots) of each organ’s length and width?

We would instead want our table to be in a “long” format, like this:

Species trait size
setosa Petal.Length 1.7
setosa Sepal.Length 5.1
setosa Sepal.Length 4.4
setosa Sepal.Length 4.4
setosa Sepal.Width 3.9
setosa Sepal.Width 3.7
versicolor Petal.Width 1.3
versicolor Sepal.Length 6.4
versicolor Sepal.Length 6.3
virginica Petal.Width 2.3

There is a pair of dplyr functions that can convert data from one format to the other. They are called gather() and spread().

Here is a schematic representation of what these two functions do:

To “gather” data (i.e. from a “wide” to a “long” format), we do:

iris_long <- iris %>% 
  mutate(flower_id = 1:n()) %>%         # Make an identifier for each flower
  gather(trait, size, -Species, -flower_id)
iris_long
## # A tibble: 600 x 4
##    Species flower_id trait         size
##    <fct>       <int> <chr>        <dbl>
##  1 setosa          1 Sepal.Length  5.10
##  2 setosa          2 Sepal.Length  4.90
##  3 setosa          3 Sepal.Length  4.70
##  4 setosa          4 Sepal.Length  4.60
##  5 setosa          5 Sepal.Length  5.00
##  6 setosa          6 Sepal.Length  5.40
##  7 setosa          7 Sepal.Length  4.60
##  8 setosa          8 Sepal.Length  5.00
##  9 setosa          9 Sepal.Length  4.40
## 10 setosa         10 Sepal.Length  4.90
## # ... with 590 more rows

gather() needs three things:

  • the name of the new column containing the old column names (in this case trait)
  • the name of the new column containing the values of the old columns (in this case size)
  • the name of all the old columns we want to gather together

Notice that gather() takes care of filling in the id column with the values corresponding to each new row of this table (in this case, each flower ID now appears four times - because we have four measurements per flower).

Now we can do the plot we wanted:

ggplot(iris_long, aes(trait, size, fill = Species)) + 
  geom_boxplot()

To “spread” the data (long –> wide), we would do:

iris_long %>% 
  spread(trait, size)
## # A tibble: 150 x 6
##    Species flower_id Petal.Length Petal.Width Sepal.Length Sepal.Width
##  * <fct>       <int>        <dbl>       <dbl>        <dbl>       <dbl>
##  1 setosa          1         1.40       0.200         5.10        3.50
##  2 setosa          2         1.40       0.200         4.90        3.00
##  3 setosa          3         1.30       0.200         4.70        3.20
##  4 setosa          4         1.50       0.200         4.60        3.10
##  5 setosa          5         1.40       0.200         5.00        3.60
##  6 setosa          6         1.70       0.400         5.40        3.90
##  7 setosa          7         1.40       0.300         4.60        3.40
##  8 setosa          8         1.50       0.200         5.00        3.40
##  9 setosa          9         1.40       0.200         4.40        2.90
## 10 setosa         10         1.50       0.100         4.90        3.10
## # ... with 140 more rows

spread() only needs two things:

  • the name of the column that contains the values that will become the new column names
  • the name of the column that contains the values of interest

Separate one column into several ones

Sometimes, we might have different kind of information in one column and want to split it into several ones. For this, we can use separate().

For example, in the iris_long object, we might want to separate the trait column into a column indicating the flower organ (petal or sepal) and another one indicating the type of measurement (width or length).

These two pieces of information are separated by a .:

unique(iris_long$trait)
## [1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"

Using separate(), we can separate the values in this column into two columns:

# Automatic guess
iris_long %>% 
  separate(trait, c("organ", "measure"))
## # A tibble: 600 x 5
##    Species flower_id organ measure  size
##  * <fct>       <int> <chr> <chr>   <dbl>
##  1 setosa          1 Sepal Length   5.10
##  2 setosa          2 Sepal Length   4.90
##  3 setosa          3 Sepal Length   4.70
##  4 setosa          4 Sepal Length   4.60
##  5 setosa          5 Sepal Length   5.00
##  6 setosa          6 Sepal Length   5.40
##  7 setosa          7 Sepal Length   4.60
##  8 setosa          8 Sepal Length   5.00
##  9 setosa          9 Sepal Length   4.40
## 10 setosa         10 Sepal Length   4.90
## # ... with 590 more rows

Note how separate() removed the original column and replaced it with the newly created ones.

With this new trick, we can now create a slightly different plot:

iris_long %>% 
  separate(trait, c("organ", "measure")) %>% 
  ggplot(aes(Species, size, fill = measure)) +
  geom_boxplot() +
  facet_grid( ~ organ)

In the code above, we let separate() “guess” what the separator was (by default it finds any non-alphanumeric character). But we can be explicit, by telling it the separator is a .. However, the dot . symbol is a special character used in regular expressions. This is beyond the scope of this course, but in summary, regular expressions allow us to find patterns in character strings. (see here if you want to know more about them - they can be very useful to match patterns in our data!)

Because of this special use of the dot . symbol, we need to tell the function to ignore the special meaning of the dot . symbol by prefixing it with \\:

iris_long %>% 
  separate(trait, c("organ", "measure"), 
           sep = "\\.", 
           remove = FALSE)
## # A tibble: 600 x 6
##    Species flower_id trait        organ measure  size
##  * <fct>       <int> <chr>        <chr> <chr>   <dbl>
##  1 setosa          1 Sepal.Length Sepal Length   5.10
##  2 setosa          2 Sepal.Length Sepal Length   4.90
##  3 setosa          3 Sepal.Length Sepal Length   4.70
##  4 setosa          4 Sepal.Length Sepal Length   4.60
##  5 setosa          5 Sepal.Length Sepal Length   5.00
##  6 setosa          6 Sepal.Length Sepal Length   5.40
##  7 setosa          7 Sepal.Length Sepal Length   4.60
##  8 setosa          8 Sepal.Length Sepal Length   5.00
##  9 setosa          9 Sepal.Length Sepal Length   4.40
## 10 setosa         10 Sepal.Length Sepal Length   4.90
## # ... with 590 more rows

Notice how above we also retained the original column name.

Note: the reverse of separate() is unite():

iris_long %>% 
  separate(trait, c("organ", "measure")) %>% 
  unite(trait, organ, measure, sep = ".")
## # A tibble: 600 x 4
##    Species flower_id trait         size
##  * <fct>       <int> <chr>        <dbl>
##  1 setosa          1 Sepal.Length  5.10
##  2 setosa          2 Sepal.Length  4.90
##  3 setosa          3 Sepal.Length  4.70
##  4 setosa          4 Sepal.Length  4.60
##  5 setosa          5 Sepal.Length  5.00
##  6 setosa          6 Sepal.Length  5.40
##  7 setosa          7 Sepal.Length  4.60
##  8 setosa          8 Sepal.Length  5.00
##  9 setosa          9 Sepal.Length  4.40
## 10 setosa         10 Sepal.Length  4.90
## # ... with 590 more rows

Changing order of labels in plots

In the following boxplot

ggplot(iris, aes(Species, Sepal.Width)) + 
  geom_boxplot()

the order in which Species are plotted is alphabetical. This is the case, even if the order of the Species on the table was not alphabetical.

But what if we wanted a different order?

For this, we need to convert our Species variable to a special type called factor. Factors are a special type of character variables, where we define levels, which are the unique values we have in our variable.

We can convert a variable to a factor using the factor() function:

# Convert Species to a factor:
factor(iris$Species)
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

We can see that our Species variable contains three levels, corresponding to each of the species. The levels were automatically defined based on their alphabetical order. We can change this by specifying the levels ourselves:

factor(iris$Species, levels = c("versicolor", "virginica", "setosa"))
##   [1] setosa     setosa     setosa     setosa     setosa     setosa    
##   [7] setosa     setosa     setosa     setosa     setosa     setosa    
##  [13] setosa     setosa     setosa     setosa     setosa     setosa    
##  [19] setosa     setosa     setosa     setosa     setosa     setosa    
##  [25] setosa     setosa     setosa     setosa     setosa     setosa    
##  [31] setosa     setosa     setosa     setosa     setosa     setosa    
##  [37] setosa     setosa     setosa     setosa     setosa     setosa    
##  [43] setosa     setosa     setosa     setosa     setosa     setosa    
##  [49] setosa     setosa     versicolor versicolor versicolor versicolor
##  [55] versicolor versicolor versicolor versicolor versicolor versicolor
##  [61] versicolor versicolor versicolor versicolor versicolor versicolor
##  [67] versicolor versicolor versicolor versicolor versicolor versicolor
##  [73] versicolor versicolor versicolor versicolor versicolor versicolor
##  [79] versicolor versicolor versicolor versicolor versicolor versicolor
##  [85] versicolor versicolor versicolor versicolor versicolor versicolor
##  [91] versicolor versicolor versicolor versicolor versicolor versicolor
##  [97] versicolor versicolor versicolor versicolor virginica  virginica 
## [103] virginica  virginica  virginica  virginica  virginica  virginica 
## [109] virginica  virginica  virginica  virginica  virginica  virginica 
## [115] virginica  virginica  virginica  virginica  virginica  virginica 
## [121] virginica  virginica  virginica  virginica  virginica  virginica 
## [127] virginica  virginica  virginica  virginica  virginica  virginica 
## [133] virginica  virginica  virginica  virginica  virginica  virginica 
## [139] virginica  virginica  virginica  virginica  virginica  virginica 
## [145] virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: versicolor virginica setosa

Once we convert our variable to a factor, we can use it to plot categorical variables in the order we wish:

# Using mutate to modify the Species column into a factor with custom level order
iris %>% 
  mutate(Species = factor(Species, levels = c("virginica", "versicolor", "setosa"))) %>% 
  ggplot(aes(Species, Sepal.Width)) + 
  geom_boxplot()

What if we wanted to order the species based on their Sepal.Length? The forcats package (part of tidyverse) has a very useful function for this, fct_reorder()!

iris %>% 
  mutate(Species = fct_reorder(Species, Sepal.Width)) %>% 
  ggplot(aes(Species, Sepal.Width)) + 
  geom_boxplot()

Sorting data

Sometimes, we might want to order our data based on the values in one column. For this, we can use arrange():

iris %>% 
  arrange(Petal.Width)
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1         4.90        3.10         1.50       0.100 setosa 
##  2         4.80        3.00         1.40       0.100 setosa 
##  3         4.30        3.00         1.10       0.100 setosa 
##  4         5.20        4.10         1.50       0.100 setosa 
##  5         4.90        3.60         1.40       0.100 setosa 
##  6         5.10        3.50         1.40       0.200 setosa 
##  7         4.90        3.00         1.40       0.200 setosa 
##  8         4.70        3.20         1.30       0.200 setosa 
##  9         4.60        3.10         1.50       0.200 setosa 
## 10         5.00        3.60         1.40       0.200 setosa 
## # ... with 140 more rows

The default is to order values from low to high. If we want to order from high to low, we have to specify it using desc():

iris %>% 
  arrange(desc(Petal.Width))
## # A tibble: 150 x 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>    
##  1         6.30        3.30         6.00        2.50 virginica
##  2         7.20        3.60         6.10        2.50 virginica
##  3         6.70        3.30         5.70        2.50 virginica
##  4         5.80        2.80         5.10        2.40 virginica
##  5         6.30        3.40         5.60        2.40 virginica
##  6         6.70        3.10         5.60        2.40 virginica
##  7         6.40        3.20         5.30        2.30 virginica
##  8         7.70        2.60         6.90        2.30 virginica
##  9         6.90        3.20         5.70        2.30 virginica
## 10         7.70        3.00         6.10        2.30 virginica
## # ... with 140 more rows

Key points to remember

  • dplyr has many many useful functions to manipulate tables
  • Today we have seen:
    • left_join(), right_join(), full_join(), inner_join() and anti_join()
    • gather()
    • spread()
    • arrange()
    • separate()
  • Change the order of categorical variables in your graphs using factor() and fct_reorder()
  • There are many dplyr functions we did not cover, but two additional useful ones are:
    • rename() to change column names
    • top_n() to filter the table based on the top/bottom entries of a column