In this module we are going to see how to do more advanced data manipulation using dplyr
, which is also a part of the tidyverse
package family.
We will learn how to:
First let’s load the tidyverse
package:
library(tidyverse)
library(forcats)
We will use example datasets that come with R (no need to read it from anywhere).
*_join()
)Often when analysing data we have many tables containing different types of related information. So, to answer a particular question it is a common task to combine multiple tables by a certain column or columns. The dplyr
package contains a family of *_join
functions that allow us to this.
All variations of the *_join()
functions need three things:
Let’s first have a look at the full_join()
function, it will return all rows and all columns from both datasets. We will use 2 built-in datasets to illustrate its work:
band_members
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
Say, we want to combine data form this two tables into one, so that we have information about name, band and instrument in one place:
band_members %>% full_join(band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
The alternative syntax, without pipes would be:
full_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
You can see that people with no information about their band or instrument have ’NA’s now in the respective column. Here we did not have to specify the column we wanted to join by because R was very smart and picked the only possible option - joining by the name
column. Though in more complex scenarious, when there are more than 1 columns with the same name between 2 datasets - you should specify which one you would like to join by (or both).
Let’s look at other variations of the *_join()
functions:
band_members
data set (left), we can use left_join()
:left_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
band_instruments
data set (right), we can use right_join()
:right_join(band_instruments, band_members)
## Joining, by = "name"
## # A tibble: 3 x 3
## name plays band
## <chr> <chr> <chr>
## 1 Mick <NA> Stones
## 2 John guitar Beatles
## 3 Paul bass Beatles
inner_join()
:inner_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 2 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
So, the result does not have NA’s any more
anti_join()
:anti_join(band_members, band_instruments)
## Joining, by = "name"
## # A tibble: 1 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
… and the other way around:
anti_join(band_instruments, band_members)
## Joining, by = "name"
## # A tibble: 1 x 2
## name plays
## <chr> <chr>
## 1 Keith guitar
These functions are very useful to compare big datasets:
inner_join()
is equivalent to the intersect of the samples in two data sets.full_join()
is equivalent to the union of the samples in the two data sets.gather()
and spread()
)Often, we might have data in what is sometimes called a “wide” format.
For example, in the iris
dataset, we have classic data of petal and sepal lengths and widths in three Iris species. In this table, each organ’s measurement has its own column:
# For convenience convert the iris dataset to a tibble (so it doesn't print out of the screen!)
iris <- as_tibble(iris)
iris
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.10 3.50 1.40 0.200 setosa
## 2 4.90 3.00 1.40 0.200 setosa
## 3 4.70 3.20 1.30 0.200 setosa
## 4 4.60 3.10 1.50 0.200 setosa
## 5 5.00 3.60 1.40 0.200 setosa
## 6 5.40 3.90 1.70 0.400 setosa
## 7 4.60 3.40 1.40 0.300 setosa
## 8 5.00 3.40 1.50 0.200 setosa
## 9 4.40 2.90 1.40 0.200 setosa
## 10 4.90 3.10 1.50 0.100 setosa
## # ... with 140 more rows
This format is convenient if we wanted to plot the correlation between these measurements, for example:
ggplot(iris, aes(Petal.Length, Sepal.Length, colour = Species)) +
geom_point()
But what if we wanted to plot the distribution (as boxplots) of each organ’s length and width?
We would instead want our table to be in a “long” format, like this:
Species | trait | size |
---|---|---|
setosa | Petal.Length | 1.7 |
setosa | Sepal.Length | 5.1 |
setosa | Sepal.Length | 4.4 |
setosa | Sepal.Length | 4.4 |
setosa | Sepal.Width | 3.9 |
setosa | Sepal.Width | 3.7 |
versicolor | Petal.Width | 1.3 |
versicolor | Sepal.Length | 6.4 |
versicolor | Sepal.Length | 6.3 |
virginica | Petal.Width | 2.3 |
There is a pair of dplyr
functions that can convert data from one format to the other. They are called gather()
and spread()
.
Here is a schematic representation of what these two functions do:
To “gather” data (i.e. from a “wide” to a “long” format), we do:
iris_long <- iris %>%
mutate(flower_id = 1:n()) %>% # Make an identifier for each flower
gather(trait, size, -Species, -flower_id)
iris_long
## # A tibble: 600 x 4
## Species flower_id trait size
## <fct> <int> <chr> <dbl>
## 1 setosa 1 Sepal.Length 5.10
## 2 setosa 2 Sepal.Length 4.90
## 3 setosa 3 Sepal.Length 4.70
## 4 setosa 4 Sepal.Length 4.60
## 5 setosa 5 Sepal.Length 5.00
## 6 setosa 6 Sepal.Length 5.40
## 7 setosa 7 Sepal.Length 4.60
## 8 setosa 8 Sepal.Length 5.00
## 9 setosa 9 Sepal.Length 4.40
## 10 setosa 10 Sepal.Length 4.90
## # ... with 590 more rows
gather()
needs three things:
trait
)size
)Notice that gather()
takes care of filling in the id
column with the values corresponding to each new row of this table (in this case, each flower ID now appears four times - because we have four measurements per flower).
Now we can do the plot we wanted:
ggplot(iris_long, aes(trait, size, fill = Species)) +
geom_boxplot()
To “spread” the data (long –> wide), we would do:
iris_long %>%
spread(trait, size)
## # A tibble: 150 x 6
## Species flower_id Petal.Length Petal.Width Sepal.Length Sepal.Width
## * <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 setosa 1 1.40 0.200 5.10 3.50
## 2 setosa 2 1.40 0.200 4.90 3.00
## 3 setosa 3 1.30 0.200 4.70 3.20
## 4 setosa 4 1.50 0.200 4.60 3.10
## 5 setosa 5 1.40 0.200 5.00 3.60
## 6 setosa 6 1.70 0.400 5.40 3.90
## 7 setosa 7 1.40 0.300 4.60 3.40
## 8 setosa 8 1.50 0.200 5.00 3.40
## 9 setosa 9 1.40 0.200 4.40 2.90
## 10 setosa 10 1.50 0.100 4.90 3.10
## # ... with 140 more rows
spread()
only needs two things:
Sometimes, we might have different kind of information in one column and want to split it into several ones. For this, we can use separate()
.
For example, in the iris_long
object, we might want to separate the trait
column into a column indicating the flower organ (petal or sepal) and another one indicating the type of measurement (width or length).
These two pieces of information are separated by a .
:
unique(iris_long$trait)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
Using separate()
, we can separate the values in this column into two columns:
# Automatic guess
iris_long %>%
separate(trait, c("organ", "measure"))
## # A tibble: 600 x 5
## Species flower_id organ measure size
## * <fct> <int> <chr> <chr> <dbl>
## 1 setosa 1 Sepal Length 5.10
## 2 setosa 2 Sepal Length 4.90
## 3 setosa 3 Sepal Length 4.70
## 4 setosa 4 Sepal Length 4.60
## 5 setosa 5 Sepal Length 5.00
## 6 setosa 6 Sepal Length 5.40
## 7 setosa 7 Sepal Length 4.60
## 8 setosa 8 Sepal Length 5.00
## 9 setosa 9 Sepal Length 4.40
## 10 setosa 10 Sepal Length 4.90
## # ... with 590 more rows
Note how separate()
removed the original column and replaced it with the newly created ones.
With this new trick, we can now create a slightly different plot:
iris_long %>%
separate(trait, c("organ", "measure")) %>%
ggplot(aes(Species, size, fill = measure)) +
geom_boxplot() +
facet_grid( ~ organ)
In the code above, we let separate()
“guess” what the separator was (by default it finds any non-alphanumeric character). But we can be explicit, by telling it the separator is a .
. However, the dot .
symbol is a special character used in regular expressions. This is beyond the scope of this course, but in summary, regular expressions allow us to find patterns in character strings. (see here if you want to know more about them - they can be very useful to match patterns in our data!)
Because of this special use of the dot .
symbol, we need to tell the function to ignore the special meaning of the dot .
symbol by prefixing it with \\
:
iris_long %>%
separate(trait, c("organ", "measure"),
sep = "\\.",
remove = FALSE)
## # A tibble: 600 x 6
## Species flower_id trait organ measure size
## * <fct> <int> <chr> <chr> <chr> <dbl>
## 1 setosa 1 Sepal.Length Sepal Length 5.10
## 2 setosa 2 Sepal.Length Sepal Length 4.90
## 3 setosa 3 Sepal.Length Sepal Length 4.70
## 4 setosa 4 Sepal.Length Sepal Length 4.60
## 5 setosa 5 Sepal.Length Sepal Length 5.00
## 6 setosa 6 Sepal.Length Sepal Length 5.40
## 7 setosa 7 Sepal.Length Sepal Length 4.60
## 8 setosa 8 Sepal.Length Sepal Length 5.00
## 9 setosa 9 Sepal.Length Sepal Length 4.40
## 10 setosa 10 Sepal.Length Sepal Length 4.90
## # ... with 590 more rows
Notice how above we also retained the original column name.
Note: the reverse of separate()
is unite()
:
iris_long %>%
separate(trait, c("organ", "measure")) %>%
unite(trait, organ, measure, sep = ".")
## # A tibble: 600 x 4
## Species flower_id trait size
## * <fct> <int> <chr> <dbl>
## 1 setosa 1 Sepal.Length 5.10
## 2 setosa 2 Sepal.Length 4.90
## 3 setosa 3 Sepal.Length 4.70
## 4 setosa 4 Sepal.Length 4.60
## 5 setosa 5 Sepal.Length 5.00
## 6 setosa 6 Sepal.Length 5.40
## 7 setosa 7 Sepal.Length 4.60
## 8 setosa 8 Sepal.Length 5.00
## 9 setosa 9 Sepal.Length 4.40
## 10 setosa 10 Sepal.Length 4.90
## # ... with 590 more rows
In the following boxplot
ggplot(iris, aes(Species, Sepal.Width)) +
geom_boxplot()
the order in which Species
are plotted is alphabetical. This is the case, even if the order of the Species on the table was not alphabetical.
But what if we wanted a different order?
For this, we need to convert our Species
variable to a special type called factor. Factors are a special type of character variables, where we define levels, which are the unique values we have in our variable.
We can convert a variable to a factor using the factor()
function:
# Convert Species to a factor:
factor(iris$Species)
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: setosa versicolor virginica
We can see that our Species
variable contains three levels, corresponding to each of the species. The levels were automatically defined based on their alphabetical order. We can change this by specifying the levels ourselves:
factor(iris$Species, levels = c("versicolor", "virginica", "setosa"))
## [1] setosa setosa setosa setosa setosa setosa
## [7] setosa setosa setosa setosa setosa setosa
## [13] setosa setosa setosa setosa setosa setosa
## [19] setosa setosa setosa setosa setosa setosa
## [25] setosa setosa setosa setosa setosa setosa
## [31] setosa setosa setosa setosa setosa setosa
## [37] setosa setosa setosa setosa setosa setosa
## [43] setosa setosa setosa setosa setosa setosa
## [49] setosa setosa versicolor versicolor versicolor versicolor
## [55] versicolor versicolor versicolor versicolor versicolor versicolor
## [61] versicolor versicolor versicolor versicolor versicolor versicolor
## [67] versicolor versicolor versicolor versicolor versicolor versicolor
## [73] versicolor versicolor versicolor versicolor versicolor versicolor
## [79] versicolor versicolor versicolor versicolor versicolor versicolor
## [85] versicolor versicolor versicolor versicolor versicolor versicolor
## [91] versicolor versicolor versicolor versicolor versicolor versicolor
## [97] versicolor versicolor versicolor versicolor virginica virginica
## [103] virginica virginica virginica virginica virginica virginica
## [109] virginica virginica virginica virginica virginica virginica
## [115] virginica virginica virginica virginica virginica virginica
## [121] virginica virginica virginica virginica virginica virginica
## [127] virginica virginica virginica virginica virginica virginica
## [133] virginica virginica virginica virginica virginica virginica
## [139] virginica virginica virginica virginica virginica virginica
## [145] virginica virginica virginica virginica virginica virginica
## Levels: versicolor virginica setosa
Once we convert our variable to a factor, we can use it to plot categorical variables in the order we wish:
# Using mutate to modify the Species column into a factor with custom level order
iris %>%
mutate(Species = factor(Species, levels = c("virginica", "versicolor", "setosa"))) %>%
ggplot(aes(Species, Sepal.Width)) +
geom_boxplot()
What if we wanted to order the species based on their Sepal.Length? The forcats
package (part of tidyverse
) has a very useful function for this, fct_reorder()
!
iris %>%
mutate(Species = fct_reorder(Species, Sepal.Width)) %>%
ggplot(aes(Species, Sepal.Width)) +
geom_boxplot()
Sometimes, we might want to order our data based on the values in one column. For this, we can use arrange()
:
iris %>%
arrange(Petal.Width)
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4.90 3.10 1.50 0.100 setosa
## 2 4.80 3.00 1.40 0.100 setosa
## 3 4.30 3.00 1.10 0.100 setosa
## 4 5.20 4.10 1.50 0.100 setosa
## 5 4.90 3.60 1.40 0.100 setosa
## 6 5.10 3.50 1.40 0.200 setosa
## 7 4.90 3.00 1.40 0.200 setosa
## 8 4.70 3.20 1.30 0.200 setosa
## 9 4.60 3.10 1.50 0.200 setosa
## 10 5.00 3.60 1.40 0.200 setosa
## # ... with 140 more rows
The default is to order values from low to high. If we want to order from high to low, we have to specify it using desc()
:
iris %>%
arrange(desc(Petal.Width))
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 6.30 3.30 6.00 2.50 virginica
## 2 7.20 3.60 6.10 2.50 virginica
## 3 6.70 3.30 5.70 2.50 virginica
## 4 5.80 2.80 5.10 2.40 virginica
## 5 6.30 3.40 5.60 2.40 virginica
## 6 6.70 3.10 5.60 2.40 virginica
## 7 6.40 3.20 5.30 2.30 virginica
## 8 7.70 2.60 6.90 2.30 virginica
## 9 6.90 3.20 5.70 2.30 virginica
## 10 7.70 3.00 6.10 2.30 virginica
## # ... with 140 more rows
dplyr
has many many useful functions to manipulate tablesleft_join()
, right_join()
, full_join()
, inner_join()
and anti_join()
gather()
spread()
arrange()
separate()
factor()
and fct_reorder()
dplyr
functions we did not cover, but two additional useful ones are:
rename()
to change column namestop_n()
to filter the table based on the top/bottom entries of a column