In this module we are going to see how to manipulate, subset and modify tables using dplyr
, which is also a part of the tidyverse
package family.
We will learn how to:
First let’s load the tidyverse
package and read the data that we will work on:
library(tidyverse)
# Read data using read_csv
expt1 <- read_csv("../data/burghardt_et_al_2015_expt1.csv")
select()
To get an idea of data contents we will first check number of dimentsions with dim()
and look at the column names with names()
.
dim(expt1)
## [1] 957 15
names(expt1)
## [1] "genotype" "background" "temperature"
## [4] "fluctuation" "day.length" "vernalization"
## [7] "survival.bolt" "bolt" "days.to.bolt"
## [10] "days.to.flower" "rosette.leaf.num" "cauline.leaf.num"
## [13] "blade.length.mm" "total.leaf.length.mm" "blade.ratio"
This is quite a large dataset, we don’t want to look at it all at once. Say, we are interested in just 2 columns for now: genotype
and temperature
.
To select the columns we will use select()
:
select(expt1, genotype, temperature)
## # A tibble: 957 x 2
## genotype temperature
## <chr> <int>
## 1 Col Ama 12
## 2 Col Ama 12
## 3 Col Ama 12
## 4 Col Ama 12
## 5 Col Ama 12
## 6 Col Ama 12
## 7 Col Ama 12
## 8 Col Ama 12
## 9 Col Ama 12
## 10 Col Ama 12
## # ... with 947 more rows
Let’s say we wanted to select the column genotype
and any column that contains the word “bolt” in its name. select()
allows you to do this:
select(expt1, genotype, contains("bolt"))
## # A tibble: 957 x 4
## genotype survival.bolt bolt days.to.bolt
## <chr> <chr> <chr> <int>
## 1 Col Ama Y Y 28
## 2 Col Ama Y Y 29
## 3 Col Ama Y Y 31
## 4 Col Ama Y Y 31
## 5 Col Ama Y Y 32
## 6 Col Ama Y Y 33
## 7 Col Ama Y Y 34
## 8 Col Ama Y Y 35
## 9 Col Ama Y Y 69
## 10 Col Ama Y Y 72
## # ... with 947 more rows
Challenge: can you select all the columns that contain values measured in mm?
filter()
Let’s say that we wanted to retain only the rows of our table with data on vernalized plants. This is a job for the filter()
function.
Before the filtering we need to have a look at the possible values in the vernalization
column, there are only 2 possibilities: ‘V’ and ‘NV’.
unique(expt1$vernalization)
## [1] "NV" "V"
(note: to select a single column from a table we can use the $
notation like above)
Since we are intereseted in vernalized plants only, we want to filter only those rows with the value “V”:
filter(expt1, vernalization == "V")
## # A tibble: 330 x 15
## genotype background temperature fluctuation day.length vernalization
## <chr> <chr> <int> <chr> <int> <chr>
## 1 Col Ama Col 12 Con 16 V
## 2 Col Ama Col 12 Con 16 V
## 3 Col Ama Col 12 Con 16 V
## 4 Col Ama Col 12 Con 16 V
## 5 Col Ama Col 12 Con 16 V
## 6 Col Ama Col 12 Con 16 V
## 7 Col Ama Col 12 Con 16 V
## 8 Col Ama Col 12 Con 16 V
## 9 Col Ama Col 12 Con 8 V
## 10 Col Ama Col 12 Con 8 V
## # ... with 320 more rows, and 9 more variables: survival.bolt <chr>,
## # bolt <chr>, days.to.bolt <int>, days.to.flower <int>,
## # rosette.leaf.num <int>, cauline.leaf.num <int>, blade.length.mm <dbl>,
## # total.leaf.length.mm <dbl>, blade.ratio <dbl>
To set the filtering conditions, several operators can be used:
Operator | Meaning | Example |
---|---|---|
< |
less than | filter(expt1, days.to.bolt < 20) |
<= |
less than or equal to | filter(expt1, days.to.bolt <= 20) |
> |
more than | filter(expt1, days.to.bolt > 20) |
>= |
more than or equal to | filter(expt1, days.to.bolt >= 20) |
== |
equal to | filter(expt1, days.to.bolt == 20) |
!= |
not equal to | filter(expt1, days.to.bolt != 20) |
%in% |
is contained in | filter(expt1, genotype %in% c("Col FRI", "Ler-1")) |
It is also possible to combine several conditions together using the following operators:
Operator | Meaning | Example |
---|---|---|
& |
AND | filter(expt1, days.to.bolt == 20 & genotype == "Ler-1") |
| |
OR | filter(expt1, rosette.leaf.num < 8 | rosette.leaf.num > 100) |
We can also identify missing values (NA
) using the is.na()
function and its negation (using !
):
Condition | Meaning | Example |
---|---|---|
is.na() |
is missing | filter(expt1, is.na(rosette.leaf.num)) |
!is.na() |
is not missing | filter(expt1, !is.na(rosette.leaf.num)) |
For example, if we want to select individuals that were vernalized and grown under fluctuating temperature conditions, we could do:
filter(expt1, vernalization == "V" & fluctuation == "Var")
Another example, if we wanted to select individuals grown with 8h day light or that were late flowering:
filter(expt1, day.length == "8" | days.to.bolt > 85)
Challenge: Can you filter the data to keep the following?
- Samples that are not in the the Ler background and that have been treated with fluctuating temperature.
- Samples that bolted in less than 57 days and with less than 40 leaves
- Samples of genotype
fca-6
for whichblade.ratio
is not missing
mutate()
This is a useful function to add new variables (i.e. new columns) to the table, or to modify existing ones.
Say, we want to add a new column, blade.length.cm
which will contain blade length in cm units. We can construct this column by using the already existing blade.length.mm
column:
# Create a new table with an extra column in relation to the previous table
expt1.cm <- mutate(expt1, blade.length.cm = blade.length.mm/10)
# Check that there's now a new column
colnames(expt1.cm)
## [1] "genotype" "background" "temperature"
## [4] "fluctuation" "day.length" "vernalization"
## [7] "survival.bolt" "bolt" "days.to.bolt"
## [10] "days.to.flower" "rosette.leaf.num" "cauline.leaf.num"
## [13] "blade.length.mm" "total.leaf.length.mm" "blade.ratio"
## [16] "blade.length.cm"
We can also modify/create more than one variable at a time, separating them by ,
:
# Make both blade.length and total.leaf.length in cm:
expt1.cm <- mutate(expt1,
blade.length.cm = blade.length.mm/10,
total.leaf.length.cm = total.leaf.length.mm/10)
# Check that those columns were now created
names(expt1.cm)
## [1] "genotype" "background" "temperature"
## [4] "fluctuation" "day.length" "vernalization"
## [7] "survival.bolt" "bolt" "days.to.bolt"
## [10] "days.to.flower" "rosette.leaf.num" "cauline.leaf.num"
## [13] "blade.length.mm" "total.leaf.length.mm" "blade.ratio"
## [16] "blade.length.cm" "total.leaf.length.cm"
Be careful! If we use the name of an existing variable, it will modify it instead.
%>%
pipes“Pipes” allows us to perform a sequence of operations, without the need for intermediate files (or very complicated nested commands).
Let’s say that we wanted to do several manipulations in our table:
Based on what we’ve learned so far, we could for example do the following:
# Create new column with mutate
expt1.filtered <- mutate(expt1,
total.leaf.length.cm = total.leaf.length.mm/10)
# Filter the table to retain only vernalised plants
expt1.filtered <- filter(expt1.filtered,
vernalization == "V")
# Select only the columns of interest
expt1.filtered <- select (expt1.filtered,
genotype,
total.leaf.length.cm )
dplyr
allows us to “chain” several commands using a special “pipe” function %>%
. This is how the above code could have been written using this trick:
expt1 %>%
mutate(total.leaf.length.cm = total.leaf.length.mm/10) %>%
filter(vernalization == "V") %>%
select(genotype, total.leaf.length.cm)
## # A tibble: 330 x 2
## genotype total.leaf.length.cm
## <chr> <dbl>
## 1 Col Ama 2.21
## 2 Col Ama 2.68
## 3 Col Ama 2.79
## 4 Col Ama 2.28
## 5 Col Ama 2.49
## 6 Col Ama 2.22
## 7 Col Ama 2.41
## 8 Col Ama 2.14
## 9 Col Ama 3.60
## 10 Col Ama 2.99
## # ... with 320 more rows
What the %>%
pipe does is take the output of the thing on the left-side of it and pass it on to the function that comes after it.
Challenge: Using
%>%
to do the following onexpt1
:
- Keep samples that are not in the the Ler background and that have been treated with fluctuating temperature.
- Transform the blade length in cm
- Keep only the columns containing information about the genotype, blade length (in cm) and bolting information (hint: use
contains()
)
We can pipe data manipulation to ggplot
to produce graphs for a subset of the data. This is very useful for interactive data exploration (it avoids having to constantly create new variables with <-
).
Let’s say we want to make a boxplot of the flowering time in function of the temperature fluctuation (like in last session), however only for the vernalised samples. For this, we can use the following code:
expt1 %>%
filter(vernalization == "V") %>%
ggplot(aes(fluctuation, days.to.flower)) +
geom_boxplot()
## Warning: Removed 15 rows containing non-finite values (stat_boxplot).
Challenge: Can you make a violin plot of the days to flower for the different genotypes that have been growing in short days (8h light)?
group_by
and summarise
Sometimes we might want to summarise our data into a smaller table.
This kind of operation can be achieved by combining two dplyr
functions: group_by()
and summarise()
. Please note, grouping with group_by()
doesn’t change how the data looks (apart from listing how it’s grouped). We can then do summaries based on those groupings with summarise()
.
For example, we can summarise the mean flowering time for each genotype:
expt1 %>%
group_by(genotype) %>%
summarise(mean.days.to.flower = mean(days.to.flower, na.rm = TRUE))
## # A tibble: 10 x 2
## genotype mean.days.to.flower
## <chr> <dbl>
## 1 Col Ama 55.82836
## 2 Col FRI 70.19167
## 3 fca-6 70.45000
## 4 flc-3 FRI 53.13333
## 5 flk-1 77.55769
## 6 fve-3 81.25000
## 7 ld-1 89.76471
## 8 Ler-1 55.64706
## 9 prmt5 FRI 88.67925
## 10 vin3-4 FRI 95.67000
You can see that the output contains two columns: genotype
, which was the column used to “group” the original table and mean.days.to.flower
, which was the column we created with the summarise
function. And there’s only 10 rows of this new table, which corresponds to one row for each genotype.
You can group the table by more than one variable, and you can also make more than one summary operation on the table. For example, we might want to calculate both the mean, median and standard deviation of the days to flower of each genotype grown in long and short days:
expt1 %>%
group_by(genotype, day.length) %>%
summarise(mean.days.flower = mean(days.to.flower, na.rm = TRUE),
sd.days.flower = sd(days.to.flower, na.rm = TRUE),
median.days.flower = median(days.to.flower, na.rm = TRUE))
## # A tibble: 20 x 5
## # Groups: genotype [?]
## genotype day.length mean.days.flower sd.days.flower
## <chr> <int> <dbl> <dbl>
## 1 Col Ama 8 75.75714 17.434295
## 2 Col Ama 16 34.03125 8.801098
## 3 Col FRI 8 90.63158 23.835471
## 4 Col FRI 16 51.69841 17.660449
## 5 fca-6 8 97.03571 17.108547
## 6 fca-6 16 47.18750 10.007860
## 7 flc-3 FRI 8 71.45070 17.647977
## 8 flc-3 FRI 16 32.81250 8.840518
## 9 flk-1 8 113.52381 8.133997
## 10 flk-1 16 53.19355 8.510462
## 11 fve-3 8 121.31250 7.913859
## 12 fve-3 16 61.21875 10.941883
## 13 ld-1 8 130.57895 10.905227
## 14 ld-1 16 65.53125 11.199033
## 15 Ler-1 8 72.88889 16.890873
## 16 Ler-1 16 36.25000 10.267393
## 17 prmt5 FRI 8 106.29787 23.934616
## 18 prmt5 FRI 16 74.64407 25.953382
## 19 vin3-4 FRI 8 142.00000 23.666603
## 20 vin3-4 FRI 16 67.27419 13.880248
## # ... with 1 more variables: median.days.flower <dbl>
Now, you can see there’s 20 rows in that table. This is because we have one row for each combination of genotype (there’s 10 of them) and day.length (8h and 16h).
It is often useful to find how many observations we have for each group.
In the example above, let’s say we wanted to know how many plants we have of each genotype. For this we can use the special function n()
, which will count the number of rows of each group in the original table:
expt1 %>%
group_by(genotype) %>%
summarise(mean.days.to.flower = mean(days.to.flower, na.rm = TRUE),
n.obs = n())
## # A tibble: 10 x 3
## genotype mean.days.to.flower n.obs
## <chr> <dbl> <int>
## 1 Col Ama 55.82836 135
## 2 Col FRI 70.19167 128
## 3 fca-6 70.45000 64
## 4 flc-3 FRI 53.13333 136
## 5 flk-1 77.55769 60
## 6 fve-3 81.25000 60
## 7 ld-1 89.76471 60
## 8 Ler-1 55.64706 68
## 9 prmt5 FRI 88.67925 125
## 10 vin3-4 FRI 95.67000 121
Please note, you can use n()
only inside summarise()
function. There is a similar function, count()
that can be used independently and piped in. Because counting the number of observations in a group is such a common operation, there’s actually a special function, count()
just to do that one thing:
expt1 %>%
count(genotype, day.length)
## # A tibble: 20 x 3
## genotype day.length n
## <chr> <int> <int>
## 1 Col Ama 8 71
## 2 Col Ama 16 64
## 3 Col FRI 8 64
## 4 Col FRI 16 64
## 5 fca-6 8 32
## 6 fca-6 16 32
## 7 flc-3 FRI 8 72
## 8 flc-3 FRI 16 64
## 9 flk-1 8 28
## 10 flk-1 16 32
## 11 fve-3 8 28
## 12 fve-3 16 32
## 13 ld-1 8 28
## 14 ld-1 16 32
## 15 Ler-1 8 36
## 16 Ler-1 16 32
## 17 prmt5 FRI 8 61
## 18 prmt5 FRI 16 64
## 19 vin3-4 FRI 8 57
## 20 vin3-4 FRI 16 64
Often, we want to apply the same set of summary functions to several variables. For example, in our dataset we might want to group the data by genotype and treatment groups and then calculate the mean and standard deviation for all the traits measured. This kind of operation can be achieved using the summarise_at
function.
Here is how it works:
# Summarise all traits by treatment groups and genotype
expt1 %>%
group_by(genotype, background, temperature, fluctuation, day.length, vernalization) %>%
summarise_at(vars(days.to.bolt, days.to.flower, rosette.leaf.num, cauline.leaf.num, blade.length.mm, total.leaf.length.mm, blade.ratio),
funs(mean, sd)) %>%
ungroup()
## # A tibble: 120 x 20
## genotype background temperature fluctuation day.length vernalization
## <chr> <chr> <int> <chr> <int> <chr>
## 1 Col Ama Col 12 Con 8 NV
## 2 Col Ama Col 12 Con 8 V
## 3 Col Ama Col 12 Con 16 NV
## 4 Col Ama Col 12 Con 16 V
## 5 Col Ama Col 12 Var 8 NV
## 6 Col Ama Col 12 Var 8 V
## 7 Col Ama Col 12 Var 16 NV
## 8 Col Ama Col 12 Var 16 V
## 9 Col Ama Col 22 Con 8 NV
## 10 Col Ama Col 22 Con 8 V
## # ... with 110 more rows, and 14 more variables: days.to.bolt_mean <dbl>,
## # days.to.flower_mean <dbl>, rosette.leaf.num_mean <dbl>,
## # cauline.leaf.num_mean <dbl>, blade.length.mm_mean <dbl>,
## # total.leaf.length.mm_mean <dbl>, blade.ratio_mean <dbl>,
## # days.to.bolt_sd <dbl>, days.to.flower_sd <dbl>,
## # rosette.leaf.num_sd <dbl>, cauline.leaf.num_sd <dbl>,
## # blade.length.mm_sd <dbl>, total.leaf.length.mm_sd <dbl>,
## # blade.ratio_sd <dbl>
Note: It is important to say that, after you use the group_by()
function, the table remains grouped by the variables you defined. This means that any operations you do afterwards will use this grouping, which might not be desirable! There is a function called ungroup()
, which is usually good to use after you finish your operations.
Once we have modified our data we might want to save it to come back to it later. For this we can use the function write_csv()
Let’s say that we were to remove some missing data from our data to share it with a collaborator (or later use in R ourselves):
expt1.blade.ratio.clean <- expt1 %>%
filter(!is.na(blade.ratio))
#alternatively, we can use drop_na() function that we used in the module02:
expt1.blade.ratio.clean2 <- expt1 %>% drop_na(blade.ratio)
write_csv(expt1.blade.ratio.clean, path = "../data/expt1_blade_ratio_clean.csv")
dplyr
has many many useful functions to manipulate tablesdplyr
functions can be “chained” together using the %>%
pipewrite_*
family of functions