back to course materials

1 Intro


1.1 Variables

What is the value of mass_index after running these commands?

mass <- 47.5
age  <- 100
mass_index <- mass/age
mass <- mass * 2.0

What type of vector do you get in each of the following cases (hint: use class())?

num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")

1.2 Conditional subsetting

Using this vector:

heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
  • create a new vector with missing values removed
# !is.na() is used to ask the question "Is NOT missing?"
heights_no_na <- heights[!is.na(heights)]
  • calculate the median of heights
# The median function has an option called "na.rm"
## note that not all functions have this option! Always check the help page of the functions
median(heights, na.rm = TRUE)

# Alternatively, we could use the version we created without missing values
median(heights_no_na)
  • calculate how many values are above 67
# Because logical values are stored as TRUE = 1 and FALSE = 0, if we sum a 
## logical vector it's equivalent to counting cases where this was true 
sum(heights_no_na > 67)

1.3 Basic data.frame manipulation

  • How many rows and columns does surveys have? (hint: functions ncol() and nrow())
nrow(surveys)
ncol(surveys)
  • What type of variables do you have in the table? (hint: function glimpse())
glimpse(surveys)
  1. How many missing values of weight are there? (remember you can use sum() on a logical vector)
    • extra: what proportion of values are missing?
# We can sum the output of the is.na function, which is a logical vector
sum(is.na(surveys$weight))

# Extra: divide by total number of observations
sum(is.na(surveys$weight))/nrow(surveys)
  1. How many unique genera were sampled? (hint: functions unique() and length())
length(unique(surveys$genus))

2 Manipulating data frames with dplyr


2.1 filter and select

Using pipes, subset the surveys data to include animals collected before 1995 and retain only the columns year, sex, and weight.

# Take the surveys table, then filter for year < 1995, then select relevant columns
surveys_pre1995 <- surveys %>% 
  filter(year < 1995) %>% 
  select(year, sex, weight)

The answer should have 21486 rows of data.

optional:

  • How many animals of taxa “Reptile” were collected in the 1980’s?
surveys %>% 
  filter(year >= 1980 & year < 1990 & taxa == "Reptile")
  • Why does this code throw an error? Can you modify it to give the desired output?
# Get genus and plot_id of animals collected in 1995
surveys %>% 
  select(genus, plot_id) %>% 
  filter(year == 1995)
  • Subset the table to retain animals collected in the earliest year of the dataset.
# Filter to retain cases where the values of year are equal to the minimum of that same year column
surveys %>% 
  filter(year == min(year))

2.2 mutate, filter and select

Create a new data frame from the surveys data that meets the following criteria: contains only the species_id column and a new column called hindfoot_half containing values that are half the hindfoot_length values. In this hindfoot_half column, there are no NAs and all values are less than 30.

Hint: think about how the commands should be ordered to produce this data frame!

surveys_hindfoot_half <- surveys %>%
    filter(!is.na(hindfoot_length)) %>%
    mutate(hindfoot_half = hindfoot_length / 2) %>%
    filter(hindfoot_half < 30) %>%
    select(species_id, hindfoot_half)

The answer should have 31436 rows and 2 columns.

optional:

  • Create a new column in the table that contain the ratio between the weight and hindfoot length. We’re then only interested in cases where the base 2 logarithm of this ratio (hint: see log2() function) is either greater than 2 or smaller than -2. The result should have 805 rows.
surveys %>% 
  mutate(weight_hind_ratio = weight/hindfoot_length) %>% 
  filter(log2(weight_hind_ratio) > 2 | log2(weight_hind_ratio) < -2)

2.3 grouped summaries

  • How many animals were caught in each plot_type surveyed?
# Use the count function to count how many in each plot_type
surveys %>%
    count(plot_type) 
  • Make a summary table with the mean, minimum and maximum hindfoot length of each species. Include the number of individuals observed per species as well as the number of those individuals that had non-missing hindfoot length (hint: remember you can use sum(!is.na(x)) to count how many non-missing values there are in a vector).
# for each species
# calculate mean, min and max, 
# number of observations per species 
# and number of non-missing hindfoot_length values
surveys %>%
  group_by(species_id) %>%
  summarize(
    mean_hindfoot_length = mean(hindfoot_length, na.rm = TRUE),
    min_hindfoot_length = min(hindfoot_length, na.rm = TRUE),
    max_hindfoot_length = max(hindfoot_length, na.rm = TRUE),
    n = n(),
    n_hindfoot = sum(!is.na(hindfoot_length))
  )
  • What was the heaviest animal measured in each year? Return the columns year, genus, species_id and weight.
surveys %>%
  # remove missing values
  filter(!is.na(weight)) %>%
  # for each year
  group_by(year) %>%
  # retain those where the weight value is equal to the maximum weight value
  filter(weight == max(weight)) %>%
  # select only few columns
  select(year, genus, species, weight) %>%
  # then sort the table by year
  arrange(year)

2.4 Data cleaning

  • Create a new data frame called surveys_complete, which:
    • has no missing values for weight, hindfoot_length and sex
    • contains only species that were measured in at least 50 individuals
# First filter away the missing values
surveys_complete <- surveys %>% 
  filter(!is.na(weight) & !is.na(sex) & !is.na(hindfoot_length))

# Add a column with the count of each species 
surveys_complete <- surveys_complete %>% 
  group_by(species_id) %>% 
  mutate(n_obs = n())

# Now filter the table so that number of observations is >= 50
surveys_complete <- surveys_complete %>% 
  filter(n_obs >= 50)

# Or all of the above could have been done in one pipeline
surveys_complete <- surveys %>% 
  filter(!is.na(weight) & !is.na(sex) & !is.na(hindfoot_length)) %>% 
  group_by(species_id) %>% 
  mutate(n_obs = n()) %>% 
  filter(n_obs >= 50)

# note the course notes shows a different way of doing this:
# https://datacarpentry.org/R-ecology-lesson/03-dplyr.html#exporting_data

The final data frame should have 30463 rows.

  • Optional: save the new table as a CSV file in the data_output folder of your project directory (use write_csv() function).
write_csv(surveys_complete, "data_output/surveys_complete.csv")

3 Plotting with ggplot2


3.1 distributions

Boxplots are useful summaries, but hide the shape of the distribution. For example, if the distribution is bimodal, we would not see it in a boxplot. An alternative to the boxplot is the violin plot, where the shape (of the density of points) is drawn.

  • Use geom_violin() to create violin plots of weight distributions for each genus
surveys_complete %>% 
  ggplot(aes(genus, weight)) +
  geom_violin()
  • Some individuals are much heavier than others. Using a log scale in this case would help. Try adding a “scale” layer with scale_y_log10().
surveys_complete %>% 
  ggplot(aes(genus, weight)) +
  geom_violin() +
  scale_y_log10()
  • Can you overlay a boxplot with outliers removed (check ?geom_boxplot help)?
surveys_complete %>% 
  ggplot(aes(genus, weight)) +
  geom_violin() +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  scale_y_log10()
  • Try and recreate the picture below, showing the distribution of weights by species_id, with violin plots coloured by genus.
surveys_complete %>% 
  ggplot(aes(species_id, weight)) +
  geom_violin(aes(fill = genus)) +
  geom_boxplot(width = 0.2, outlier.shape = NA) +
  scale_y_log10()