This document is aimed at beginning R users who started by learning “tidyverse” functions. It shows how common tasks done with “tidyverse” functions can also be solved with “base R” syntax (that is, with functions from the core packages distributed with R).
It is important to understand that there are many different ways to achieve the same thing, particularly in base R. I have tried to use simpler syntax where possible. For the more complicated cases, I have based my solutions on top hits from web searches.
It is also worth remembering that when you look for help online, you might find solutions using “base R” syntax. If you are used to “tidyverse” packages, it will help to add keywords such as “dplyr” and “tidyverse” to your search.
There are many other resources available online, in particular several cheatsheets that might be useful as a reference: https://www.rstudio.com/resources/cheatsheets/
Load the packages needed for these examples:
library(dplyr)
library(tidyr)
The datasets used in these examples come pre-loaded with R. To find out more about them, check their help pages:
?iris
?mtcars
?Indometh
tidyverse
select(iris, Species, Petal.Width) # by name
select(iris, 5, 4) # by column index
base R
iris[, c("Species", "Petal.Width")] # by name
iris[, c(5, 4)] # by column index
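One difference worth knowing when selecting a single column: `[, ]` simplifies the result to a vector by default, whereas `select()` always returns a data.frame. The sketch below (the object names are only for illustration) shows how `drop = FALSE` keeps the data.frame structure:

```r
# Selecting one column with [, ] simplifies to a vector by default
species_vec <- iris[, "Species"]

# drop = FALSE keeps the data.frame structure, as dplyr::select() does
species_df <- iris[, "Species", drop = FALSE]

# List-style indexing with a single bracket also returns a data.frame
species_df2 <- iris["Species"]

class(species_vec)
class(species_df)
```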
tidyverse
mutate(iris,
Petal.Ratio = Petal.Length/Petal.Width,
Sepal.Ratio = Sepal.Length/Sepal.Width)
base R
iris$Petal.Ratio <- iris$Petal.Length/iris$Petal.Width
iris$Sepal.Ratio <- iris$Sepal.Length/iris$Sepal.Width
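If you would rather not modify `iris` in place, `transform()` is a base R function that reads more like `mutate()`. This is a minimal sketch; the name `iris_ratios` is just for illustration. Unlike `mutate()`, columns created inside one `transform()` call cannot be used by later arguments of that same call.

```r
# Returns a modified copy; the original iris is left untouched
iris_ratios <- transform(iris,
                         Petal.Ratio = Petal.Length / Petal.Width,
                         Sepal.Ratio = Sepal.Length / Sepal.Width)

head(iris_ratios)
```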
tidyverse
filter(iris, Petal.Width > 0.5 & Species == "setosa")
base R
# Using [,]
iris[iris$Petal.Width > 0.5 & iris$Species == "setosa", ]
# Using subset (works very much like dplyr::filter)
subset(iris, Petal.Width > 0.5 & Species == "setosa")
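The two base R approaches differ when the condition evaluates to `NA`: `[ , ]` returns a row filled with `NA`, while `subset()` (like `dplyr::filter()`) drops it. The toy data.frame below is made up for illustration, since `iris` has no missing values:

```r
# Toy data with a missing value (hypothetical example data)
df <- data.frame(x = c(1, NA, 3), g = c("a", "a", "b"))

# The condition is TRUE, NA, FALSE for the three rows
with_brackets <- df[df$x > 0.5 & df$g == "a", ]      # keeps an all-NA row
with_subset <- subset(df, x > 0.5 & g == "a")         # drops the NA case

# Wrapping the condition in which() makes [ , ] drop NA cases too
with_which <- df[which(df$x > 0.5 & df$g == "a"), ]

nrow(with_brackets)
nrow(with_subset)
nrow(with_which)
```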
tidyverse
# descending order of species (alphabetic) followed by ascending order of Petal.Width
arrange(iris, desc(Species), Petal.Width)
base R
# descending order of species (alphabetic) followed by ascending order of Petal.Width
# xtfrm() converts the factor to its integer codes, so negating it sorts in descending order
iris[order(-xtfrm(iris$Species), iris$Petal.Width), ]
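Mixing descending and ascending keys can also be written with the `decreasing` argument of `order()`, which accepts one value per key when `method = "radix"` is used. A small sketch (the name `sorted` is only for illustration):

```r
# One TRUE/FALSE per sorting key; this requires method = "radix"
sorted <- iris[order(iris$Species, iris$Petal.Width,
                     decreasing = c(TRUE, FALSE),
                     method = "radix"), ]

head(sorted)
```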
tidyverse
# Generic way
summarise(iris,
Petal.Length.mean = mean(Petal.Length),
Petal.Length.sd = sd(Petal.Length),
Sepal.Length.mean = mean(Sepal.Length),
Sepal.Length.sd = sd(Sepal.Length))
# Shortcut when same functions applied to same variables
summarise_at(iris,
.vars = c("Petal.Length", "Sepal.Length"),
.funs = c("mean", "sd"))
(See also the variants summarise_all() and summarise_if().)
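In dplyr 1.0.0 and later, the scoped variants such as `summarise_at()` are superseded by `across()`. Assuming a recent dplyr is installed, the same summary could be sketched as follows (the name `iris_summary` is only for illustration):

```r
library(dplyr)

# Apply each function in the named list to each selected column;
# result columns are named "<column>_<function>" by default
iris_summary <- summarise(iris,
                          across(c(Petal.Length, Sepal.Length),
                                 list(mean = mean, sd = sd)))

iris_summary
```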
base R
There are many ways to do this with base R; here are a couple of possibilities:
# Manually create a data.frame
data.frame(Petal.Length.mean = mean(iris$Petal.Length),
Petal.Length.sd = sd(iris$Petal.Length),
Sepal.Length.mean = mean(iris$Sepal.Length),
Sepal.Length.sd = sd(iris$Sepal.Length))
# Use the "aggregate" function
## Column names might have to be changed afterwards
aggregate(cbind(Sepal.Length, Petal.Length) ~ 1, # cbind() keeps the variables separate; the formula is passed unnamed
data = iris,
FUN = function(x) c(mean = mean(x), sd = sd(x)))
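Another possibility, shown here as a sketch, is to loop over the columns with `sapply()`. The result is a matrix with one row per statistic and one column per variable, rather than a one-row data.frame (the name `stats` is only for illustration):

```r
# Apply the same summary function to each selected column
stats <- sapply(iris[, c("Petal.Length", "Sepal.Length")],
                function(x) c(mean = mean(x), sd = sd(x)))

stats
```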
Consider the following tables (example datasets that ship with dplyr):
band_members
## # A tibble: 3 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 x 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
band_instruments2
## # A tibble: 3 x 2
## artist plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
tidyverse
# Retain rows with matches in both tables
inner_join(band_members, band_instruments, by = "name")
# Retain all rows:
full_join(band_members, band_instruments, by = "name")
# Retain all rows from first table:
left_join(band_members, band_instruments, by = "name")
# Retain all rows from second table:
right_join(band_members, band_instruments, by = "name")
Columns used for merging can have different names between tables:
inner_join(band_members, band_instruments2, by = c("name" = "artist"))
base R
# Retain rows with matches in both tables
merge(band_members, band_instruments, by = "name")
# Retain all rows:
merge(band_members, band_instruments, by = "name", all = TRUE)
# Retain all rows from first table:
merge(band_members, band_instruments, by = "name", all.x = TRUE)
# Retain all rows from second table:
merge(band_members, band_instruments, by = "name", all.y = TRUE)
Columns used for merging can have different names between tables:
merge(band_members, band_instruments2, by.x = "name", by.y = "artist")
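One behaviour to keep in mind: by default `merge()` sorts the result on the `by` columns, whereas `left_join()` keeps the row order of the first table. If the original order matters, one workaround is to carry an index column through the merge. In this sketch the two tables are recreated by hand so it runs without dplyr, and the `row_id` column is a helper introduced only for illustration:

```r
# Recreate the example tables as plain data.frames
band_members <- data.frame(name = c("Mick", "John", "Paul"),
                           band = c("Stones", "Beatles", "Beatles"),
                           stringsAsFactors = FALSE)
band_instruments <- data.frame(name = c("John", "Paul", "Keith"),
                               plays = c("guitar", "bass", "guitar"),
                               stringsAsFactors = FALSE)

# Remember the original row order before merging
band_members$row_id <- seq_len(nrow(band_members))

merged <- merge(band_members, band_instruments, by = "name", all.x = TRUE)
merged$name  # now sorted by the key column

# Restore the original order and drop the helper column
restored <- merged[order(merged$row_id), c("name", "band", "plays")]
restored
```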
Important note: with dplyr, grouped operations are initiated with the function group_by(). It is a good habit to use ungroup() at the end of a series of grouped operations, otherwise the groupings will be carried into downstream analysis, which is not always desirable. In the examples below we follow this convention.
Note on base R: with base R there are several ways to do these types of operations. Here we show a generic “split-apply-combine” solution using the by() function. These solutions require some understanding of list objects and how to write custom functions.
tidyverse
mtcars %>%
group_by(cyl, gear) %>%
summarise(mpg.mean = mean(mpg),
mpg.sd = sd(mpg),
wt.mean = mean(wt),
wt.sd = sd(wt)) %>%
ungroup() # remove any groupings from downstream analysis
base R
# First operate in the data.frame by group (split-apply)
mtcars_by <- by(mtcars,
INDICES = list(mtcars$cyl, mtcars$gear),
FUN = function(x){
data.frame(cyl = unique(x$cyl),
gear = unique(x$gear),
mpg.mean = mean(x$mpg),
mpg.sd = sd(x$mpg),
wt.mean = mean(x$wt),
wt.sd = sd(x$wt))
})
# Then combine the results into a data.frame
do.call(rbind, mtcars_by)
Alternative solution:
# Using aggregate
aggregate(cbind(mpg, wt) ~ cyl + gear, # formula passed unnamed
data = mtcars,
FUN = function(x){
c(mean = mean(x), sd = sd(x))
})
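When the function given to `aggregate()` returns more than one value, each summarised variable is stored as a matrix column, which can be awkward to work with afterwards. A common idiom, sketched here, is to pass the result through `do.call(data.frame, ...)` to get ordinary columns (the names `agg` and `agg_flat` are only for illustration):

```r
agg <- aggregate(cbind(mpg, wt) ~ cyl + gear,
                 data = mtcars,
                 FUN = function(x) c(mean = mean(x), sd = sd(x)))

ncol(agg)  # "mpg" and "wt" are each a two-column matrix

# Expand the matrix columns into regular columns
agg_flat <- do.call(data.frame, agg)
names(agg_flat)
```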
For example, center the measurements of “Petal.Width” by subtracting the mean within species.
tidyverse
iris %>%
group_by(Species) %>%
mutate(Petal.Width.centered = Petal.Width - mean(Petal.Width)) %>%
ungroup() # remove any groupings from downstream analysis
base R
# First operate in the data.frame by group (split-apply)
iris_by <- by(iris,
INDICES = iris$Species,
FUN = function(x){
x$Petal.Width.centered <- x$Petal.Width - mean(x$Petal.Width)
return(x)
})
# Then combine the results into a data.frame
do.call(rbind, iris_by)
Alternative solution:
iris$Petal.Width.centered <- ave(iris$Petal.Width, iris$Species, FUN = function(x) x - mean(x))
iris
For example, retain the flowers with the maximum “Petal.Width” within each “Species”.
tidyverse
iris %>%
group_by(Species) %>%
filter(Petal.Width == max(Petal.Width)) %>%
ungroup() # remove any groupings from downstream analysis
base R
# First operate in the data.frame by group (split-apply)
widest_petals <- by(iris,
INDICES = iris$Species,
FUN = function(x){
x[x$Petal.Width == max(x$Petal.Width), ]
})
# Then combine the results into a data.frame
do.call(rbind, widest_petals)
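A shorter alternative, sketched below, uses `ave()` to compute the per-species maximum for every row and then filters with `[ , ]`. Unlike the `by()` solution, it keeps the original row names (the name `widest` is only for illustration):

```r
# ave() returns a vector as long as the input, holding each row's group maximum
group_max <- ave(iris$Petal.Width, iris$Species, FUN = max)

widest <- iris[iris$Petal.Width == group_max, ]
widest
```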
tidyverse
gather(iris, key = "trait", value = "measurement", Sepal.Length:Petal.Width)
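In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`. Assuming a recent tidyr is installed, the equivalent call could be sketched as follows. The rows come out in a different order than with `gather()`, but the content is the same (the name `iris_long` is only for illustration):

```r
library(tidyr)

iris_long <- pivot_longer(iris,
                          cols = Sepal.Length:Petal.Width,
                          names_to = "trait",
                          values_to = "measurement")

head(iris_long)
```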
base R
I couldn’t find a clean way to do this in base R, although reshape() gets us close (but the “trait” variable is coded numerically):
reshape(iris,
varying = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
timevar = "trait",
idvar = "id",
v.names = "measurement",
direction = "long")
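The numeric coding can be avoided by also passing the column names through the `times` argument of `reshape()`, which sets the values written to the `timevar` column. A sketch (the names `traits` and `iris_long` are only for illustration):

```r
traits <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

iris_long <- reshape(iris,
                     varying = traits,
                     v.names = "measurement",
                     timevar = "trait",
                     times = traits,   # label each block with its column name
                     idvar = "id",
                     direction = "long")

head(iris_long)
```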
There is also stack(), but we lose the “Species” column in the process.
stack(iris, select = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
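The “Species” column can be added back by hand. Because `stack()` places the selected columns one after the other, repeating `Species` once per stacked column lines it up with the values. This is a sketch, with object names chosen for illustration:

```r
traits <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# stack() returns two columns: "values" and "ind" (the source column name)
stacked <- stack(iris[traits])

iris_long <- data.frame(Species = rep(iris$Species, times = length(traits)),
                        trait = stacked$ind,
                        measurement = stacked$values)

head(iris_long)
```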
tidyverse
spread(Indometh, key = "time", value = "conc")
Note that in this case it might have been better to first modify the “key” column so that the resulting column names are not numeric:
Indometh %>%
mutate(time = paste("conc", time, sep = "_")) %>%
spread(key = "time", value = "conc")
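In tidyr 1.0.0 and later, `spread()` is superseded by `pivot_wider()`, whose `names_prefix` argument takes care of the non-numeric column names in one step. The sketch below assumes a recent tidyr and first copies the three columns into a plain data.frame (`indo`, a name used only for illustration) to set aside the extra classes that `Indometh` carries:

```r
library(tidyr)

# Plain data.frame copy of the three Indometh columns
indo <- data.frame(Subject = Indometh$Subject,
                   time = Indometh$time,
                   conc = Indometh$conc)

indo_wide <- pivot_wider(indo,
                         names_from = time,
                         values_from = conc,
                         names_prefix = "conc_")

indo_wide
```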
base R
reshape(Indometh,
v.names = "conc",
idvar = "Subject",
timevar = "time",
direction = "wide")