The most fundamental data structures in R are:
Homogeneous | Heterogeneous | |
---|---|---|
1-d | atomic vector | list |
2-d | matrix | data.frame |
n-d | array |
(source)
Atomic vectors are often simply referred to as vectors. They are one-dimensional and homogeneous (can only contain values of one type).
There are four common types of vector:
# Integer
<- c(1L, 10L)
x_int class(x_int)
## [1] "integer"
# Numeric with double-precision digits
<- c(1, 2.1, 5)
x_dbl class(x_dbl)
## [1] "numeric"
# Character
<- c("cat", "dog", "fish")
x_chr class(x_chr)
## [1] "character"
# Logical
<- c(TRUE, FALSE, TRUE)
x_lgl class(x_lgl)
## [1] "logical"
Note that, by default, when creating a vector of numbers, unless we use the “L” suffix after the number, R will create a numeric vector. (Why “L” as the suffix? The answer isn’t clear.)
Because vectors are homogeneous, coercion occurs when there are mixed types:
# Coerced to character
<- c("a", 1, 5)
x_mixed class(x_mixed)
## [1] "character"
A matrix is a 2-dimensional object (rows and columns), and like atomic vectors it’s an homogenous object (can only contain values of one type).
<- matrix(1:12, nrow = 3, ncol = 4)
x_mat x_mat
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Subset can be with with square-bracket notation
(x[rows, columns]
):
1:2, 2:3] x_mat[
## [,1] [,2]
## [1,] 4 7
## [2,] 5 8
Matrix objects often have column and row names:
rownames(x_mat) <- c("gene1", "gene2", "gene3")
colnames(x_mat) <- c("sample1", "sample2", "sample3", "sample4")
x_mat
## sample1 sample2 sample3 sample4
## gene1 1 4 7 10
## gene2 2 5 8 11
## gene3 3 6 9 12
A common example of the use of matrix objects in bioinformatics is in processed data from RNA sequencing. These data are often stored as a genes-by-samples matrix of counts (quantifying the sequencing reads that aligned to each gene in each sample).
There are several functions optimised to calculate common summary
statistics across rows or columns of matrix objects. Some are part of
base R, others from the matrixStats
package.
library(matrixStats)
# row summaries
rowSums(x_mat) # sum
rowMeans(x_mat) # mean
rowSds(x_mat) # standard deviation
rowVars(x_mat) # variance
rowIQRs(x_mat) # inter-quartile range
rowMads(x_mat) # mean absolute deviation
# column (sample) summaries
colSums(x_mat) # sum
colMeans(x_mat) # mean
colSds(x_mat) # standard deviation
colVars(x_mat) # variance
colIQRs(x_mat) # inter-quartile range
colMads(x_mat) # mean absolute deviation
By default, matrix operations such as +
, -
,
*
and /
are done element-wise (or in a
vectorised manner). That is, when we take our matrix from above:
## sample1 sample2 sample3 sample4
## gene1 1 4 7 10
## gene2 2 5 8 11
## gene3 3 6 9 12
And multiply it by a vector of length 4:
* c(1, 10, 100, 1000) x_mat
## sample1 sample2 sample3 sample4
## gene1 1 4000 700 100
## gene2 20 5 8000 1100
## gene3 300 60 9 12000
R will go element-by-element through the columns of the matrix and multiply the values “recyling” the second vector as many times as it needs:
1 x 1 | 4 x 1000 | 7 x 100 |
2 x 10 | 5 x 1 | 8 x 1000 |
3 x 100 | 6 x 10 | 9 x 1 |
If what we want is to do matrix multiplication then we have
to use the operator %*%
:
%*% c(1, 10, 100, 1000) x_mat
## [,1]
## gene1 10741
## gene2 11852
## gene3 12963
Another common task when manipulating matrices is to transpose them.
For this, we can use the t()
function:
t(x_mat)
## gene1 gene2 gene3
## sample1 1 2 3
## sample2 4 5 6
## sample3 7 8 9
## sample4 10 11 12
Arrays are an N-dimensional generalisation of a matrix, and therefore are also homogeneous (can only contain values of one type).
<- array(1:24, dim = c(3, 4, 2))
x_arr x_arr
## , , 1
##
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
##
## , , 2
##
## [,1] [,2] [,3] [,4]
## [1,] 13 16 19 22
## [2,] 14 17 20 23
## [3,] 15 18 21 24
A common example of the use of arrays is to store image data (e.g. red, blue and green pixel intensity channels):
Lists are one of the more versatile types of object in R. The elements of a list can be of any type (even other lists), therefore they are heterogeneous objects. They are also referred to as a recursive vector.
<- list(a = 1:3,
x_lst b = data.frame(x = 0:3, y = 10:13))
x_lst
## $a
## [1] 1 2 3
##
## $b
## x y
## 1 0 10
## 2 1 11
## 3 2 12
## 4 3 13
Access elements in a list either by name (using $
) or by
position (using [[
):
$a x_lst
## [1] 1 2 3
1]] x_lst[[
## [1] 1 2 3
Note that x[1]
would return a list with one
element whereas x[[1]]
“pulls” the actual value from the
list and, in this example, returns a vector.
class(x_lst[1])
## [1] "list"
class(x_lst[[1]])
## [1] "integer"
Many functions in R return objects that are, in essence, a list. For
example, the lm()
function, used to fit linear models,
returns a list object (although the object has class “lm”):
<- lm(dist ~ speed, data = cars)
lm_example class(lm_example)
## [1] "lm"
str(lm_example) # structure of the object
## List of 12
## $ coefficients : Named num [1:2] -17.58 3.93
## ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
## $ residuals : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
## $ effects : Named num [1:50] -303.914 145.552 -8.115 9.885 0.194 ...
## ..- attr(*, "names")= chr [1:50] "(Intercept)" "speed" "" "" ...
## $ rank : int 2
## $ fitted.values: Named num [1:50] -1.85 -1.85 9.95 9.95 13.88 ...
## ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
## $ assign : int [1:2] 0 1
## $ qr :List of 5
## ..$ qr : num [1:50, 1:2] -7.071 0.141 0.141 0.141 0.141 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:2] "(Intercept)" "speed"
## .. ..- attr(*, "assign")= int [1:2] 0 1
## ..$ qraux: num [1:2] 1.14 1.27
## ..$ pivot: int [1:2] 1 2
## ..$ tol : num 1e-07
## ..$ rank : int 2
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 48
## $ xlevels : Named list()
## $ call : language lm(formula = dist ~ speed, data = cars)
## $ terms :Classes 'terms', 'formula' language dist ~ speed
## .. ..- attr(*, "variables")= language list(dist, speed)
## .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:2] "dist" "speed"
## .. .. .. ..$ : chr "speed"
## .. ..- attr(*, "term.labels")= chr "speed"
## .. ..- attr(*, "order")= int 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(dist, speed)
## .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
## $ model :'data.frame': 50 obs. of 2 variables:
## ..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
## ..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' language dist ~ speed
## .. .. ..- attr(*, "variables")= language list(dist, speed)
## .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:2] "dist" "speed"
## .. .. .. .. ..$ : chr "speed"
## .. .. ..- attr(*, "term.labels")= chr "speed"
## .. .. ..- attr(*, "order")= int 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(dist, speed)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
## .. .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
## - attr(*, "class")= chr "lm"
The data.frame
object is the most common object to
represent tabular data (usually imported using read.csv()
and related functions).
It is a 2-dimensional object but, unlike matrices, it is
heterogeneous (each column can be of a different type). The
data.frame
object is in fact a type of list, where
the elements (columns) are all of the same length.
Similarly to a list, we can extract elements by name or position:
$speed cars
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
1]] cars[[
## [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
Packages such as dplyr
(and the tidyverse
more generally) can be used to manipulate data.frame
objects.
When repeating a task, it can be useful to write our own custom functions.
Here is the basic syntax:
<- function(arguments){
my_function # code for what the function does
# return value
}
For example, let’s define a function that makes min-max scaling of a vector:
\(x' = \frac{x - min(x)}{max(x) - min(x)}\)
<- function(x){
scale_minmax <- (x - min(x))/(max(x) - min(x))
x_scaled return(x_scaled)
}
And test this function on a vector of numbers:
scale_minmax(1:10)
## [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
## [8] 0.7777778 0.8888889 1.0000000
When writing your own functions, it is a good practice to include as many checks as possible, to avoid possible unintended errors or (even worse) wrong answers.
Using if/else
statements can be very useful in these
circumstances. For example, here is how we could improve our previous
function:
<- function(x){
scale_minmax # check if x is numeric
if(!is.numeric(x)){
stop("x must be numeric.")
}# check if there are infinite values
if(any(is.infinite(x))){
warning("Infinite values present in the vector. This will result in NaN values.")
}
<- (x - min(x))/(max(x) - min(x))
x_scaled return(x_scaled)
}
We should now get a warning when there are infinite values, or an error if the vector is not character:
scale_minmax(c(1, 2, Inf))
## Warning in scale_minmax(c(1, 2, Inf)): Infinite values present in the vector.
## This will result in NaN values.
## [1] 0 0 NaN
Add a new argument to this option, to enable the user to ignore missing values when scaling the data.
Following the convention of other functions such as
mean()
, min()
, max()
, etc., we
can add an argument called na.rm
as follows:
<- function(x, na.rm = FALSE){
scale_minmax # check if x is numeric
if(!is.numeric(x)){
stop("x must be numeric.")
}# check if there are infinite values
if(any(is.infinite(x))){
warning("Infinite values present in the vector. This will result in NaN values.")
}
<- (x - min(x, na.rm = na.rm))/(max(x, na.rm = na.rm) - min(x, na.rm = na.rm))
x_scaled return(x_scaled)
}
We can now test our function:
scale_minmax(c(1, 2, NA))
## [1] NA NA NA
scale_minmax(c(1, 2, NA), na.rm = TRUE)
## [1] 0 1 NA
Repeating operations is a common task in any programming language.
The for()
loop can be a useful tool in these
situations.
Here is a simple example:
for (i in 1:3){
print(i)
}
## [1] 1
## [1] 2
## [1] 3
The basic principle of the for()
loop is that a variable
is created which will store a different value at each iteration of the
loop. In this case we are defining a variable called i
,
which will take the values 1, 2 and 3 in successive rounds of the
loop.
Often, we want to store the result of the operations done within the
loop in an object, so that we can analyse its results later on. Due to
its flexibility, we can use a list
for this purpose
(although an atomic vector may also be suitable in some cases).
For example, let’s sample 10 numbers from a normal distribution and repeat that 5 times:
# number of sampling repeats
<- 5
N
# create an empty list of length equal to the number of repeats
<- vector(mode = "list", length = N)
result
# perform our sampling N times
for (i in 1:N){
# sample from normal distribution and add to results list
<- rnorm(10, mean = 0, sd = 1)
result[[i]] }
The result is a list of length 5:
result
## [[1]]
## [1] -0.7032814 -0.7147582 1.8243861 2.1426399 -0.5198499 1.9226053
## [7] -0.4389941 -0.2235442 0.1176916 0.8642539
##
## [[2]]
## [1] -1.7220336 -0.1687946 0.6491491 -0.7460114 -1.0413941 0.3243318
## [7] -0.7257468 1.4558824 -0.2667898 0.1253643
##
## [[3]]
## [1] 0.51447597 1.53149371 -0.04340550 -0.12943045 0.63700766 0.20168534
## [7] -0.08620392 -0.06855164 1.31109949 0.48792483
##
## [[4]]
## [1] 0.31508393 -0.59964481 0.24411850 -0.70063314 -1.60712453 -0.62513205
## [7] 0.36343503 1.07721591 2.28328982 -0.06790124
##
## [[5]]
## [1] -0.90196187 0.19253706 0.85262083 -0.62637872 -0.27215829 0.15267985
## [7] -0.13196087 1.33199044 0.05465849 -1.56015841
Something to note in the code above is that we used the
vector()
function to create an empty list, whereas earlier
we had manually created a list with the list()
function.
This is a little confusing, but the thing to remember is that a list is
a type of vector - a recursive vector - and when we want to
create an empty list we have to use the vector()
function
as shown above. If we are manually creating a list with specific values,
then we can use the list()
function.
For example, here is how you could create different types of empty objects to store your results:
vector(mode = "numeric", length = 5)
- a numeric atomic
vectorvector(mode = "character", length = 5)
- a character
atomic vectorvector(mode = "logical", length = 5)
- a logical atomic
vectorvector(mode = "list", length = 5)
- a listWhen adding results to an object with a for loop, always create an empty object first with the right size for the number of iterations you will be doing. In the example above, we created an empty list of length 5, since we were repeating our operation 5 times. This will make your code run much faster.
Part I
Data about world-wide income per person between 1960 and 2010 was downloaded from the Gapminder foundation and stored as a series of files, containing the data for each country individually.
The data are stored in text files in a directory entitled “data/multiple_files”. There is one text file for each country. The format of the data in each text file is the same:
Create a function called read_gapminder
that reads in
the data from a single file and returns a data.frame object. The
function should:
read_csv()
function to find an
option that allows you to start reading a file not from the first line
but from a different line.read_lines()
function is useful to read each line
of a file and store it as a vector. You can try using it to read the
first line of the file, containing the world region.basename()
is useful to retrieve the name
of a file in a file path that contains directories.str_*()
family of functions, which can be
useful to remove text patterns from a string.Part II
Now that you have a function to read and tidy a single file, use a
for()
loop to read all the files into a list. Once you read
all the files, you may combine them into a single data.frame using the
bind_rows()
function.
list.files()
may be useful for you to
create a vector with the names of the files from the directory.Part III
In R it is not very common to encounter for loops.
This is because most R users use a functional approach
to programming where, instead of writing a for loop, they use functions
that use other functions to iterate through a list of values.
Read the section Loops
vs Functionals from the R4DS book and see if you can
replace your loop from the previous task with the map
function.
The answer to this exercise is in a separate document