Data Structures

The most fundamental data structures in R are:

Homogeneous Heterogeneous
1-d atomic vector list
2-d matrix data.frame
n-d array

(source)

Atomic Vectors

Atomic vectors are often simply referred to as vectors. They are one-dimensional and homogeneous (can only contain values of one type).

There are four common types of vector:

# Integer
x_int <- c(1L, 10L)
class(x_int)
## [1] "integer"
# Numeric with double-precision digits
x_dbl <- c(1, 2.1, 5)
class(x_dbl)
## [1] "numeric"
# Character
x_chr <- c("cat", "dog", "fish")
class(x_chr)
## [1] "character"
# Logical
x_lgl <- c(TRUE, FALSE, TRUE)
class(x_lgl)
## [1] "logical"

Note that, by default, when creating a vector of numbers, unless we use the “L” suffix after the number, R will create a numeric vector. (Why “L” as the suffix? The answer isn’t clear.)

Because vectors are homogeneous, coercion occurs when there are mixed types:

# Coerced to character
x_mixed <- c("a", 1, 5)
class(x_mixed)
## [1] "character"

Matrix

A matrix is a 2-dimensional object (rows and columns), and like atomic vectors it’s an homogenous object (can only contain values of one type).

x_mat <- matrix(1:12, nrow = 3, ncol = 4)
x_mat
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

Subset can be with with square-bracket notation (x[rows, columns]):

x_mat[1:2, 2:3]
##      [,1] [,2]
## [1,]    4    7
## [2,]    5    8

Matrix objects often have column and row names:

rownames(x_mat) <- c("gene1", "gene2", "gene3")
colnames(x_mat) <- c("sample1", "sample2", "sample3", "sample4")
x_mat
##       sample1 sample2 sample3 sample4
## gene1       1       4       7      10
## gene2       2       5       8      11
## gene3       3       6       9      12

A common example of the use of matrix objects in bioinformatics is in processed data from RNA sequencing. These data are often stored as a genes-by-samples matrix of counts (quantifying the sequencing reads that aligned to each gene in each sample).

Matrix Statistics

There are several functions optimised to calculate common summary statistics across rows or columns of matrix objects. Some are part of base R, others from the matrixStats package.

library(matrixStats)
# row summaries
rowSums(x_mat)  # sum
rowMeans(x_mat) # mean
rowSds(x_mat)   # standard deviation
rowVars(x_mat)  # variance
rowIQRs(x_mat)  # inter-quartile range
rowMads(x_mat)  # mean absolute deviation

# column (sample) summaries
colSums(x_mat)  # sum
colMeans(x_mat) # mean
colSds(x_mat)   # standard deviation
colVars(x_mat)  # variance
colIQRs(x_mat)  # inter-quartile range
colMads(x_mat)  # mean absolute deviation

Matrix Operations

By default, matrix operations such as +, -, * and / are done element-wise (or in a vectorised manner). That is, when we take our matrix from above:

##       sample1 sample2 sample3 sample4
## gene1       1       4       7      10
## gene2       2       5       8      11
## gene3       3       6       9      12

And multiply it by a vector of length 4:

x_mat * c(1, 10, 100, 1000)
##       sample1 sample2 sample3 sample4
## gene1       1    4000     700     100
## gene2      20       5    8000    1100
## gene3     300      60       9   12000

R will go element-by-element through the columns of the matrix and multiply the values “recyling” the second vector as many times as it needs:

1 x 1 4 x 1000 7 x 100
2 x 10 5 x 1 8 x 1000
3 x 100 6 x 10 9 x 1

If what we want is to do matrix multiplication then we have to use the operator %*%:

x_mat %*% c(1, 10, 100, 1000)
##        [,1]
## gene1 10741
## gene2 11852
## gene3 12963

Another common task when manipulating matrices is to transpose them. For this, we can use the t() function:

t(x_mat)
##         gene1 gene2 gene3
## sample1     1     2     3
## sample2     4     5     6
## sample3     7     8     9
## sample4    10    11    12

Arrays

Arrays are an N-dimensional generalisation of a matrix, and therefore are also homogeneous (can only contain values of one type).

x_arr <- array(1:24, dim = c(3, 4, 2))
x_arr
## , , 1
## 
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
## 
## , , 2
## 
##      [,1] [,2] [,3] [,4]
## [1,]   13   16   19   22
## [2,]   14   17   20   23
## [3,]   15   18   21   24

A common example of the use of arrays is to store image data (e.g. red, blue and green pixel intensity channels):

Lists

Lists are one of the more versatile types of object in R. The elements of a list can be of any type (even other lists), therefore they are heterogeneous objects. They are also referred to as a recursive vector.

x_lst <- list(a = 1:3,
              b = data.frame(x = 0:3, y = 10:13))
x_lst
## $a
## [1] 1 2 3
## 
## $b
##   x  y
## 1 0 10
## 2 1 11
## 3 2 12
## 4 3 13

Access elements in a list either by name (using $) or by position (using [[):

x_lst$a
## [1] 1 2 3
x_lst[[1]]
## [1] 1 2 3

Note that x[1] would return a list with one element whereas x[[1]] “pulls” the actual value from the list and, in this example, returns a vector.

class(x_lst[1])
## [1] "list"
class(x_lst[[1]])
## [1] "integer"

Many functions in R return objects that are, in essence, a list. For example, the lm() function, used to fit linear models, returns a list object (although the object has class “lm”):

lm_example <- lm(dist ~ speed, data = cars)
class(lm_example)
## [1] "lm"
str(lm_example) # structure of the object
## List of 12
##  $ coefficients : Named num [1:2] -17.58 3.93
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "speed"
##  $ residuals    : Named num [1:50] 3.85 11.85 -5.95 12.05 2.12 ...
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:50] -303.914 145.552 -8.115 9.885 0.194 ...
##   ..- attr(*, "names")= chr [1:50] "(Intercept)" "speed" "" "" ...
##  $ rank         : int 2
##  $ fitted.values: Named num [1:50] -1.85 -1.85 9.95 9.95 13.88 ...
##   ..- attr(*, "names")= chr [1:50] "1" "2" "3" "4" ...
##  $ assign       : int [1:2] 0 1
##  $ qr           :List of 5
##   ..$ qr   : num [1:50, 1:2] -7.071 0.141 0.141 0.141 0.141 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:50] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "speed"
##   .. ..- attr(*, "assign")= int [1:2] 0 1
##   ..$ qraux: num [1:2] 1.14 1.27
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-07
##   ..$ rank : int 2
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 48
##  $ xlevels      : Named list()
##  $ call         : language lm(formula = dist ~ speed, data = cars)
##  $ terms        :Classes 'terms', 'formula'  language dist ~ speed
##   .. ..- attr(*, "variables")= language list(dist, speed)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "dist" "speed"
##   .. .. .. ..$ : chr "speed"
##   .. ..- attr(*, "term.labels")= chr "speed"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(dist, speed)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
##  $ model        :'data.frame':   50 obs. of  2 variables:
##   ..$ dist : num [1:50] 2 10 4 22 16 10 18 26 34 17 ...
##   ..$ speed: num [1:50] 4 4 7 7 8 9 10 10 10 11 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language dist ~ speed
##   .. .. ..- attr(*, "variables")= language list(dist, speed)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "dist" "speed"
##   .. .. .. .. ..$ : chr "speed"
##   .. .. ..- attr(*, "term.labels")= chr "speed"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(dist, speed)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "dist" "speed"
##  - attr(*, "class")= chr "lm"

Data Frames

The data.frame object is the most common object to represent tabular data (usually imported using read.csv() and related functions).

It is a 2-dimensional object but, unlike matrices, it is heterogeneous (each column can be of a different type). The data.frame object is in fact a type of list, where the elements (columns) are all of the same length.

Similarly to a list, we can extract elements by name or position:

cars$speed
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25
cars[[1]]
##  [1]  4  4  7  7  8  9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15
## [26] 15 16 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

Packages such as dplyr (and the tidyverse more generally) can be used to manipulate data.frame objects.

Functions

When repeating a task, it can be useful to write our own custom functions.

Here is the basic syntax:

my_function <- function(arguments){
  # code for what the function does
  # return value
}

For example, let’s define a function that makes min-max scaling of a vector:

\(x' = \frac{x - min(x)}{max(x) - min(x)}\)

scale_minmax <- function(x){
  x_scaled <- (x - min(x))/(max(x) - min(x))
  return(x_scaled)
}

And test this function on a vector of numbers:

scale_minmax(1:10)
##  [1] 0.0000000 0.1111111 0.2222222 0.3333333 0.4444444 0.5555556 0.6666667
##  [8] 0.7777778 0.8888889 1.0000000

Including checks in your functions

When writing your own functions, it is a good practice to include as many checks as possible, to avoid possible unintended errors or (even worse) wrong answers.

Using if/else statements can be very useful in these circumstances. For example, here is how we could improve our previous function:

scale_minmax <- function(x){
  # check if x is numeric
  if(!is.numeric(x)){
    stop("x must be numeric.")
  }
  # check if there are infinite values
  if(any(is.infinite(x))){
    warning("Infinite values present in the vector. This will result in NaN values.")
  }
  
  x_scaled <- (x - min(x))/(max(x) - min(x))
  return(x_scaled)
}

We should now get a warning when there are infinite values, or an error if the vector is not character:

scale_minmax(c(1, 2, Inf))
## Warning in scale_minmax(c(1, 2, Inf)): Infinite values present in the vector.
## This will result in NaN values.
## [1]   0   0 NaN

Add a new argument to this option, to enable the user to ignore missing values when scaling the data.

Answer

Following the convention of other functions such as mean(), min(), max(), etc., we can add an argument called na.rm as follows:

scale_minmax <- function(x, na.rm = FALSE){
  # check if x is numeric
  if(!is.numeric(x)){
    stop("x must be numeric.")
  }
  # check if there are infinite values
  if(any(is.infinite(x))){
    warning("Infinite values present in the vector. This will result in NaN values.")
  }

  x_scaled <- (x - min(x, na.rm = na.rm))/(max(x, na.rm = na.rm) - min(x, na.rm = na.rm))
  return(x_scaled)
}

We can now test our function:

scale_minmax(c(1, 2, NA))
## [1] NA NA NA
scale_minmax(c(1, 2, NA), na.rm = TRUE)
## [1]  0  1 NA

For loops

Repeating operations is a common task in any programming language. The for() loop can be a useful tool in these situations.

Here is a simple example:

for (i in 1:3){
  print(i)
}
## [1] 1
## [1] 2
## [1] 3

The basic principle of the for() loop is that a variable is created which will store a different value at each iteration of the loop. In this case we are defining a variable called i, which will take the values 1, 2 and 3 in successive rounds of the loop.

Often, we want to store the result of the operations done within the loop in an object, so that we can analyse its results later on. Due to its flexibility, we can use a list for this purpose (although an atomic vector may also be suitable in some cases).

For example, let’s sample 10 numbers from a normal distribution and repeat that 5 times:

# number of sampling repeats
N <- 5

# create an empty list of length equal to the number of repeats
result <- vector(mode = "list", length = N)

# perform our sampling N times
for (i in 1:N){
  # sample from normal distribution and add to results list
  result[[i]] <- rnorm(10, mean = 0, sd = 1)
}

The result is a list of length 5:

result
## [[1]]
##  [1] -0.7032814 -0.7147582  1.8243861  2.1426399 -0.5198499  1.9226053
##  [7] -0.4389941 -0.2235442  0.1176916  0.8642539
## 
## [[2]]
##  [1] -1.7220336 -0.1687946  0.6491491 -0.7460114 -1.0413941  0.3243318
##  [7] -0.7257468  1.4558824 -0.2667898  0.1253643
## 
## [[3]]
##  [1]  0.51447597  1.53149371 -0.04340550 -0.12943045  0.63700766  0.20168534
##  [7] -0.08620392 -0.06855164  1.31109949  0.48792483
## 
## [[4]]
##  [1]  0.31508393 -0.59964481  0.24411850 -0.70063314 -1.60712453 -0.62513205
##  [7]  0.36343503  1.07721591  2.28328982 -0.06790124
## 
## [[5]]
##  [1] -0.90196187  0.19253706  0.85262083 -0.62637872 -0.27215829  0.15267985
##  [7] -0.13196087  1.33199044  0.05465849 -1.56015841

Something to note in the code above is that we used the vector() function to create an empty list, whereas earlier we had manually created a list with the list() function. This is a little confusing, but the thing to remember is that a list is a type of vector - a recursive vector - and when we want to create an empty list we have to use the vector() function as shown above. If we are manually creating a list with specific values, then we can use the list() function.

For example, here is how you could create different types of empty objects to store your results:

  • vector(mode = "numeric", length = 5) - a numeric atomic vector
  • vector(mode = "character", length = 5) - a character atomic vector
  • vector(mode = "logical", length = 5) - a logical atomic vector
  • vector(mode = "list", length = 5) - a list

When adding results to an object with a for loop, always create an empty object first with the right size for the number of iterations you will be doing. In the example above, we created an empty list of length 5, since we were repeating our operation 5 times. This will make your code run much faster.

Part I

Data about world-wide income per person between 1960 and 2010 was downloaded from the Gapminder foundation and stored as a series of files, containing the data for each country individually.

The data are stored in text files in a directory entitled “data/multiple_files”. There is one text file for each country. The format of the data in each text file is the same:

  • The name of the file gives the country name
  • The first line gives the world region they occur in
  • The rest of the data starts at line 3 of the file, and is in a comma-separated format

Create a function called read_gapminder that reads in the data from a single file and returns a data.frame object. The function should:

  • Accept a single argument (a character string of the file location)
  • Get the main data, stored in line 3 ownwards of the file
  • Extract the country from the filename and store this in a new column of the final data.frame
  • Get the world region from the first line of the file and store this in a new column of the final data.frame
Hints
  • Look at the help of the read_csv() function to find an option that allows you to start reading a file not from the first line but from a different line.
  • The read_lines() function is useful to read each line of a file and store it as a vector. You can try using it to read the first line of the file, containing the world region.
  • The function basename() is useful to retrieve the name of a file in a file path that contains directories.
  • Remember the str_*() family of functions, which can be useful to remove text patterns from a string.

Part II

Now that you have a function to read and tidy a single file, use a for() loop to read all the files into a list. Once you read all the files, you may combine them into a single data.frame using the bind_rows() function.

Hints
  • The function list.files() may be useful for you to create a vector with the names of the files from the directory.
  • Remember to create an empty list before starting your for loop. Make sure you create a list with the same length as the length of the vector of file names.

Part III

In R it is not very common to encounter for loops. This is because most R users use a functional approach to programming where, instead of writing a for loop, they use functions that use other functions to iterate through a list of values.
Read the section Loops vs Functionals from the R4DS book and see if you can replace your loop from the previous task with the map function.

Answer

The answer to this exercise is in a separate document