We were given 193 text files, each containing data for a country; the format of each file is the same (an example is shown below).
We were asked to parse these files and combine them into a tidy tabular format.
There are many ways to solve this exercise. Here, we show a solution using several functions from the tidyverse
meta-package.
# load the package
library(tidyverse)
For any problem where we need to repeat the same task several times, it helps to break the problem down into parts. We can write down what the different steps are in words, without worrying about the code yet. Here’s an example for our problem of parsing and combining the data files:

1. Read the yearly income data from a file, skipping the header lines.
2. Extract the world region from the first line of the file.
3. Extract the country name from the file name.
4. Combine these into a table with columns country, world_region, year and income_per_person.
5. Repeat the above for every file and bind the results into a single table.
Now that we’ve broken our problem into parts, let’s tackle them in turn.
If we open one of our files in a text editor, this is what it looks like (here showing the first few lines only):
## Region:South Asia
##
## year,income_per_person
## 1960,2744
## 1961,2702
## 1962,2683
There are 3 pieces of information that we should bring together into a single tidy data frame:

- the country name, which is part of the file name;
- the world region, which is given on the first line of the file;
- the yearly income_per_person data, which make up the rest of the file.
That is, we should aim for something like this:
country | world_region | year | income_per_person |
---|---|---|---|
Afghanistan | South Asia | 1960 | 2744 |
Afghanistan | South Asia | 1961 | 2702 |
Afghanistan | South Asia | 1962 | 2683 |
Afghanistan | South Asia | 1963 | 2665 |
Afghanistan | South Asia | 1964 | 2649 |
Afghanistan | South Asia | 1965 | 2641 |
We start by focusing on a single file, and then generalise across all files.
If you look at the help of ?read_csv, you will notice that these read_* functions have an option to skip some of the lines of the file. This is very convenient, as it allows us to read the data starting from line 3, which contains the bulk of the information for our country:
# read the table starting from line 3
read_csv("data/multiple_files/Afghanistan.csv", skip = 2)
## # A tibble: 51 × 2
## year income_per_person
## <dbl> <dbl>
## 1 1960 2744
## 2 1961 2702
## 3 1962 2683
## 4 1963 2665
## 5 1964 2649
## 6 1965 2641
## 7 1966 2598
## 8 1967 2601
## 9 1968 2623
## 10 1969 2594
## # … with 41 more rows
So this solves one of our steps! (Note that base R functions such as read.table() also have an option to skip lines of a file before reading it in.)
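For example, a minimal sketch of reading the same file with base R (assuming the same comma-separated format) might look like this:

# base R alternative: skip the first 2 lines, then read the comma-separated table
read.table("data/multiple_files/Afghanistan.csv", 
           skip = 2, header = TRUE, sep = ",")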
Let’s now turn to reading the first line of the file, which contains the world region. Again, if you look at the help of the read_* functions, you will notice there is an option n_max, which determines how many lines of the file we want to read in (the default is to read all lines). We can therefore use this option to read only the first line of the file.
We will use a function called read_lines()
, which simply reads the lines of a file and stores them into a vector. This function is not often used, but can be useful when parsing data.
# Read the first line of the file
read_lines("data/multiple_files/Afghanistan.csv", n_max = 1)
## [1] "Region:South Asia"
Because we are reading the first line of the file only, the output is a vector of length 1. For our final dataset, we don’t really need to keep the string “Region:” because we will place these values in a column. Therefore, it is useful to remove that string from our value. We will use the function str_remove()
, which we can pipe from the previous line of code:
# Read first line and remove "Region:" string
read_lines("data/multiple_files/Afghanistan.csv", n_max = 1) %>%
str_remove("Region:")
## [1] "South Asia"
And that’s another problem solved!
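As an aside, a similar result could be obtained with base R’s sub(), which replaces the first occurrence of a pattern in a string:

# base R alternative: replace "Region:" with an empty string
sub("Region:", "", read_lines("data/multiple_files/Afghanistan.csv", n_max = 1))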
Finally, we want to extract the country name from the file name. So far we’ve been looking at a single file, but eventually we will create a list of file names to apply these steps to. For now, let’s manually type the file name of the first file we’ve been experimenting with:
"data/multiple_files/Afghanistan.csv"
## [1] "data/multiple_files/Afghanistan.csv"
From this file name, we want to extract the file name only (without the directory), which we can do using the basename()
function:
"data/multiple_files/Afghanistan.csv" %>% basename()
## [1] "Afghanistan.csv"
Then, we want to remove the string “.csv” from it, which we’ve already seen can be done with str_remove():
"data/multiple_files/Afghanistan.csv" %>%
basename() %>%
str_remove(".csv")
## [1] "Afghanistan"
And that’s it!
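As another aside, base R has a helper in the tools package that strips a file extension directly, which would give the same result:

# base R alternative: drop the directory, then drop the file extension
tools::file_path_sans_ext(basename("data/multiple_files/Afghanistan.csv"))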
We can wrap all these steps into our own custom function. This will be convenient when we want to apply this function to the entire list of files.
read_gapminder <- function(file){
  # Read the data in the file - starts at line 3
  country_data <- read_csv(file, skip = 2, show_col_types = FALSE)
  
  # Extract the world_region from the file - first line of the file
  # and remove the word "Region:" from it
  country_region <- read_lines(file, n_max = 1) %>% 
    str_remove("Region:")
  
  # Get the country name - from the file name
  country_name <- file %>% basename() %>% str_remove(".csv")
  
  # Add country and world region to our table
  country_data <- country_data %>% 
    mutate(country = country_name, world_region = country_region)
  
  # Return the tidy data
  return(country_data)
}
Let’s test the function:
read_gapminder("data/multiple_files/Afghanistan.csv")
## # A tibble: 51 × 4
## year income_per_person country world_region
## <dbl> <dbl> <chr> <chr>
## 1 1960 2744 Afghanistan South Asia
## 2 1961 2702 Afghanistan South Asia
## 3 1962 2683 Afghanistan South Asia
## 4 1963 2665 Afghanistan South Asia
## 5 1964 2649 Afghanistan South Asia
## 6 1965 2641 Afghanistan South Asia
## 7 1966 2598 Afghanistan South Asia
## 8 1967 2601 Afghanistan South Asia
## 9 1968 2623 Afghanistan South Asia
## 10 1969 2594 Afghanistan South Asia
## # … with 41 more rows
First let’s create a list of data files in our directory:
file_list <- list.files("data/multiple_files/", pattern = ".csv", 
                        full.names = TRUE)
head(file_list)
## [1] "data/multiple_files//Afghanistan.csv"
## [2] "data/multiple_files//Albania.csv"
## [3] "data/multiple_files//Algeria.csv"
## [4] "data/multiple_files//Andorra.csv"
## [5] "data/multiple_files//Angola.csv"
## [6] "data/multiple_files//Antigua and Barbuda.csv"
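Note that the pattern argument of list.files() is a regular expression, so to be strict we could escape the dot and anchor the pattern to the end of the file name:

# stricter pattern: only match files whose names end in ".csv"
file_list <- list.files("data/multiple_files/", pattern = "\\.csv$",
                        full.names = TRUE)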
We can now, for example, use a for loop to read all the files into a list:
# initialise an empty list with the correct length
gapminder <- vector("list", length = length(file_list))

# loop through the files and add to the list
for(i in seq_along(file_list)){
  gapminder[[i]] <- read_gapminder(file_list[i])
}
The result is now a list containing 193 data frames, one for each country.
Finally, we can “bind” those tables using the bind_rows()
function:
gapminder <- bind_rows(gapminder)
An alternative to using for loops in R is to use functions that are designed to iterate through vectors/lists. Base R has the function lapply() and the tidyverse has a similar function called map().
These take as an input a vector and a function, and will apply the function to each element of the vector.
Here is the same task as we did above with the for loop using the lapply()
function instead:
gapminder <- lapply(file_list, read_gapminder)
You can think of map()/lapply() as encapsulated for loops. The advantage of using these functions is that the code is more compact and there are fewer intermediate variables cluttering our environment (in this case, the variable i was created by the for loop, but no such variable is created by lapply() or map()).
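For example, a map() version of the task above could look like this (the result is a list, which we combine with bind_rows() as before):

# purrr::map() version: apply read_gapminder to each file, then bind the tables
gapminder <- map(file_list, read_gapminder) %>% 
  bind_rows()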
Learn more about map() and its variants in the iteration chapter of R for Data Science.