This lesson is still being designed and assembled (Pre-Alpha version)

Working with Tabular Data

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How to import tabular data into R?

  • What kind of object stores tabular data?

  • How to investigate the contents of such object (types of variables and missing data)?

Objectives
  • Import tabular data into R using the read_csv() function.

  • Understand the relationship between vectors and data.frames/tibbles and distinguish between them.

  • Discuss the importance of and apply basic quality control checks of data.

  • Apply the following functions to examine data: summary(), str(), unique(), length()

Presentation of the Gapminder Dataset

Throughout this course, we will use data from the Gapminder Foundation, which gives access to global data as well as many tools to help explore it. In these lessons we’re going to use some of these data to explore some of these data ourselves.

We have two files with data relating to socio-economic statistics: world data for 2010 only and the same data for 1960 to 2010 (see the setup page for instructions on how to download the data). In this lesson we will use the 2010 data only, and then use the full dataset in future episodes.

Both of these have the same columns of data:

Column Description
country country name
world_region 6 world regions
year year that each datapoint refers to
children_per_woman total fertility rate
life_expectancy average number of years a newborn child would live if current mortality patterns were to stay the same
income_per_person gross domestic product per person adjusted for differences in purchasing power
is_oecd Whether a country belongs to the “OECD” (TRUE) or not (FALSE)
income_groups categorical classification of income groups
population total number of a country’s population
main_religion religion of the majority of population in 2008
child_mortality death of children under 5 years old per 1000 births
life_expectancy_female life expectancy at birth, females
life_expectancy_male life expectancy at birth, males

Reading Data Into R

There are several functions to read data into R, we’re going to use one from the readr package, which is part of the tidyverse. As such, we first need to load the package into R’s memory, by using the library() function:

library(tidyverse)

This command has to be run every time you start a new R session. Tipycally you want to include the library() calls at the top of your script, so that a user knows which packages need to be installed to run the analysis.

Our data is provided in CSV format (comma separated values). This format is a regular text file, where each value (or column of the table) is separated by a comma. To read such a file, we use the read_csv() function, which needs at least one input: the path of the file we want to read. It is also good practice to explicitly define how missing data is encoded in the file with the na option. In our case, missing data are encoded as an empty string (imagine this as an empty cell in a spreadsheet).

Here’s the command:

gapminder2010 <- read_csv("data/raw/gapminder2010_socioeconomic.csv", na = "")
Parsed with column specification:
cols(
  country = col_character(),
  world_region = col_character(),
  year = col_double(),
  children_per_woman = col_double(),
  life_expectancy = col_double(),
  income_per_person = col_double(),
  is_oecd = col_logical(),
  income_groups = col_character(),
  population = col_double(),
  main_religion = col_character(),
  child_mortality = col_double(),
  life_expectancy_female = col_character(),
  life_expectancy_male = col_double()
)

We see a lot of output there, but this is not an error! It’s a message that read_csv() prints to inform us of what type of data it thinks each column of the dataset is. We’ll discuss this in a while.

For now, let’s investigate what kind of object the read_csv() function returns.

Exercise

Try and answer these questions about gapminder2010:

  1. What type of object is it? (hint: class())
  2. How many rows and columns does it have? (hint: ncol(), nrow())
  3. What type of values does each column have? (hint: str())
  4. Inspect the range of values in the numeric columns using summary().

In all these inspections, can you spot any suspicious problems with the data?

Solution

A1. This object has several classes, variants of ‘tbl’ and ‘data.frame’:

class(gapminder2010)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 

A2. It has 193 rows and 19 columns

nrow(gapminder2010)
[1] 193
ncol(gapminder2010)
[1] 13

A3. str() gives us the structure of the object, which provides information about each column type as well as the class and dimensions of the data frame:

str(gapminder2010)
tibble [193 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ country               : chr [1:193] "Afghanistan" "Angola" "Albania" "Andorra" ...
 $ world_region          : chr [1:193] "south_asia" "sub_saharan_africa" "europe_central_asia" "europe_central_asia" ...
 $ year                  : num [1:193] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
 $ children_per_woman    : num [1:193] 5.82 6.16 1.65 NA 1.87 2.37 1.55 2.13 1.93 1.44 ...
 $ life_expectancy       : num [1:193] 59.9 59.9 77.6 82.3 72.9 ...
 $ income_per_person     : num [1:193] 1672 6360 9928 38982 55363 ...
 $ is_oecd               : logi [1:193] FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ income_groups         : chr [1:193] "low_income" "lower_middle_income" "upper_middle_income" "high_income" ...
 $ population            : num [1:193] 29185511 23356247 2948029 NA 8549998 ...
 $ main_religion         : chr [1:193] "muslim" "christian" "muslim" "christian" ...
 $ child_mortality       : num [1:193] 87.95 120.49 13.3 4.18 8.48 ...
 $ life_expectancy_female: chr [1:193] "62.459" "58.033" "79.302" "-" ...
 $ life_expectancy_male  : num [1:193] 59.7 52.8 74.1 -999 75.6 ...
 - attr(*, "spec")=
  .. cols(
  ..   country = col_character(),
  ..   world_region = col_character(),
  ..   year = col_double(),
  ..   children_per_woman = col_double(),
  ..   life_expectancy = col_double(),
  ..   income_per_person = col_double(),
  ..   is_oecd = col_logical(),
  ..   income_groups = col_character(),
  ..   population = col_double(),
  ..   main_religion = col_character(),
  ..   child_mortality = col_double(),
  ..   life_expectancy_female = col_character(),
  ..   life_expectancy_male = col_double()
  .. )

A4. The summary function gives us basic summary statistics for each numeric column and indicates which columns are of “character” type.

summary(gapminder2010)
   country          world_region            year      children_per_woman
 Length:193         Length:193         Min.   :2010   Min.   :1.190     
 Class :character   Class :character   1st Qu.:2010   1st Qu.:1.810     
 Mode  :character   Mode  :character   Median :2010   Median :2.475     
                                       Mean   :2010   Mean   :2.995     
                                       3rd Qu.:2010   3rd Qu.:4.043     
                                       Max.   :2010   Max.   :7.490     
                                                      NA's   :9         
 life_expectancy income_per_person  is_oecd        income_groups     
 Min.   :32.54   Min.   :   614    Mode :logical   Length:193        
 1st Qu.:63.88   1st Qu.:  3338    FALSE:162       Class :character  
 Median :73.31   Median :  9928    TRUE :31        Mode  :character  
 Mean   :70.47   Mean   : 16700                                      
 3rd Qu.:76.97   3rd Qu.: 22405                                      
 Max.   :83.19   Max.   :119974                                      
 NA's   :6                                                           
   population        main_religion      child_mortality  life_expectancy_female
 Min.   :8.803e+04   Length:193         Min.   :  2.62   Length:193            
 1st Qu.:2.570e+06   Class :character   1st Qu.:  8.70   Class :character      
 Median :8.613e+06   Mode  :character   Median : 20.23   Mode  :character      
 Mean   :3.760e+07                      Mean   : 38.19                         
 3rd Qu.:2.534e+07                      3rd Qu.: 58.23                         
 Max.   :1.369e+09                      Max.   :208.62                         
 NA's   :9                                                                     
 life_expectancy_male
 Min.   :-999.00     
 1st Qu.:  60.10     
 Median :  68.18     
 Mean   :  17.61     
 3rd Qu.:  73.42     
 Max.   :  80.08     
                     

From this inspection, we can see that the columns named life_expectancy_female has been imported as “character”, although is should probably be numeric. Also, the minimum value in the life_expectancy_male column is -999, which should not be possible (only positive values should be allowed).

Reading data: many functions available

The readr package has several functions to read data in different formats. Some common ones are:

  • read_csv() - for comma separated values
  • read_tsv() - for tab separated values
  • read_csv2() - for CSV files exported from non-english spreadsheet programs that use the semi-colon ; as a separator and a comma , as the decimal place.
  • read_table() - to read data where each column is separated by one or more spaces.
  • read_delim() - a flexible function that allows you to define your own delimiter.

These functions have equivalents in base R (the default installation of R), which you can also use. They are very similarly named, for example: read.csv() and read.table() (notice the . instead of _ in the function name). However, they have different default options, so pay attention to which one you use!

The data.frame Object

A data.frame is the basic type of object that stores tabular data. The readr package reads data in an “extended” version of a data frame that it calls tibble (tbl for short). The details of their differences are not very important unless you are a programmer, but tibbles offer some user conveniences such as a better printing method. For the rest of the course we’ll refer to “data frames” and “tibbles” interchangeably.

A technical point that is worth being aware of is that a data frame is a list of vectors of the same length. In other words, we can manipulate each column in a data frame just as we’ve learned for vectors in the previous episode.

To return the values of a column as a vector, we can use the $ accessor. For example try:

gapminder2010$country

Exercise

Using the $ accessor and some functions that work on vectors, try and answer the following questions:

  1. What are the distinct values present in the world_region column? (hint: unique())
  2. How many distinct regions are there? (hint: combine length() and unique())
  3. How many countries are there in each of those regions? (hint: table())
  4. Do this same exploratory analysis on the main_religion column and note any problems you may notice.

Answer

A1.

unique(gapminder2010$world_region)
[1] "south_asia"               "sub_saharan_africa"      
[3] "europe_central_asia"      "middle_east_north_africa"
[5] "america"                  "east_asia_pacific"       

A2. We could have “manually” counted the categories from the previous answer, but to do it programatically, we can do:

length(unique(gapminder2010$world_region))
[1] 6

A3. The table function counts how many of each unique value there is in a vector.

table(gapminder2010$world_region)

                 america        east_asia_pacific      europe_central_asia 
                      35                       30                       52 
middle_east_north_africa               south_asia       sub_saharan_africa 
                      20                        8                       48 

A4. From looking at the unique values:

unique(gapminder2010$main_religion)
[1] "muslim"              "christian"           "Muslim"             
[4] NA                    "eastern religions"   "Christian"          
[7] "eastern   religions"

We can see that:

  • “muslim” and “christian” are sometimes uppercase and other times lowercase
  • there is a value of “eastern religions” that seems to have extra spaces between the two words.
  • There are some missing values in this column

By tabulating these values we can see the categories with the lowest counts must be the ones with a mistake in them.

table(gapminder2010$main_religion)

          christian           Christian eastern   religions   eastern religions 
                107                  12                   1                  11 
             muslim              Muslim 
                 41                   4 

Because each column in a data frame is a vector, this also means that columns of a data frame can only contain data of one type (e.g. all character or all numeric). The consequence of this is that when you read data, the read_csv() function will try to guess what data type each column is. And if it sees numbers mixed with characters it will consider the column to be character (this is the value coercion that we’ve discussed about in the previous episode).

We can see this feature affected the values in the life_expectancy_female column, which are of type “character”, when we would expect them to be “numeric”:

class(gapminder2010$life_expectancy_female)
[1] "character"

If you print the contents of that column, you will notice that some values are “-“, which are probably missing values that were encoded with that symbol rather than with an empty value. We will learn how to fix this issue in a future episode.

Subsetting Rows and Columns with [, ]

Similarly to what we’ve seen in the previous episode for vectors, we can access values inside a data.frame using the square-bracket construct. The difference is that now we have two dimensions: rows and columns. Therefore, the syntax is slightly different, because we have to specificy these two dimensions as [rows, columns]. For example, try running these:

# access the first row and columns 3 to 6
gapminder2010[1, 3:6]

# access the first 5 rows and all columns
gapminder2010[1:5, ]

# access the first 2 columns and all rows
gapminder2010[ , 1:2]

You can also access columns by name, for example:

gapminder2010[, c("country", "income_per_person")]

Data tip: quality control checks

Whenever you read data into R, it’s always good practice to do some quality checks. Here’s a list of things to lookout for:

  • Do you have the expected number of rows and columns?
  • Are your variables (columns) of the expected type? (e.g. numeric, character)
  • Is the range of numeric data within expected boundaries? For example: a column with months should go from 1 to 12; a column with human heights in cm should not have values below 30 or so; etc…
  • Do you have the expected number of unique values in categorical (character) variables?
  • Do you have missing values in the data, and were these imported correctly?

In R, you can answer many of these questions with the help of the following functions: str(), summary(), length() + unique(), nrow(), ncol(), is.na().

There are also some R packages that can help making these diagnostic analysis easier. One good one is the skimr package (which has a good introduction document. If you use the skim() function with your data.frame it will give you a tabular summary that will help you answer all these questions in one go!

Key Points

  • Use library() to load a library into R. You need to do this every time you start a new R session.

  • Read data using the read_*() family of functions (read_csv() and read_tsv() are two common types for comma- and tab-delimited values, respectively).

  • In R tabular data is stored in a data.frame object.

  • Columns in a data.frame are vectors. Therefore, a data.frame is a list of vectors of the same length.

  • A vector can only contain data of one type (e.g. all numeric, or all character). Therefore, each column of a data.frame can only be of one type also (although different columns may be of different types).