Working with Tabular Data
Overview
Teaching: 40 min
Exercises: 20 minQuestions
How to import tabular data into R?
What kind of object stores tabular data?
How to investigate the contents of such object (types of variables and missing data)?
Objectives
Import tabular data into R using the
read_csv()
function.Understand the relationship between vectors and data.frames/tibbles and distinguish between them.
Discuss the importance of and apply basic quality control checks of data.
Apply the following functions to examine data:
summary()
,str()
,unique()
,length()
Presentation of the Gapminder Dataset
Throughout this course, we will use data from the Gapminder Foundation, which gives access to global data as well as many tools to help explore it. In these lessons we’re going to use some of these data to explore some of these data ourselves.
We have two files with data relating to socio-economic statistics: world data for 2010 only and the same data for 1960 to 2010 (see the setup page for instructions on how to download the data). In this lesson we will use the 2010 data only, and then use the full dataset in future episodes.
Both of these have the same columns of data:
Column | Description |
---|---|
country | country name |
world_region | 6 world regions |
year | year that each datapoint refers to |
children_per_woman | total fertility rate |
life_expectancy | average number of years a newborn child would live if current mortality patterns were to stay the same |
income_per_person | gross domestic product per person adjusted for differences in purchasing power |
is_oecd | Whether a country belongs to the “OECD” (TRUE ) or not (FALSE ) |
income_groups | categorical classification of income groups |
population | total number of a country’s population |
main_religion | religion of the majority of population in 2008 |
child_mortality | death of children under 5 years old per 1000 births |
life_expectancy_female | life expectancy at birth, females |
life_expectancy_male | life expectancy at birth, males |
Reading Data Into R
There are several functions to read data into R, we’re going to use one from the
readr
package, which is part of the tidyverse
. As such, we first need to load
the package into R’s memory, by using the library()
function:
library(tidyverse)
This command has to be run every time you start a new R session. Tipycally you
want to include the library()
calls at the top of your script, so that a user
knows which packages need to be installed to run the analysis.
Our data is provided in CSV format (comma separated values). This format is a
regular text file, where each value (or column of the table) is separated by a
comma. To read such a file, we use the read_csv()
function, which needs at least
one input: the path of the file we want to read. It is also good practice
to explicitly define how missing data is encoded in the file with the na
option.
In our case, missing data are encoded as an empty string (imagine this as an empty
cell in a spreadsheet).
Here’s the command:
gapminder2010 <- read_csv("data/raw/gapminder2010_socioeconomic.csv", na = "")
Parsed with column specification:
cols(
country = col_character(),
world_region = col_character(),
year = col_double(),
children_per_woman = col_double(),
life_expectancy = col_double(),
income_per_person = col_double(),
is_oecd = col_logical(),
income_groups = col_character(),
population = col_double(),
main_religion = col_character(),
child_mortality = col_double(),
life_expectancy_female = col_character(),
life_expectancy_male = col_double()
)
We see a lot of output there, but this is not an error! It’s a message that read_csv()
prints to inform us of what type of data it thinks each column of the dataset is.
We’ll discuss this in a while.
For now, let’s investigate what kind of object the read_csv()
function returns.
Exercise
Try and answer these questions about
gapminder2010
:
- What type of object is it? (hint:
class()
)- How many rows and columns does it have? (hint:
ncol()
,nrow()
)- What type of values does each column have? (hint:
str()
)- Inspect the range of values in the numeric columns using
summary()
.In all these inspections, can you spot any suspicious problems with the data?
Solution
A1. This object has several classes, variants of ‘tbl’ and ‘data.frame’:
class(gapminder2010)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
A2. It has 193 rows and 19 columns
nrow(gapminder2010)
[1] 193
ncol(gapminder2010)
[1] 13
A3.
str()
gives us the structure of the object, which provides information about each column type as well as the class and dimensions of the data frame:str(gapminder2010)
tibble [193 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame) $ country : chr [1:193] "Afghanistan" "Angola" "Albania" "Andorra" ... $ world_region : chr [1:193] "south_asia" "sub_saharan_africa" "europe_central_asia" "europe_central_asia" ... $ year : num [1:193] 2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ... $ children_per_woman : num [1:193] 5.82 6.16 1.65 NA 1.87 2.37 1.55 2.13 1.93 1.44 ... $ life_expectancy : num [1:193] 59.9 59.9 77.6 82.3 72.9 ... $ income_per_person : num [1:193] 1672 6360 9928 38982 55363 ... $ is_oecd : logi [1:193] FALSE FALSE FALSE FALSE FALSE FALSE ... $ income_groups : chr [1:193] "low_income" "lower_middle_income" "upper_middle_income" "high_income" ... $ population : num [1:193] 29185511 23356247 2948029 NA 8549998 ... $ main_religion : chr [1:193] "muslim" "christian" "muslim" "christian" ... $ child_mortality : num [1:193] 87.95 120.49 13.3 4.18 8.48 ... $ life_expectancy_female: chr [1:193] "62.459" "58.033" "79.302" "-" ... $ life_expectancy_male : num [1:193] 59.7 52.8 74.1 -999 75.6 ... - attr(*, "spec")= .. cols( .. country = col_character(), .. world_region = col_character(), .. year = col_double(), .. children_per_woman = col_double(), .. life_expectancy = col_double(), .. income_per_person = col_double(), .. is_oecd = col_logical(), .. income_groups = col_character(), .. population = col_double(), .. main_religion = col_character(), .. child_mortality = col_double(), .. life_expectancy_female = col_character(), .. life_expectancy_male = col_double() .. )
A4. The summary function gives us basic summary statistics for each numeric column and indicates which columns are of “character” type.
summary(gapminder2010)
country world_region year children_per_woman Length:193 Length:193 Min. :2010 Min. :1.190 Class :character Class :character 1st Qu.:2010 1st Qu.:1.810 Mode :character Mode :character Median :2010 Median :2.475 Mean :2010 Mean :2.995 3rd Qu.:2010 3rd Qu.:4.043 Max. :2010 Max. :7.490 NA's :9 life_expectancy income_per_person is_oecd income_groups Min. :32.54 Min. : 614 Mode :logical Length:193 1st Qu.:63.88 1st Qu.: 3338 FALSE:162 Class :character Median :73.31 Median : 9928 TRUE :31 Mode :character Mean :70.47 Mean : 16700 3rd Qu.:76.97 3rd Qu.: 22405 Max. :83.19 Max. :119974 NA's :6 population main_religion child_mortality life_expectancy_female Min. :8.803e+04 Length:193 Min. : 2.62 Length:193 1st Qu.:2.570e+06 Class :character 1st Qu.: 8.70 Class :character Median :8.613e+06 Mode :character Median : 20.23 Mode :character Mean :3.760e+07 Mean : 38.19 3rd Qu.:2.534e+07 3rd Qu.: 58.23 Max. :1.369e+09 Max. :208.62 NA's :9 life_expectancy_male Min. :-999.00 1st Qu.: 60.10 Median : 68.18 Mean : 17.61 3rd Qu.: 73.42 Max. : 80.08
From this inspection, we can see that the columns named
life_expectancy_female
has been imported as “character”, although is should probably be numeric. Also, the minimum value in thelife_expectancy_male
column is -999, which should not be possible (only positive values should be allowed).
Reading data: many functions available
The
readr
package has several functions to read data in different formats. Some common ones are:
read_csv()
- for comma separated valuesread_tsv()
- for tab separated valuesread_csv2()
- for CSV files exported from non-english spreadsheet programs that use the semi-colon;
as a separator and a comma,
as the decimal place.read_table()
- to read data where each column is separated by one or more spaces.read_delim()
- a flexible function that allows you to define your own delimiter.These functions have equivalents in base R (the default installation of R), which you can also use. They are very similarly named, for example:
read.csv()
andread.table()
(notice the.
instead of_
in the function name). However, they have different default options, so pay attention to which one you use!
The data.frame
Object
A data.frame is the basic type of object that stores tabular data.
The readr
package reads data in an “extended” version of a data frame that it
calls tibble (tbl
for short). The details of their differences are not very
important unless you are a programmer, but tibbles offer some user conveniences
such as a better printing method. For the rest of the course we’ll refer to
“data frames” and “tibbles” interchangeably.
A technical point that is worth being aware of is that a data frame is a list of vectors of the same length. In other words, we can manipulate each column in a data frame just as we’ve learned for vectors in the previous episode.
To return the values of a column as a vector, we can use the $
accessor.
For example try:
gapminder2010$country
Exercise
Using the
$
accessor and some functions that work on vectors, try and answer the following questions:
- What are the distinct values present in the
world_region
column? (hint:unique()
)- How many distinct regions are there? (hint: combine
length()
andunique()
)- How many countries are there in each of those regions? (hint:
table()
)- Do this same exploratory analysis on the
main_religion
column and note any problems you may notice.Answer
A1.
unique(gapminder2010$world_region)
[1] "south_asia" "sub_saharan_africa" [3] "europe_central_asia" "middle_east_north_africa" [5] "america" "east_asia_pacific"
A2. We could have “manually” counted the categories from the previous answer, but to do it programatically, we can do:
length(unique(gapminder2010$world_region))
[1] 6
A3. The table function counts how many of each unique value there is in a vector.
table(gapminder2010$world_region)
america east_asia_pacific europe_central_asia 35 30 52 middle_east_north_africa south_asia sub_saharan_africa 20 8 48
A4. From looking at the unique values:
unique(gapminder2010$main_religion)
[1] "muslim" "christian" "Muslim" [4] NA "eastern religions" "Christian" [7] "eastern religions"
We can see that:
- “muslim” and “christian” are sometimes uppercase and other times lowercase
- there is a value of “eastern religions” that seems to have extra spaces between the two words.
- There are some missing values in this column
By tabulating these values we can see the categories with the lowest counts must be the ones with a mistake in them.
table(gapminder2010$main_religion)
christian Christian eastern religions eastern religions 107 12 1 11 muslim Muslim 41 4
Because each column in a data frame is a vector, this also means that columns of
a data frame can only contain data of one type (e.g. all character or all numeric).
The consequence of this is that when you read data, the read_csv()
function will try to guess what data type each column is. And if it sees numbers
mixed with characters it will consider the column to be character (this is the
value coercion that we’ve discussed about in the
previous episode).
We can see this feature affected the values in the life_expectancy_female
column,
which are of type “character”, when we would expect them to be “numeric”:
class(gapminder2010$life_expectancy_female)
[1] "character"
If you print the contents of that column, you will notice that some values are “-“, which are probably missing values that were encoded with that symbol rather than with an empty value. We will learn how to fix this issue in a future episode.
Subsetting Rows and Columns with [, ]
Similarly to what we’ve seen in the
previous episode
for vectors, we can access values inside a data.frame
using the square-bracket construct.
The difference is that now we have two dimensions: rows and columns.
Therefore, the syntax is slightly different, because we have to specificy these two
dimensions as [rows, columns]
.
For example, try running these:
# access the first row and columns 3 to 6
gapminder2010[1, 3:6]
# access the first 5 rows and all columns
gapminder2010[1:5, ]
# access the first 2 columns and all rows
gapminder2010[ , 1:2]
You can also access columns by name, for example:
gapminder2010[, c("country", "income_per_person")]
Data tip: quality control checks
Whenever you read data into R, it’s always good practice to do some quality checks. Here’s a list of things to lookout for:
- Do you have the expected number of rows and columns?
- Are your variables (columns) of the expected type? (e.g. numeric, character)
- Is the range of numeric data within expected boundaries? For example: a column with months should go from 1 to 12; a column with human heights in cm should not have values below 30 or so; etc…
- Do you have the expected number of unique values in categorical (character) variables?
- Do you have missing values in the data, and were these imported correctly?
In R, you can answer many of these questions with the help of the following functions:
str()
,summary()
,length()
+unique()
,nrow()
,ncol()
,is.na()
.There are also some R packages that can help making these diagnostic analysis easier. One good one is the
skimr
package (which has a good introduction document. If you use theskim()
function with yourdata.frame
it will give you a tabular summary that will help you answer all these questions in one go!
Key Points
Use
library()
to load a library into R. You need to do this every time you start a new R session.Read data using the
read_*()
family of functions (read_csv()
andread_tsv()
are two common types for comma- and tab-delimited values, respectively).In R tabular data is stored in a
data.frame
object.Columns in a
data.frame
are vectors. Therefore, adata.frame
is a list of vectors of the same length.A vector can only contain data of one type (e.g. all numeric, or all character). Therefore, each column of a
data.frame
can only be of one type also (although different columns may be of different types).