Data reshaping: from wide to long and back

Overview

Teaching: 20 min
Exercises: 15 min

Questions

How to change the shape of a table from a ‘wide’ to a ‘long’ format?

When is one or the other format more suitable for analysis?

Objectives

Apply the pivot_wider() and pivot_longer() functions to reshape data frames.

Recognise some cases when using a wide or long format is desirable.

In this lesson we’re going to learn how to use functions from the tidyr package (part of tidyverse) to help us change the shape of our data to suit our needs.

As usual when starting an analysis on a new script, let’s start by loading the packages and reading the data. In this case, let’s use the clean dataset that we created in the last exercise of our previous episode.

# load the package
library(tidyverse)
# Read the data, specifying how missing values are encoded
gapminder_clean <- read_csv("data/processed/gapminder1960to2010_socioeconomic_clean.csv", 
                            na = "")

If you haven’t completed that exercise, here’s how you can recreate the clean dataset:

gapminder_clean <- read_csv("data/raw/gapminder1960to2010_socioeconomic.csv", na = "") %>% 
 # fix typos in main_religion and world region
 mutate(main_religion = str_to_title(str_squish(main_religion)),
        world_region = str_to_title(str_replace_all(world_region, "_", " "))) %>% 
 # fit typos in income groups, which needs more steps
 mutate(income_groups = str_remove(income_groups, "_income")) %>% 
 mutate(income_groups = str_to_title(str_replace_all(income_groups, "_", " "))) %>% 
 # fix/create numeric variables
 mutate(life_expectancy_female = as.numeric(life_expectancy_female),
        life_expectancy_male = ifelse(life_expectancy_male == -999, NA, life_expectancy_male))

Data reshaping

What we mean by “the shape of our data” is how the values are distributed across rows or columns. Here’s a visual representation of the same data in two different shapes:

Data Shapes

“Long” format is where we have a column for each of the types of things we measured or recorded in our data. In other words, each variable has its own column.
“Wide” format occurs when we have data relating to the same measured thing in different columns. In this case, we have values related to our “metric” spread across multiple columns (a column each for a year).

Neither of these formats is necessarily more correct than the other: it will depend on what analysis you intend on doing. However, it is worth mentioning that the “long” format is often preferred, as it is clearer how many distinct types of variables we have in the data.

To figure out which format you might need, it may help to think of which visualisations you may want to build with ggplot2. Taking the above example:

If one was interested in looking at the change of the “metric” across years for each country, then the long format is more suitable to define each aesthetic of such a graph: aes(x = year, y = metric, colour = country).
If one was interested in the correlation of this metric between 1960 and 2010, then the wide format would be more suitable: aes(x = yr1960, y = yr2010, colour = country).

Using the `pivot` functions

Let’s look at how we can reshape our data in practice. There are two functions in the tidyr package (part of tidyverse) that can help us with this:

pivot_wider(): from long to wide format.
pivot_longer(): from wide to long format.

Here is the same picture with added code:

pivot

Let’s say we’d like to look at the correlation of income_per_person between 1960 and 2010 across all the countries in our dataset. To achieve this, we will need to create a “wide” version of our table, where each row is a country and each column a year, with values of income_per_person in each cell of the table.

income_wide <- gapminder_clean %>% 
  # select only the columns we're interested in
  select(country, year, income_per_person) %>% 
  # use pivot_wider to go from long to wide format
  pivot_wider(names_from = "year", 
              names_prefix = "yr",
              values_from = "income_per_person")
income_wide

# A tibble: 193 x 52
   country yr1960 yr1961 yr1962 yr1963 yr1964 yr1965 yr1966 yr1967 yr1968 yr1969
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan…   2744   2702   2683   2665   2649   2641   2598   2601   2623   2594
 2 Angola    3861   4313   4134   4284   4695   4974   5187   5408   5247   5318
 3 Albania   2585   2607   2693   2785   2880   2985   3098   3215   3331   3444
 4 Andorra  15234  16469  17805  19248  20809  22496  24320  26291  28422  30726
 5 United…   1767   1973   2203   2863   4871   8285  14092  23968  28831  34679
 6 Argent…   9765  10299   9976   9589  10418  11202  11116  11257  11574  12385
 7 Armenia   3003   3120   3153   3034   3378   3524   3651   3769   3946   3959
 8 Antigu…   4419   4571   4727   4888   5055   5227   5405   5588   5777   5973
 9 Austra…  15972  15678  16311  16939  17700  18195  18305  19131  19817  20494
10 Austria  11302  11865  12105  12547  13242  13566  14262  14620  15226  16164
# … with 183 more rows, and 41 more variables: yr1970 <dbl>, yr1971 <dbl>,
#   yr1972 <dbl>, yr1973 <dbl>, yr1974 <dbl>, yr1975 <dbl>, yr1976 <dbl>,
#   yr1977 <dbl>, yr1978 <dbl>, yr1979 <dbl>, yr1980 <dbl>, yr1981 <dbl>,
#   yr1982 <dbl>, yr1983 <dbl>, yr1984 <dbl>, yr1985 <dbl>, yr1986 <dbl>,
#   yr1987 <dbl>, yr1988 <dbl>, yr1989 <dbl>, yr1990 <dbl>, yr1991 <dbl>,
#   yr1992 <dbl>, yr1993 <dbl>, yr1994 <dbl>, yr1995 <dbl>, yr1996 <dbl>,
#   yr1997 <dbl>, yr1998 <dbl>, yr1999 <dbl>, yr2000 <dbl>, yr2001 <dbl>,
#   yr2002 <dbl>, yr2003 <dbl>, yr2004 <dbl>, yr2005 <dbl>, yr2006 <dbl>,
#   yr2007 <dbl>, yr2008 <dbl>, yr2009 <dbl>, yr2010 <dbl>

Note that the names_prefix argument is optional. Here, we’ve used it because it’s a good idea to avoid column names that start with a number. In other cases you can omit it.

Now, we could answer our question of interest:

income_wide %>% 
  ggplot(aes(yr1960, yr2010)) +
  geom_point() +
  geom_abline() +
  scale_x_continuous(trans = "log10") +
  scale_y_continuous(trans = "log10")

plot of chunk unnamed-chunk-6

The oposite of this operation can be achieved with the pivot_longer() function. Let’s reverse the reshaping we did above:

income_wide %>% 
  pivot_longer(cols = yr1960:yr2010,
               names_to = "year", 
               values_to = "income_per_person")

# A tibble: 9,843 x 3
   country     year   income_per_person
   <chr>       <chr>              <dbl>
Afghanistan yr1960              2744
Afghanistan yr1961              2702
Afghanistan yr1962              2683
Afghanistan yr1963              2665
Afghanistan yr1964              2649
Afghanistan yr1965              2641
Afghanistan yr1966              2598
Afghanistan yr1967              2601
Afghanistan yr1968              2623
Afghanistan yr1969              2594
# … with 9,833 more rows

Note that the cols argument of pivot_longer() accepts column specifications in a similar way to the select() function discussed in an earlier episode. For example, in this case we could have also used contains("yr") to use all the columns containing the string “yr”.

Exercise

Create a “wide” table of life_expectancy values, with a row for each year and a column for each country.
Fix the pivoting step in the following code, which should allow making the graph below.

gapminder_clean %>% 
  # retain data from a few years only
  filter(year %in% c(1960, 1970, 1980, 1990, 2010)) %>% 
  # select columns of interest
  select(country, year, life_expectancy_female, life_expectancy_male) %>% 
  # reshape the table
  pivot_longer(cols = FIXME, 
               names_to = "FIXME", 
               values_to = "FIXME") %>% 
  ggplot(aes(factor(year), life_expect)) +
  geom_boxplot(aes(fill = sex))

plot of chunk unnamed-chunk-9

Answer

gapminder_clean %>% 
  select(country, year, life_expectancy) %>% 
  pivot_wider(names_from = "country", values_from = "life_expectancy")

# A tibble: 51 x 194
    year Afghanistan Angola Albania Andorra `United Arab Em… Argentina Armenia
   <dbl>       <dbl>  <dbl>   <dbl>   <dbl>            <dbl>     <dbl>   <dbl>
 1  1960        39.3   40.6    62.2      NA             53.9      64.2    61.9
 2  1961        40.0   41.1    63.3      NA             55.0      64.4    62.2
 3  1962        40.8   41.7    64.2      NA             56.1      64.4    62.6
 4  1963        41.5   42.3    64.9      NA             57.2      64.5    62.9
 5  1964        42.2   42.9    65.4      NA             58.3      64.6    63.3
 6  1965        43.0   43.5    65.8      NA             59.3      64.6    63.7
 7  1966        43.7   44.1    66.1      NA             60.3      64.8    64.0
 8  1967        44.5   44.7    66.3      NA             61.3      64.9    64.4
 9  1968        45.2   45.3    66.4      NA             62.2      65.1    64.7
10  1969        45.9   45.9    66.6      NA             63.1      65.3    65.0
# … with 41 more rows, and 186 more variables: `Antigua and Barbuda` <dbl>,
#   Australia <dbl>, Austria <dbl>, Azerbaijan <dbl>, Burundi <dbl>,
#   Belgium <dbl>, Benin <dbl>, `Burkina Faso` <dbl>, Bangladesh <dbl>,
#   Bulgaria <dbl>, Bahrain <dbl>, Bahamas <dbl>, `Bosnia and
#   Herzegovina` <dbl>, Belarus <dbl>, Belize <dbl>, Bolivia <dbl>,
#   Brazil <dbl>, Barbados <dbl>, Brunei <dbl>, Bhutan <dbl>, Botswana <dbl>,
#   `Central African Republic` <dbl>, Canada <dbl>, Switzerland <dbl>,
#   Chile <dbl>, China <dbl>, `Cote d'Ivoire` <dbl>, Cameroon <dbl>, `Congo,
#   Dem. Rep.` <dbl>, `Congo, Rep.` <dbl>, Colombia <dbl>, Comoros <dbl>, `Cape
#   Verde` <dbl>, `Costa Rica` <dbl>, Cuba <dbl>, Cyprus <dbl>, `Czech
#   Republic` <dbl>, Germany <dbl>, Djibouti <dbl>, Dominica <dbl>,
#   Denmark <dbl>, `Dominican Republic` <dbl>, Algeria <dbl>, Ecuador <dbl>,
#   Egypt <dbl>, Eritrea <dbl>, Spain <dbl>, Estonia <dbl>, Ethiopia <dbl>,
#   Finland <dbl>, Fiji <dbl>, France <dbl>, `Micronesia, Fed. Sts.` <dbl>,
#   Gabon <dbl>, `United Kingdom` <dbl>, Georgia <dbl>, Ghana <dbl>,
#   Guinea <dbl>, Gambia <dbl>, `Guinea-Bissau` <dbl>, `Equatorial
#   Guinea` <dbl>, Greece <dbl>, Grenada <dbl>, Guatemala <dbl>, Guyana <dbl>,
#   Honduras <dbl>, Croatia <dbl>, Haiti <dbl>, Hungary <dbl>, Indonesia <dbl>,
#   India <dbl>, Ireland <dbl>, Iran <dbl>, Iraq <dbl>, Iceland <dbl>,
#   Israel <dbl>, Italy <dbl>, Jamaica <dbl>, Jordan <dbl>, Japan <dbl>,
#   Kazakhstan <dbl>, Kenya <dbl>, `Kyrgyz Republic` <dbl>, Cambodia <dbl>,
#   Kiribati <dbl>, `St. Kitts and Nevis` <dbl>, `South Korea` <dbl>,
#   Kuwait <dbl>, Lao <dbl>, Lebanon <dbl>, Liberia <dbl>, Libya <dbl>, `St.
#   Lucia` <dbl>, `Sri Lanka` <dbl>, Lesotho <dbl>, Lithuania <dbl>,
#   Luxembourg <dbl>, Latvia <dbl>, Morocco <dbl>, Monaco <dbl>, …

gapminder_clean %>% 
  filter(year %in% c(1960, 1970, 1980, 1990, 2010)) %>% 
  select(country, year, life_expectancy_female, life_expectancy_male) %>% 
  pivot_longer(cols = life_expectancy_female:life_expectancy_male, 
               names_to = "sex", values_to = "life_expect") %>% 
  ggplot(aes(factor(year), life_expect)) +
  geom_boxplot(aes(fill = sex))

plot of chunk unnamed-chunk-11

Key Points

Use pivot_wider() to reshape a table from long to wide format.

Use pivot_longer() to reshape a table from wide to long format.

To figure out which data format is more suited for a given analysis, it can help to think about what visualisation you want to make with ggplot: any aesthetics needed to build the graph should exist as columns of your table.

previous episode

Introduction to R/tidyverse for Exploratory Data Analysis

next episode

Data reshaping: from wide to long and back

Overview

Data reshaping

Using the `pivot` functions

Exercise

Answer

Key Points

previous episode

next episode

previous episode

Introduction to R/tidyverse for Exploratory Data Analysis

next episode

Data reshaping: from wide to long and back

Overview

Data reshaping

Using the pivot functions

Exercise

Answer

Key Points

previous episode

next episode

Using the `pivot` functions