tidyr tutorial

Overview

Tidy data make every later step of the data analysis pipeline easier. In tidy data, every column of your data frame represents a variable and every row represents an observation. This layout is often referred to as long format (as opposed to wide format).

tidyr is a package that provides useful functions for converting raw data into tidy data. This is typically the first step in the data analysis pipeline after you have collected your data.

This tutorial focuses on that tidying step. The main verbs we will use are:

gather() and spread() convert between long and wide data
separate() splits a single column into several variables; in linguistic research it is commonly used in conjunction with gather() (e.g. when separating columns exported from Praat).

gather()

used to make wide data long
takes columns, and gathers them into key-value pairs
gather(df, newVar1, newVar2, vector1, vector2)

library(tidyr); library(dplyr); library(ggplot2)

set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"), 
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15), 
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)

print(tidyr.ex)

##   participant info day1score day2score
## 1          p1  g1m  70.60319  91.89943
## 2          p2  g1m  82.75465  93.90660
## 3          p3  g1f  67.46557  92.60625
## 4          p4  g2m 103.92921  85.55689
## 5          p5  g2m  84.94262 100.09425
## 6          p6  g2m  67.69297  91.11875

tidyr.ex %>%
  gather(day, score, c(day1score, day2score))

##    participant info       day     score
## 1           p1  g1m day1score  70.60319
## 2           p2  g1m day1score  82.75465
## 3           p3  g1f day1score  67.46557
## 4           p4  g2m day1score 103.92921
## 5           p5  g2m day1score  84.94262
## 6           p6  g2m day1score  67.69297
## 7           p1  g1m day2score  91.89943
## 8           p2  g1m day2score  93.90660
## 9           p3  g1f day2score  92.60625
## 10          p4  g2m day2score  85.55689
## 11          p5  g2m day2score 100.09425
## 12          p6  g2m day2score  91.11875

Essentially, we took the columns day1score and day2score, whose names encode the variable day and whose values hold the variable score, and gathered them into a key column (day) and a value column (score). Why? Remember that tidy data has one column for each variable and one row for each observation. Each number in those two columns is a separate observation, so each one should get its own row.
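For readers on tidyr 1.0.0 or later, the same reshaping can also be written with pivot_longer(), which the tidyr documentation recommends over gather() for new code. The sketch below is only an illustration: wide is a hypothetical two-participant stand-in for tidyr.ex, and its scores are made up.

```r
library(tidyr)

# Hypothetical wide data in the same layout as tidyr.ex (made-up scores)
wide <- data.frame(
  participant = c("p1", "p2"),
  day1score = c(70, 82),
  day2score = c(91, 93)
)

# gather(): the key column is named day, the value column is named score
long_gather <- gather(wide, day, score, c(day1score, day2score))

# pivot_longer(): the same reshaping, with the new column names given as strings
long_pivot <- pivot_longer(
  wide,
  cols = c(day1score, day2score),
  names_to = "day",
  values_to = "score"
)
```

Both results hold the same four observations. The row order differs: gather() stacks the data one original column at a time, while pivot_longer() keeps each participant's rows together.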

spread()

This is the complement of gather(). The spread() verb takes the different levels of a factor and spreads them out into different columns. This means we can convert from long data to wide.
spread(df, var1, var2)

tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  spread(day, score)

##   participant info day1score day2score
## 1          p1  g1m  70.60319  91.89943
## 2          p2  g1m  82.75465  93.90660
## 3          p3  g1f  67.46557  92.60625
## 4          p4  g2m 103.92921  85.55689
## 5          p5  g2m  84.94262 100.09425
## 6          p6  g2m  67.69297  91.11875

Now we are back to how we started.
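One way to confirm the round trip, rather than comparing printed output by eye, is to check the result against the starting data. This is a minimal sketch: wide is a hypothetical three-participant data frame with made-up scores, and same_names and same_scores are illustrative variable names.

```r
library(tidyr)

# Hypothetical wide data with the same column layout as tidyr.ex
wide <- data.frame(
  participant = c("p1", "p2", "p3"),
  day1score = c(70.6, 82.8, 67.5),
  day2score = c(91.9, 93.9, 92.6)
)

# Wide -> long -> wide again
long <- gather(wide, day, score, c(day1score, day2score))
round_trip <- spread(long, day, score)

# Compare column names and score values with the starting data
same_names <- identical(names(round_trip), names(wide))
same_scores <- all(round_trip$day1score == wide$day1score) &&
  all(round_trip$day2score == wide$day2score)
```

Note that spread() sorts the rows by the remaining identifier columns, so the comparison lines up here only because participant is already in sorted order.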

separate()

Takes values inside a column and separates them.
Ex. mg1old > m g1 old
separate(df, col, into, sep)

Consider the column info of our fake data. You can probably guess what the observations represent. How many variables are there? Take a second to think about it if it doesn’t jump out at you. The answer is 2. g1 and g2 appear to be levels of a grouping variable (g = group), and m and f indicate gender. Because there are two separate variables, there should be two columns in the data frame: one for group and one for gender.

tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2)

##    participant group gender       day     score
## 1           p1    g1      m day1score  70.60319
## 2           p2    g1      m day1score  82.75465
## 3           p3    g1      f day1score  67.46557
## 4           p4    g2      m day1score 103.92921
## 5           p5    g2      m day1score  84.94262
## 6           p6    g2      m day1score  67.69297
## 7           p1    g1      m day2score  91.89943
## 8           p2    g1      m day2score  93.90660
## 9           p3    g1      f day2score  92.60625
## 10          p4    g2      m day2score  85.55689
## 11          p5    g2      m day2score 100.09425
## 12          p6    g2      m day2score  91.11875
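The sep argument accepts either character positions or a delimiter. The sketch below uses a vector of two positions to reproduce the mg1old > m g1 old split mentioned above, then a delimiter for comparison. The data frames labels and delimited, and the column names gender, group and age, are hypothetical.

```r
library(tidyr)

# Hypothetical codes: gender (1 character), group (2 characters), age label (rest)
labels <- data.frame(code = c("mg1old", "fg2young"))

# Numeric sep: split after character 1 and after character 3
by_position <- separate(
  labels,
  col = code,
  into = c("gender", "group", "age"),
  sep = c(1, 3)
)

# Character sep: split wherever the delimiter "_" occurs
delimited <- data.frame(code = c("g1_m", "g2_f"))
by_delimiter <- separate(
  delimited,
  col = code,
  into = c("group", "gender"),
  sep = "_"
)
```

Positional splitting only works when every value has the same layout up to the last split point. When field widths vary, a delimiter is the safer choice.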

unite()

unite() does the opposite of separate(): it pastes several columns together into one. In my experience, this is not something that needs to be done very often.
unite(df, newVarName, col1, col2)

tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  unite(infoAgain, group, gender)

##    participant infoAgain       day     score
## 1           p1      g1_m day1score  70.60319
## 2           p2      g1_m day1score  82.75465
## 3           p3      g1_f day1score  67.46557
## 4           p4      g2_m day1score 103.92921
## 5           p5      g2_m day1score  84.94262
## 6           p6      g2_m day1score  67.69297
## 7           p1      g1_m day2score  91.89943
## 8           p2      g1_m day2score  93.90660
## 9           p3      g1_f day2score  92.60625
## 10          p4      g2_m day2score  85.55689
## 11          p5      g2_m day2score 100.09425
## 12          p6      g2_m day2score  91.11875
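Notice that unite() inserted an underscore between the values, so infoAgain is not identical to the original info column. The sep argument controls this. A small sketch, using a hypothetical data frame split_info in place of the separated data:

```r
library(tidyr)

# Hypothetical stand-in for the data after separate()
split_info <- data.frame(
  group = c("g1", "g1", "g2"),
  gender = c("m", "f", "m")
)

# Default: values are joined with "_"
with_underscore <- unite(split_info, infoAgain, group, gender)

# sep = "": values are pasted together directly, as in the original info column
no_separator <- unite(split_info, infoAgain, group, gender, sep = "")
```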

Now that our data are tidy (using just the gather() and separate() verbs), we can plot and analyze them.

tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  ggplot(aes(x = day, y = score)) + 
  geom_point() + 
  facet_wrap(~ group) +
  geom_smooth(method = "lm", aes(group = 1), se = FALSE)

## `geom_smooth()` using formula = 'y ~ x'

These are the essential verbs used for tidying data. There are other commands that can be useful, but they are mainly different takes on the ones we have covered here (e.g. extract(), which is similar to separate() but uses regular-expression capture groups to decide what goes into each new column).
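As an illustration of extract(), the sketch below splits an info-style column with a regular expression instead of a character position. The data frame info_df and the pattern are assumptions made for this example; the pattern expects a g followed by digits and then a single m or f.

```r
library(tidyr)

# Hypothetical column in the same format as info in tidyr.ex
info_df <- data.frame(info = c("g1m", "g1f", "g2m"))

# Each parenthesised group in the regex becomes one new column.
# tidyr::extract is written out in full because other packages
# (e.g. magrittr) also export a function called extract().
extracted <- tidyr::extract(
  info_df,
  col = info,
  into = c("group", "gender"),
  regex = "^(g[0-9]+)([mf])$"
)
```

Unlike sep = 2, this pattern would still work if a group label had more than one digit (e.g. g10m).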