In order to facilitate the data analysis pipeline, it is crucial to have tidy data. What this means is that every column in your data frame represents a variable and every row represents an observation. This is also referred to as long format (as opposed to wide format).
tidyr is a package that provides useful functions for converting raw data into tidy data. This is typically the first step in the data analysis pipeline after you have collected your data.
This tutorial will focus on step 2 of the process. The main verbs we will use are:
- gather() and spread(), which convert between long and wide data
- separate(), which splits a single column into several variables and is commonly used in conjunction with gather() in linguistic research (e.g. when separating columns in Praat)
gather(df, newVar1, newVar2, vector1, vector2)
library(tidyr)
library(dplyr)
library(ggplot2)  # needed for the plot at the end of the tutorial
set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"),
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15),
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)
print(tidyr.ex)
## participant info day1score day2score
## 1 p1 g1m 70.60319 91.89943
## 2 p2 g1m 82.75465 93.90660
## 3 p3 g1f 67.46557 92.60625
## 4 p4 g2m 103.92921 85.55689
## 5 p5 g2m 84.94262 100.09425
## 6 p6 g2m 67.69297 91.11875
tidyr.ex %>%
  gather(day, score, c(day1score, day2score))
## participant info day score
## 1 p1 g1m day1score 70.60319
## 2 p2 g1m day1score 82.75465
## 3 p3 g1f day1score 67.46557
## 4 p4 g2m day1score 103.92921
## 5 p5 g2m day1score 84.94262
## 6 p6 g2m day1score 67.69297
## 7 p1 g1m day2score 91.89943
## 8 p2 g1m day2score 93.90660
## 9 p3 g1f day2score 92.60625
## 10 p4 g2m day2score 85.55689
## 11 p5 g2m day2score 100.09425
## 12 p6 g2m day2score 91.11875
Essentially we took the columns day1score and day2score, whose names encode the variable day and whose values hold the variable score, and gathered them. Why? Remember that tidy data has one column for each variable and one row for each observation. The numbers in the two columns we changed were observations, thus they should each get their own row.
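After gathering, the day column still holds the old column names ("day1score", "day2score") rather than a day number. A minimal sketch of one way to clean that up, assuming a numeric day is wanted; the column name day_num is a hypothetical choice, not part of the original example:

```r
library(tidyr)
library(dplyr)

set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"),
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15),
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)

long <- tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  # Drop every non-digit character, so "day1score" becomes 1 and
  # "day2score" becomes 2 (day_num is an illustrative name)
  mutate(day_num = as.integer(gsub("[^0-9]", "", day)))

head(long)
```

Keeping the original day column alongside day_num makes it easy to check that the conversion did what was intended.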
spread() does the opposite of gather(). The spread() verb takes the different levels of a key column and spreads them out into separate columns. This means we can convert from long data to wide.
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  spread(day, score)
## participant info day1score day2score
## 1 p1 g1m 70.60319 91.89943
## 2 p2 g1m 82.75465 93.90660
## 3 p3 g1f 67.46557 92.60625
## 4 p4 g2m 103.92921 85.55689
## 5 p5 g2m 84.94262 100.09425
## 6 p6 g2m 67.69297 91.11875
Now we are back to how we started.
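Rather than comparing the printed output by eye, the round trip can be checked programmatically. This is a sketch under the assumption that a column-by-column comparison is enough; round_trip is a hypothetical name introduced here:

```r
library(tidyr)
library(dplyr)

set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"),
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15),
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)

# Long format and back to wide again
round_trip <- tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  spread(day, score)

# Same column names, in the same order, as the original?
identical(names(round_trip), names(tidyr.ex))

# Same scores? all.equal() tolerates floating-point noise
all.equal(round_trip$day1score, tidyr.ex$day1score)
all.equal(round_trip$day2score, tidyr.ex$day2score)
```

Each check should return TRUE here. Note that spread() orders rows by the remaining identifier columns, so with data that is not already sorted by participant the rows may come back in a different order than they went in.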
separate(df, col, into, sep)
Consider the column info of our fake data. You can probably guess what the observations represent. How many variables are there? Take a second to think about it if it doesn’t jump out at you. The answer is 2. g1 and g2 appear to be levels of a grouping variable (g = group), and m/f indicates gender. Because there are two separate variables, there should be two columns in the data frame: one for group and one for gender.
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2)
## participant group gender day score
## 1 p1 g1 m day1score 70.60319
## 2 p2 g1 m day1score 82.75465
## 3 p3 g1 f day1score 67.46557
## 4 p4 g2 m day1score 103.92921
## 5 p5 g2 m day1score 84.94262
## 6 p6 g2 m day1score 67.69297
## 7 p1 g1 m day2score 91.89943
## 8 p2 g1 m day2score 93.90660
## 9 p3 g1 f day2score 92.60625
## 10 p4 g2 m day2score 85.55689
## 11 p5 g2 m day2score 100.09425
## 12 p6 g2 m day2score 91.11875
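A numeric sep is a character position, so sep = 2 only works while every group label is exactly two characters long. A minimal sketch of an alternative, assuming the gender code is always the final character: a negative sep counts from the right-hand end of the string. The labels data frame below is hypothetical and only serves to show a two-digit group number:

```r
library(tidyr)

# Hypothetical labels: the group number has one digit in some rows, two in others
labels <- data.frame(info = c("g1m", "g10f", "g2m"))

# sep = -1 splits one character in from the right, whatever the string length
split_labels <- separate(labels, col = info, into = c("group", "gender"), sep = -1)

print(split_labels)
```

With sep = 2 the second row would instead be cut into "g1" and "0f", which is why counting from the right is the safer choice for labels like these.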
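The unite() verb reverses separate(): it pastes several columns back together into one, joining the values with an underscore by default.
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  unite(infoAgain, group, gender)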
## participant infoAgain day score
## 1 p1 g1_m day1score 70.60319
## 2 p2 g1_m day1score 82.75465
## 3 p3 g1_f day1score 67.46557
## 4 p4 g2_m day1score 103.92921
## 5 p5 g2_m day1score 84.94262
## 6 p6 g2_m day1score 67.69297
## 7 p1 g1_m day2score 91.89943
## 8 p2 g1_m day2score 93.90660
## 9 p3 g1_f day2score 92.60625
## 10 p4 g2_m day2score 85.55689
## 11 p5 g2_m day2score 100.09425
## 12 p6 g2_m day2score 91.11875
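The united values above contain an underscore ("g1_m") because that is the default separator of unite(). A short sketch, assuming the goal is to recover the original info strings exactly, passes sep = "" instead; restored is a hypothetical name:

```r
library(tidyr)
library(dplyr)

set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"),
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15),
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)

restored <- tidyr.ex %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  # An empty separator pastes the pieces together with nothing in between
  unite(info, group, gender, sep = "")

# Compare against the original column
identical(restored$info, as.character(tidyr.ex$info))
```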
Now that our data are tidy (using just the gather() and separate() verbs), we can plot and analyze them.
tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  ggplot(aes(x = day, y = score)) +
  geom_point() +
  facet_wrap(~ group) +
  geom_smooth(method = "lm", aes(group = 1), se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
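Tidy data also feeds straight into dplyr for the analysis side. As a rough sketch (one possible summary among many, not necessarily the analysis intended here), the mean score per group and day can be computed as follows; summary_tbl, mean_score, and n are hypothetical names:

```r
library(tidyr)
library(dplyr)

set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"),
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15),
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)

summary_tbl <- tidyr.ex %>%
  gather(day, score, c(day1score, day2score)) %>%
  separate(col = info, into = c("group", "gender"), sep = 2) %>%
  # One summary row for each group-by-day combination
  group_by(group, day) %>%
  summarise(mean_score = mean(score), n = n(), .groups = "drop")

print(summary_tbl)
```

Because group and day are ordinary columns in the long format, grouping on them needs no extra reshaping, which would not be the case with the original wide layout.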
These are the essential verbs used for tidying data. There are other commands that can be useful, but mainly they are variations on the ones we have covered here (e.g. extract(), which works like separate() but splits on regex capture groups, and unite(), which reverses separate() by pasting columns back together).
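To make the regex-based alternative concrete, here is a minimal sketch of extract() applied to the same info column. The pattern is an assumption about how the labels are built ("g" plus digits, then "m" or "f"), and split_info is a hypothetical name:

```r
library(tidyr)

set.seed(1)
tidyr.ex <- data.frame(
  participant = c("p1", "p2", "p3", "p4", "p5", "p6"),
  info = c("g1m", "g1m", "g1f", "g2m", "g2m", "g2m"),
  day1score = rnorm(n = 6, mean = 80, sd = 15),
  day2score = rnorm(n = 6, mean = 88, sd = 8)
)

# Each pair of parentheses in the regex becomes one new column:
# "(g[0-9]+)" captures the group label, "([mf])" captures the gender code.
# tidyr:: is spelled out to avoid clashing with other packages' extract()
split_info <- tidyr::extract(
  tidyr.ex,
  col = info,
  into = c("group", "gender"),
  regex = "(g[0-9]+)([mf])"
)

print(split_info)
```

Unlike separate() with a fixed position, the pattern keeps working if a group number grows to two digits, and rows that do not match the pattern come back as NA rather than being split in the wrong place.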