# Create 2 data frames with different columns
# and save them as .csv files
# df1
data.frame(x = 1:3, y = c('a', 'b', 'c')) |>
write.csv(file = "df1.csv", row.names = F)
# df2
data.frame(x = 4:6, y = c('d', 'e', 'f')) |>
write.csv(file = "df2.csv", row.names = F)
The problem
Sometimes we collect data from different sources and need to load it all into a single data frame in R. In my research this happens quite often, usually when collecting behavioral data. For example, I tend to use psychopy
a lot to present stimuli. This outputs a separate .csv file for each participant. So, how do you get all of those .csv files into R? In this post, I show you the way I usually do it, as well as a new(er) (to me) method that is more flexible.
Some data
Since the issue we are trying to solve occurs when we have more than one csv file, we will generate some test files to play with.
The data frames look like this:
x | y |
---|---|
1 | a |
2 | b |
3 | c |
x | y |
---|---|
4 | d |
5 | e |
6 | f |
but this is our desired output:
x | y |
---|---|
1 | a |
2 | b |
3 | c |
4 | d |
5 | e |
6 | f |
Solution 1
So, let’s assume we have these two csv files in the root directory of our project folder and we want to load all of them and combine them into a single data frame. Here is a nice two-liner using readr
1:
# Load packages
library("readr")
# Get csv files in project root, read as df
list.files(pattern = ".csv") |>
read_csv()
# A tibble: 6 × 2
x y
<dbl> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
The output is just what we described above. This works quite well for most use cases.
A bigger problem
Now imagine that we also have the following csv files:
# Add z column
data.frame(x = 4:6, y = c('d', 'e', 'f'), z = c(TRUE, TRUE, FALSE)) |>
write.csv(file = "df3.csv", row.names = F)
# Add foo column
data.frame(x = 7:9, y = c('g', 'h', 'i'), foo = c(FALSE, TRUE, FALSE)) |>
write.csv(file = "df4.csv", row.names = F)
If we look at all 4 of them side-by-side, we can see that df3
and df4
have an additional column and they aren’t the same (z
and foo
).
x | y |
---|---|
1 | a |
2 | b |
3 | c |
x | y |
---|---|
4 | d |
5 | e |
6 | f |
x | y | z |
---|---|---|
4 | d | TRUE |
5 | e | TRUE |
6 | f | FALSE |
x | y | foo |
---|---|---|
7 | g | FALSE |
8 | h | TRUE |
9 | i | FALSE |
Even though we are not interested in z
nor foo
, our previous solution will not work becuase of the different column names/lengths.
# Get csv files in project root, read as df
list.files(pattern = ".csv") |>
read_csv()
! Files must all have 2 columns:
* File 3 has 3 columns
Solution 2
The logic is as follows. We can create a list containing all the data frames, select just the columns we want, and then bind them all into a single data frame. We will use purrr
to make this happen.
library("purrr")
# Make vector of the columns you want
<- c("x", "y")
my_cols
# Get a list of all csv files
list.files(pattern = ".csv") |>
as.list() |>
# Read elements of list into a separate list as data frames
map(read_csv) |>
# From each data frame in the list, select only the columns in my_cols, i.e.,
# drop any other columns (z and foo)
map(.f = function(x) {x[, names(x) %in% my_cols]}) |>
# Bind all the data frames in the list to a single data frame
do.call(what = "rbind", args = _)
# A tibble: 12 × 2
x y
<dbl> <chr>
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 6 f
7 4 d
8 5 e
9 6 f
10 7 g
11 8 h
12 9 i
Success!
Here is a template to copy/paste for future me when I forget how I did this:
# Make vector of the columns you want
<- c("col1", "col2")
my_cols
# Get a list of all csv files
# Load them as individual data frames inside a list
# Select only the columns you want (my_cols)
# Bind all the list elements into a single data frame
list.files(pattern = ".csv") |>
as.list() |>
map(read_csv) |>
map(.f = function(x) {x[, names(x) %in% my_cols]}) |>
do.call(what = "rbind", args = _)
Reproducibility information
This document was written in quarto
.
Session info
setting value
version R version 4.3.1 (2023-06-16)
os macOS Sonoma 14.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/New_York
date 2024-03-30
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
loadedversion date
bit 4.0.5 2022-11-15
bit64 4.0.5 2020-08-30
cachem 1.0.8 2023-05-01
cli 3.6.2 2023-12-11
crayon 1.5.2 2022-09-29
devtools 2.4.5 2022-10-11
digest 0.6.35 2024-03-11
ellipsis 0.3.2 2021-04-29
evaluate 0.23 2023-11-01
fansi 1.0.6 2023-12-08
fastmap 1.1.1 2023-02-24
fs 1.6.3 2023-07-20
glue 1.7.0 2024-01-09
hms 1.1.3 2023-03-21
htmltools 0.5.7 2023-11-03
htmlwidgets 1.6.4 2023-12-06
httpuv 1.6.14 2024-01-26
jsonlite 1.8.8 2023-12-04
knitr 1.45 2023-10-30
later 1.3.2 2023-12-06
lifecycle 1.0.4 2023-11-07
magrittr 2.0.3 2022-03-30
memoise 2.0.1 2021-11-26
mime 0.12 2021-09-28
miniUI 0.1.1.1 2018-05-18
pillar 1.9.0 2023-03-22
pkgbuild 1.4.3 2023-12-10
pkgconfig 2.0.3 2019-09-22
pkgload 1.3.4 2024-01-16
profvis 0.3.8 2023-05-02
promises 1.2.1 2023-08-10
purrr 1.0.2 2023-08-10
R6 2.5.1 2021-08-19
Rcpp 1.0.12 2024-01-09
readr 2.1.5 2024-01-10
remotes 2.4.2.1 2023-07-18
rlang 1.1.3 2024-01-10
rmarkdown 2.26 2024-03-05
rstudioapi 0.16.0 2024-03-24
sessioninfo 1.2.2 2021-12-06
shiny 1.8.0 2023-11-17
stringi 1.8.3 2023-12-11
stringr 1.5.1 2023-11-14
tibble 3.2.1 2023-03-20
tidyselect 1.2.1 2024-03-11
tzdb 0.4.0 2023-05-12
urlchecker 1.0.1 2021-11-30
usethis 2.2.3 2024-02-19
utf8 1.2.4 2023-10-22
vctrs 0.6.5 2023-12-01
vroom 1.6.5 2023-12-05
xfun 0.42 2024-02-08
xtable 1.8-4 2019-04-21
yaml 2.3.8 2023-12-11
Footnotes
Note: this strategy won’t work with the base R function
read.csv
because it is not vectorized.↩︎