The problem

Sometimes we collect data from different sources and need to load it all into a single data frame in R. In my research this happens quite often, usually when collecting behavioral data. For example, I tend to use psychopy a lot to present stimuli. This outputs a separate .csv file for each participant. So, how do you get all of those .csv files into R? In this post, I show you the way I usually do it, as well as a new(er) (to me) method that is more flexible.

Some data

Since the issue we are trying to solve occurs when we have more than one csv file, we will generate some test files to play with.

# Create 2 data frames with different columns
# and save them as .csv files

# df1
data.frame(x = 1:3, y = c('a', 'b', 'c')) |> 
  write.csv(file = "df1.csv", row.names = F)

# df2
data.frame(x = 4:6, y = c('d', 'e', 'f')) |> 
  write.csv(file = "df2.csv", row.names = F)

The data frames look like this:

Table 1: Two toy data frames

(a) df1

x	y
1	a
2	b
3	c

(b) df2

x	y
4	d
5	e
6	f

but this is our desired output:

Table 2: Combined data frames

x	y
1	a
2	b
3	c
4	d
5	e
6	f

Solution 1

So, let’s assume we have these two csv files in the root directory of our project folder and we want to load all of them and combine them into a single data frame. Here is a nice two-liner using readr¹:

# Load packages
library("readr")

# Get csv files in project root, read as df
list.files(pattern = ".csv") |> 
  read_csv()

# A tibble: 6 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e    
6     6 f

The output is just what we described above. This works quite well for most use cases.

A bigger problem

Now imagine that we also have the following csv files:

# Add z column
data.frame(x = 4:6, y = c('d', 'e', 'f'), z = c(TRUE, TRUE, FALSE)) |> 
  write.csv(file = "df3.csv", row.names = F)

# Add foo column
data.frame(x = 7:9, y = c('g', 'h', 'i'), foo = c(FALSE, TRUE, FALSE)) |> 
  write.csv(file = "df4.csv", row.names = F)

If we look at all 4 of them side-by-side, we can see that df3 and df4 have an additional column and they aren’t the same (z and foo).

Table 3: 4 toy data frames

(a) df1

x	y
1	a
2	b
3	c

(b) df2

x	y
4	d
5	e
6	f

x	y	z
4	d	TRUE
5	e	TRUE
6	f	FALSE

(d) df4

x	y	foo
7	g	FALSE
8	h	TRUE
9	i	FALSE

Even though we are not interested in z nor foo, our previous solution will not work becuase of the different column names/lengths.

# Get csv files in project root, read as df
list.files(pattern = ".csv") |> 
  read_csv()

! Files must all have 2 columns:
* File 3 has 3 columns

Solution 2

The logic is as follows. We can create a list containing all the data frames, select just the columns we want, and then bind them all into a single data frame. We will use purrr to make this happen.

library("purrr")

# Make vector of the columns you want
my_cols <- c("x", "y")

# Get a list of all csv files
list.files(pattern = ".csv") |> 
  as.list() |> 
  # Read elements of list into a separate list as data frames
  map(read_csv) |> 
  # From each data frame in the list, select only the columns in my_cols, i.e., 
  # drop any other columns (z and foo)
  map(.f = function(x) {x[, names(x) %in% my_cols]}) |> 
  # Bind all the data frames in the list to a single data frame
  do.call(what = "rbind", args = _)

# A tibble: 12 × 2
       x y    
   <dbl> <chr>
 1     1 a    
 2     2 b    
 3     3 c    
 4     4 d    
 5     5 e    
 6     6 f    
 7     4 d    
 8     5 e    
 9     6 f    
10     7 g    
11     8 h    
12     9 i

Success!

Here is a template to copy/paste for future me when I forget how I did this:

# Make vector of the columns you want
my_cols <- c("col1", "col2")

# Get a list of all csv files
# Load them as individual data frames inside a list
# Select only the columns you want (my_cols)
# Bind all the list elements into a single data frame
list.files(pattern = ".csv") |> 
  as.list() |> 
  map(read_csv) |> 
  map(.f = function(x) {x[, names(x) %in% my_cols]}) |> 
  do.call(what = "rbind", args = _)

Reproducibility information

This document was written in quarto.

Session info

 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Sonoma 14.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/New_York
 date     2024-03-30
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

            loadedversion       date
bit                 4.0.5 2022-11-15
bit64               4.0.5 2020-08-30
cachem              1.0.8 2023-05-01
cli                 3.6.2 2023-12-11
crayon              1.5.2 2022-09-29
devtools            2.4.5 2022-10-11
digest             0.6.35 2024-03-11
ellipsis            0.3.2 2021-04-29
evaluate             0.23 2023-11-01
fansi               1.0.6 2023-12-08
fastmap             1.1.1 2023-02-24
fs                  1.6.3 2023-07-20
glue                1.7.0 2024-01-09
hms                 1.1.3 2023-03-21
htmltools           0.5.7 2023-11-03
htmlwidgets         1.6.4 2023-12-06
httpuv             1.6.14 2024-01-26
jsonlite            1.8.8 2023-12-04
knitr                1.45 2023-10-30
later               1.3.2 2023-12-06
lifecycle           1.0.4 2023-11-07
magrittr            2.0.3 2022-03-30
memoise             2.0.1 2021-11-26
mime                 0.12 2021-09-28
miniUI            0.1.1.1 2018-05-18
pillar              1.9.0 2023-03-22
pkgbuild            1.4.3 2023-12-10
pkgconfig           2.0.3 2019-09-22
pkgload             1.3.4 2024-01-16
profvis             0.3.8 2023-05-02
promises            1.2.1 2023-08-10
purrr               1.0.2 2023-08-10
R6                  2.5.1 2021-08-19
Rcpp               1.0.12 2024-01-09
readr               2.1.5 2024-01-10
remotes           2.4.2.1 2023-07-18
rlang               1.1.3 2024-01-10
rmarkdown            2.26 2024-03-05
rstudioapi         0.16.0 2024-03-24
sessioninfo         1.2.2 2021-12-06
shiny               1.8.0 2023-11-17
stringi             1.8.3 2023-12-11
stringr             1.5.1 2023-11-14
tibble              3.2.1 2023-03-20
tidyselect          1.2.1 2024-03-11
tzdb                0.4.0 2023-05-12
urlchecker          1.0.1 2021-11-30
usethis             2.2.3 2024-02-19
utf8                1.2.4 2023-10-22
vctrs               0.6.5 2023-12-01
vroom               1.6.5 2023-12-05
xfun                 0.42 2024-02-08
xtable              1.8-4 2019-04-21
yaml                2.3.8 2023-12-11

Footnotes

Note: this strategy won’t work with the base R function read.csv because it is not vectorized.↩︎