library(tidyverse)
library(DBI)
library(frenchdata)
library(arrow)
In this short note, I show how one can use DuckDB with WRDS data stored in the PostgreSQL database provided by WRDS. I then use some simple benchmarks to show how DuckDB offers a powerful, fast analytical engine for researchers in accounting and finance.
To make the analysis concrete, I focus on data used in the excellent recent book “Tidy Finance with R”. Essentially, I combine data from CRSP’s daily stock return file (crsp.dsf) with data on factor returns from Ken French’s website and then run an aggregate query.
Summary of findings
While using DuckDB simplifies the process of collecting data from WRDS (and results in a shorter download time), the real differences come after the data sets are on your computer. Using DuckDB to load the data and run an aggregate query reduces the time taken from over two minutes using dplyr to well under one second. DuckDB from disk is faster than dplyr from RAM. Additionally, DuckDB is faster than SQLite. In fact, for many queries DuckDB would be faster collecting data from WRDS than dplyr is with data in a local SQLite database. While performance isn’t everything, gains like these likely deliver real quality-of-life benefits to data analysts.
I also show that almost all the performance benefits of DuckDB are realized even if the data are stored in parquet files. This is useful information because, while the format of DuckDB database files remains in flux, parquet files are regarded by many as the “modern CSV” and can be read by many software systems, including R and Python. I describe how one could maintain a local library of parquet files including copies of WRDS tables here.
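To give a flavour of this portability, here is a minimal sketch (using the data/crsp_daily.parquet file created later in this note) of reading the file straight into an R data frame with arrow; the variable name crsp_daily_df is just illustrative:

crsp_daily_df <- arrow::read_parquet("data/crsp_daily.parquet")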
This note illustrates the power of the core Tidy Finance approach. With a few tweaks, one can springboard from the SQLite-and-dplyr approach of the book to the cutting edge of data science tools and approaches.
Databases and tidy data
A popular way to manage and store data is with SQL databases. Tidy Finance with R uses SQLite, which “implements a small, fast, self-contained, high-reliability, full-featured, SQL database engine.” In this note, I use DuckDB, which has been described as offering “SQLite for Analytics”. DuckDB is like SQLite in not requiring a server process, but like server-based databases such as PostgreSQL in terms of support for advanced SQL features and data types.
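To give a concrete sense of “SQLite for Analytics”, here is a minimal sketch (not part of the analysis below) showing that DuckDB runs ordinary SQL through DBI with no server process, here against a throwaway in-memory database:

con <- dbConnect(duckdb::duckdb())  # no dbdir argument, so an in-memory database
dbGetQuery(con, "SELECT 42 AS answer")
dbDisconnect(con, shutdown = TRUE)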
While storing data in a DuckDB database offers some of the same benefits as SQLite (e.g., data compression), the real benefits of using DuckDB come from using the database engine for data analytics. For the most part, Tidy Finance with R uses SQLite for storage and uses dplyr and in-memory data frames for analysis. For example, in the chapter on beta estimation, the data are read into memory immediately using collect() before any analysis is conducted. However, the dbplyr package allows many analytical tasks to be performed in the database. In this note, I demonstrate how using DuckDB and dbplyr can lead to significant performance gains.
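To see what “performed in the database” means, the sketch below (assuming a remote table such as the crsp_daily table created later in this note) uses show_query() to display the SQL that dbplyr would send to the database rather than executing the pipeline:

crsp_daily |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |>
  show_query()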
Getting data
There are two data sets that we need to collect. The first is the factor returns, which we collect from Ken French’s website using the frenchdata package. The second is from CRSP’s daily stock file, which we get from WRDS.
We start by loading four packages. Note that we load DBI rather than the underlying database driver package.1 In addition to these four packages, you should have the duckdb, RSQLite, and RPostgres packages installed. Use install.packages() in R to install any missing packages.
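For example, any that are missing can be installed in one call:

install.packages(c("duckdb", "RSQLite", "RPostgres"))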
Next we set up a DuckDB database file in the data directory, creating this directory if need be. We set read_only = FALSE because we will want to write to this database connection.
if (!dir.exists("data")) dir.create("data")
tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = FALSE
)
Fama-French factor returns
We use the same start_date and end_date values used in “Tidy Finance with R” and the code below is also adapted from that book. However, we use the copy_to() function from dplyr to save the table to our database.
<- ymd("1960-01-01")
start_date <- ymd("2021-12-31")
end_date
factors_ff_daily_raw <-
  download_french_data("Fama/French 3 Factors [Daily]")
New names:
• `` -> `...1`
factors_ff_daily <-
  factors_ff_daily_raw$subsets$data[[1]] |>
  mutate(
    date = ymd(date),
    across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),
    .keep = "none"
  ) |>
  rename_with(str_to_lower) |>
  rename(mkt_excess = `mkt-rf`) |>
  filter(date >= start_date & date <= end_date) |>
  copy_to(tidy_finance,
          df = _,
          name = "factors_ff_daily",
          temporary = FALSE,
          overwrite = TRUE)
Getting daily returns from WRDS
Next, I specify the connection details as follows. I recommend using environment variables (e.g., set using Sys.setenv()), as this facilitates sharing code with others. You should not include this chunk of code in your scripts; rather, run it before executing your other code. In addition to setting these environment variables, you may want to set PGPASSWORD too. (Hopefully it is obvious that you should use your WRDS ID and password, not mine.)
Sys.setenv(PGHOST = "wrds-pgdata.wharton.upenn.edu",
PGPORT = 9737L,
PGDATABASE = "wrds",
PGUSER = Sys.getenv("WRDS_USER"),
PGPASSWORD = Sys.getenv("WRDS_PASSWORD"))
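If you would rather not run such a chunk in each session, one alternative (a workflow suggestion on my part, not something done in this note) is to put the WRDS credentials in your ~/.Renviron file, which R reads at start-up; the values below are placeholders:

# Contents of ~/.Renviron (plain key=value pairs, no R syntax)
WRDS_USER=your_wrds_id
WRDS_PASSWORD=your_wrds_password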
We then connect to the CRSP daily stock file in the WRDS PostgreSQL database.
pg <- dbConnect(RPostgres::Postgres())
dsf_db <- tbl(pg, Id(schema = "crsp", table = "dsf"))
As we can see, we have access to data in crsp.dsf.
dsf_db
# Source: table<"crsp"."dsf"> [?? x 20]
# Database: postgres [pweiss@wrds-pgdata.wharton.upenn.edu:9737/wrds]
cusip permno permco issuno hexcd hsiccd date bidlo askhi
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <date> <dbl> <dbl>
1 68391610 10000 7952 10396 3 3990 1986-01-07 2.38 2.75
2 68391610 10000 7952 10396 3 3990 1986-01-08 2.38 2.62
3 68391610 10000 7952 10396 3 3990 1986-01-09 2.38 2.62
4 68391610 10000 7952 10396 3 3990 1986-01-10 2.38 2.62
5 68391610 10000 7952 10396 3 3990 1986-01-13 2.5 2.75
# ℹ more rows
# ℹ 11 more variables: prc <dbl>, vol <dbl>, ret <dbl>, bid <dbl>,
# ask <dbl>, shrout <dbl>, cfacpr <dbl>, cfacshr <dbl>,
# openprc <dbl>, numtrd <dbl>, retx <dbl>
Before proceeding with our first benchmark, we will make a version of system.time() that works with assignment.2
system_time <- function(x) {
  print(system.time(x))
  x
}
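As a quick sanity check (a throwaway example, not part of the analysis), the timing is printed and the value of the pipeline is still assigned:

res <- runif(1e7) |> sort() |> system_time()
length(res)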
The following code is adapted from the Tidy Finance code here. But the original code is much more complicated and takes slightly longer to run.3
rs <- dbExecute(tidy_finance, "DROP TABLE IF EXISTS crsp_daily")

crsp_daily <-
  dsf_db |>
  filter(between(date, start_date, end_date),
         !is.na(ret)) |>
  select(permno, date, ret) |>
  mutate(month = as.Date(floor_date(date, "month"))) |>
  copy_to(tidy_finance, df = _, name = "dsf_temp") |>
  left_join(factors_ff_daily |>
              select(date, rf), by = "date") |>
  mutate(
    ret_excess = ret - rf,
    ret_excess = pmax(ret_excess, -1, na.rm = TRUE)
  ) |>
  select(permno, date, month, ret_excess) |>
  compute(name = "crsp_daily", temporary = FALSE, overwrite = TRUE) |>
  system_time()
user system elapsed
56.53 2.64 256.12
Saving data to SQLite
If you have been working through “Tidy Finance”, you may already have an SQLite database containing crsp_daily. If not, we can easily create one now and copy the table from our DuckDB database to SQLite.
tidy_finance_sqlite <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)

copy_to(tidy_finance_sqlite,
        crsp_daily,
        name = "crsp_daily",
        overwrite = TRUE,
        temporary = FALSE)
dbExecute(tidy_finance_sqlite, "VACUUM")
We can also save the data to a parquet file.
dbExecute(tidy_finance,
"COPY crsp_daily TO 'data/crsp_daily.parquet'
(FORMAT 'PARQUET')")
Having created our two databases, we disconnect from them. This mimics the most common “write-once, read-many” pattern for using databases.
dbDisconnect(tidy_finance_sqlite)
dbDisconnect(tidy_finance, shutdown = TRUE)
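As an aside on the storage point made earlier, one could compare the sizes of the three files on disk (a side check I have not benchmarked; the database files may also contain other tables, so the comparison is only rough):

file.size("data/tidy_finance.duckdb")
file.size("data/tidy_finance.sqlite")
file.size("data/crsp_daily.parquet")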
Benchmarking a simple aggregation query
The following is a simple comparison of several different ways of doing some basic data analysis with R. After running the code above, we have the table crsp_daily as described in Tidy Finance in two separate databases (a SQLite database and a DuckDB database) and in a parquet file. We then examine the same query processed in five different ways:
- Using dplyr on an in-memory data frame
- Using dbplyr with an SQLite database
- Using dbplyr with a DuckDB database
- Using dbplyr with DuckDB and a parquet file
- Using dplyr with the arrow library and a parquet file
dplyr
We first need to load the data into memory.
tidy_finance <- dbConnect(
  RSQLite::SQLite(),
  "data/tidy_finance.sqlite",
  extended_types = TRUE
)

crsp_daily <- tbl(tidy_finance, "crsp_daily")
What takes most time is simply loading nearly 2GB of data into memory.
crsp_daily_local <-
  crsp_daily |>
  collect() |>
  system_time()
user system elapsed
208.87 2.75 280.47
Once the data are in memory, it is relatively quick to run a summary query.
crsp_daily_local |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |>
  collect() |>
  system_time()
user system elapsed
4.79 1.18 8.08
# A tibble: 744 × 2
month ret
<date> <dbl>
1 1960-01-01 -0.00213
2 1960-02-01 0.000325
3 1960-03-01 -0.00115
4 1960-04-01 -0.00106
5 1960-05-01 0.00114
# ℹ 739 more rows
rm(crsp_daily_local)
dbplyr with SQLite
Things are faster with SQLite, though there’s no obvious way to split the time between reading the data and performing the aggregation. Note that we have a collect() at the end. This will not take a noticeable amount of time, but seems to be a reasonable step if our plan is to analyse the aggregated data in R.
crsp_daily |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |>
  collect() |>
  system_time()
user system elapsed
38.6 11.0 63.5
# A tibble: 744 × 2
month ret
<date> <dbl>
1 1960-01-01 -0.00213
2 1960-02-01 0.000325
3 1960-03-01 -0.00115
4 1960-04-01 -0.00106
5 1960-05-01 0.00114
# ℹ 739 more rows
dbDisconnect(tidy_finance)
dbplyr with DuckDB
Let’s consider DuckDB. Note that we are only reading the data here, so we set read_only = TRUE in connecting to the database. Apart from the connection, there is no difference between the code here and the code above using SQLite.
tidy_finance <- dbConnect(
  duckdb::duckdb(),
  "data/tidy_finance.duckdb",
  read_only = TRUE)

crsp_daily <- tbl(tidy_finance, "crsp_daily")
crsp_daily |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |>
  collect() |>
  system_time()
user system elapsed
0.51 0.11 0.50
# A tibble: 744 × 2
month ret
<date> <dbl>
1 1991-12-01 0.00237
2 1992-05-01 0.000740
3 1993-01-01 0.00413
4 1993-05-01 0.00265
5 2010-10-01 0.00198
# ℹ 739 more rows
Having done our benchmarks, we can take a quick peek at the data.
crsp_daily |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |>
  arrange(month) |>
  collect()
# A tibble: 744 × 2
month ret
<date> <dbl>
1 1960-01-01 -0.00213
2 1960-02-01 0.000325
3 1960-03-01 -0.00115
4 1960-04-01 -0.00106
5 1960-05-01 0.00114
# ℹ 739 more rows
Finally, we disconnect from the database. This will happen automatically if we close R, etc., and is less important if we have read_only = TRUE (so there is no lock on the file), but we keep things tidy here.
dbDisconnect(tidy_finance, shutdown = TRUE)
dbplyr with DuckDB and a parquet file
Let’s do the benchmark using the parquet data.
db <- dbConnect(duckdb::duckdb())
crsp_daily <- tbl(db, "read_parquet('data/crsp_daily.parquet')")
crsp_daily |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |>
  collect() |>
  system_time()
user system elapsed
1.63 0.73 0.51
# A tibble: 744 × 2
month ret
<date> <dbl>
1 1970-01-01 -0.00210
2 1971-05-01 -0.00262
3 1973-02-01 -0.00415
4 1973-11-01 -0.00927
5 1967-05-01 -0.000372
# ℹ 739 more rows
dbDisconnect(db, shutdown = TRUE)
The arrow library with a parquet file
Let’s do one more benchmark using the parquet data with the arrow library.
crsp_daily <- open_dataset("data/crsp_daily.parquet")

crsp_daily |>
  group_by(month) |>
  summarize(ret = mean(ret_excess, na.rm = TRUE)) |>
  collect() |>
  system_time()
user system elapsed
0.25 0.01 0.76
# A tibble: 744 × 2
month ret
<date> <dbl>
1 1989-01-01 0.00285
2 1986-06-01 0.000194
3 1986-07-01 -0.00352
4 1986-08-01 0.00115
5 1986-09-01 -0.00279
# ℹ 739 more rows
Footnotes
1. This is how it’s done in “R for Data Science”. I have read comments by Hadley Wickham that this is the right way to do it, but I can’t find those comments.
2. If we put system.time() at the end of this pipe, then crsp_daily would hold the value returned by that function rather than the result of the pipeline preceding it. At first, the system_time() function may seem like magic, but Hadley Wickham explained to me that this works because of lazy evaluation, which is discussed in “Advanced R” here. Essentially, x is evaluated just once (inside system.time()) and its value is returned in the next line.

3. Performance will vary according to the speed of your connection to WRDS. Note that this query does temporarily use a significant amount of RAM on my machine; it is not clear that DuckDB would use as much RAM if memory were more constrained. If necessary, you can run (say) dbExecute(tidy_finance, "SET memory_limit='1GB'") to constrain DuckDB’s memory usage; doing so has little impact on performance for this query.