library(tidyverse)
library(RSQLite)
Since we published our book Tidy Finance with R, we have received feedback from readers who don’t have access to WRDS that they cannot run the code we provide. To alleviate their constraints, we decided to create a dummy database that contains the WRDS tables and corresponding columns such that all code chunks in our book can be executed with this dummy database. The resulting database can be found through this link (around 50 MB). Just download the database and put it into your data
folder (I already renamed it to tidy_finance.sqlite
). Note that we do not create dummy data for macro tables because they can be freely downloaded from the original sources - check out Accessing and Managing Financial Data.
We deliberately use the dummy label because the data is not meaningful in the sense that it allows readers to actually replicate the results of the book. For legal reasons, the data does not contain any samples of the original data. We merely generate random numbers for all columns of the tables that we use throughout the books.
If you want to see the Python code to generate the dummy data, check out the chapter in the appendix of our book: WRDS Dummy Data.
To generate the dummy database, we use the following packages:
Let us initialize a tidy_finance.sqlite
database or connect to your existing one. Be careful, if you already downloaded the data from WRDS, then the code in this blog post will overwrite your data!
<- dbConnect(
tidy_finance SQLite(),
"data/tidy_finance.sqlite",
extended_types = TRUE
)
Since we draw random numbers for most of the columns, we also define a seed to ensure that the generated numbers are replicable. We also initialize vectors of dates of different frequencies over 10 years that we then use to create yearly, monthly, and daily data, respectively.
set.seed(1234)
<- as.Date("2003-01-01")
start_date <- as.Date("2022-12-31")
end_date
<- seq(year(start_date), year(end_date), 1)
time_series_years <- seq(start_date, end_date, "1 month")
time_series_months <- seq(start_date, end_date, "1 day") time_series_days
Create stock dummy data
Let us start with the core data used throughout the book: stock and firm characteristics. We first generate a table with a cross-section of stock identifiers with unique permno
and gvkey
values, as well as associated exchcd
, exchange
, industry
, and siccd
values. The generated data is based on the characteristics of stocks in the crsp_monthly
table of the original database, ensuring that the generated stocks roughly reflect the distribution of industries and exchanges in the original data, but the identifiers and corresponding exchanges or industries do not reflect actual firms. Similarly, the permno
-gvkey
combinations are purely nonsensical and should not be used together with actual CRSP or Compustat data.
<- 100
number_of_stocks
<- tibble(
industries industry = c("Agriculture", "Construction", "Finance",
"Manufacturing", "Mining", "Public", "Retail",
"Services", "Transportation", "Utilities",
"Wholesale"),
n = c(81, 287, 4682, 8584, 1287, 1974, 1571, 4277, 1249,
457, 904),
prob = c(0.00319, 0.0113, 0.185, 0.339, 0.0508, 0.0779,
0.0620, 0.169, 0.0493, 0.0180, 0.0357)
)
<- exchanges <- tibble(
exchanges exchange = c("AMEX", "NASDAQ", "NYSE"),
n = c(2893, 17236, 5553),
prob = c(0.113, 0.671, 0.216)
)
<- 1:number_of_stocks |>
stock_identifiers map_df(
function(x) {
tibble(
permno = x,
gvkey = as.character(x + 10000),
exchange = sample(exchanges$exchange, 1,
prob = exchanges$prob),
industry = sample(industries$industry, 1,
prob = industries$prob)
|>
) mutate(
exchcd = case_when(
== "NYSE" ~ sample(c(1, 31), n()),
exchange == "AMEX" ~ sample(c(2, 32), n()),
exchange == "NASDAQ" ~ sample(c(3, 33), n())
exchange
),siccd = case_when(
== "Agriculture" ~ sample(1:999, n()),
industry == "Mining" ~ sample(1000:1499, n()),
industry == "Construction" ~ sample(1500:1799, n()),
industry == "Manufacturing" ~ sample(1800:3999, n()),
industry == "Transportation" ~ sample(4000:4899, n()),
industry == "Utilities" ~ sample(4900:4999, n()),
industry == "Wholesale" ~ sample(5000:5199, n()),
industry == "Retail" ~ sample(5200:5999, n()),
industry == "Finance" ~ sample(6000:6799, n()),
industry == "Services" ~ sample(7000:8999, n()),
industry == "Public" ~ sample(9000:9999, n())
industry
)
)
} )
Next, we construct three panels of stock data with varying frequencies: yearly, monthly, and daily. We begin by creating the stock_panel_yearly
panel. To achieve this, we combine the stock_identifiers
table with a new table containing the variable year
from time_series_years
. The expand_grid()
function ensures that we get all possible combinations of the two tables. After combining, we select only the gvkey
and year
columns for our final yearly panel.
Next, we construct the stock_panel_monthly
panel. Similar to the yearly panel, we use the expand_grid()
function to combine stock_identifiers
with a new table that has the month
variable from time_series_months
. After merging, we select the columns permno
, gvkey
, month
, siccd
, industry
, exchcd
, and exchange
to form our monthly panel.
Lastly, we create the stock_panel_daily
panel. We combine stock_identifiers
with a table containing the date
variable from time_series_days
. After merging, we retain only the permno
and date
columns for our daily panel.
<- expand_grid(
stock_panel_yearly
stock_identifiers, tibble(year = time_series_years)
|>
) select(gvkey, year)
<- expand_grid(
stock_panel_monthly
stock_identifiers, tibble(month = time_series_months)
|>
) select(permno, gvkey, month, siccd, industry, exchcd, exchange)
<- expand_grid(
stock_panel_daily
stock_identifiers, tibble(date = time_series_days)
|>
)select(permno, date)
Dummy beta
table
We then proceed to create dummy beta values for our stock_panel_monthly
table. We generate monthly beta values beta_monthly
using the rnorm()
function with a mean and standard deviation of 1. For daily beta values beta_daily
, we take the dummy monthly beta and add a small random noise to it. This noise is generated again using the rnorm()
function, but this time we divide the random values by 100 to ensure they are small deviations from the monthly beta.
<- stock_panel_monthly |>
beta_dummy mutate(
beta_monthly = rnorm(n(), mean = 1, sd = 1),
beta_daily = beta_monthly + rnorm(n()) / 100
)
dbWriteTable(
tidy_finance,"beta",
beta_dummy, overwrite = TRUE
)
Dummy compustat
table
To create dummy firm characteristics, we take all columns from the compustat
table and create random numbers between 0 and 1. For simplicity, we set the datadate
for each firm-year observation to the last day of the year, although it is empirically not the case. We then use the !!!
operator to unlist and execute a list of commands. This trick actually helps us to avoid typing the same function for each column individually.
<- c(
relevant_columns "seq", "ceq", "at", "lt", "txditc", "txdb", "itcb",
"pstkrv", "pstkl", "pstk", "capx", "oancf", "sale",
"cogs", "xint", "xsga", "be", "op", "at_lag", "inv"
)
<- unlist(
commands map(
relevant_columns, ~rlang::exprs(!!..1 := runif(n()))
)
)
<- stock_panel_yearly |>
compustat_dummy mutate(
datadate = ymd(str_c(year, "12", "31")),
!!!commands
)
dbWriteTable(
tidy_finance, "compustat",
compustat_dummy,overwrite = TRUE
)
Dummy crsp_monthly
table
The crsp_monthly
table only lacks a few more columns compared to stock_panel_monthly
: the returns ret
drawn from a normal distribution, the excess returns ret_excess
with small deviations from the returns, the shares outstanding shrout
and the last price per month altprc
both drawn from uniform distributions, and the market capitalization mktcap
as the product of shrout
and altprc
.
<- stock_panel_monthly |>
crsp_monthly_dummy mutate(
date = ceiling_date(month, "month") - 1,
ret = pmax(rnorm(n()), -1),
ret_excess = pmax(ret - runif(n(), 0, 0.0025), -1),
shrout = runif(n(), 1, 50) * 1000,
altprc = runif(n(), 0, 1000),
mktcap = shrout * altprc
|>
) group_by(permno) |>
arrange(month) |>
mutate(mktcap_lag = lag(mktcap)) |>
ungroup()
dbWriteTable(
tidy_finance, "crsp_monthly",
crsp_monthly_dummy,overwrite = TRUE
)
Dummy crsp_daily
table
The crsp_daily
table only contains a month
column and the daily excess returns ret_excess
as additional columns to stock_panel_daily
.
<- stock_panel_daily |>
crsp_daily_dummy mutate(
month = floor_date(date, "month"),
ret_excess = pmax(rnorm(n()), -1)
)
dbWriteTable(
tidy_finance,"crsp_daily",
crsp_daily_dummy, overwrite = TRUE
)
Create bond dummy data
Lastly, we move to the bond data that we use in our books.
Dummy fisd
data
To create dummy data with the structure of Mergent FISD, we calculate the empirical probabilities of actual bonds for several variables: maturity
, offering_amt
, interest_frequency
, coupon
, and sic_code
. We use these probabilities to sample a small cross-section of bonds with completely made up complete_cusip
, issue_id
, and issuer_id
.
<- 100
number_of_bonds
<- 1:number_of_bonds |>
fisd_dummy map_df(
function(x) {
tibble(
complete_cusip = str_to_upper(
str_c(
sample(c(letters, 0:9), 12, replace = TRUE),
collapse = ""
)
),
)
}|>
) mutate(
maturity = sample(time_series_days, n(), replace = TRUE),
offering_amt = sample(seq(1:100) * 100000, n(), replace = TRUE),
offering_date = maturity - sample(seq(1:25) * 365, n(),replace = TRUE),
dated_date = offering_date - sample(-10:10, n(), replace = TRUE),
interest_frequency = sample(c(0, 1, 2, 4, 12), n(), replace = TRUE),
coupon = sample(seq(0, 2, by = 0.1), n(), replace = TRUE),
last_interest_date = pmax(maturity, offering_date, dated_date),
issue_id = row_number(),
issuer_id = sample(1:250, n(), replace = TRUE),
sic_code = as.character(sample(seq(1:9)*1000, n(), replace = TRUE))
)
dbWriteTable(
tidy_finance, "fisd",
fisd_dummy, overwrite = TRUE
)
Dummy trace_enhanced
data
Finally, we create a dummy bond transaction data for the fictional CUSIPs of the dummy fisd
data. We take the date range that we also analyze in the book and ensure that we have at least five transactions per day to fulfill a filtering step in the book.
<- as.Date("2014-01-01")
start_date <- as.Date("2016-11-30")
end_date
<- expand_grid(
bonds_panel |>
fisd_dummy select(cusip_id = complete_cusip),
tibble(
trd_exctn_dt = seq(start_date, end_date, "1 day")
)
)
<- bind_rows(
trace_enhanced_dummy
bonds_panel, bonds_panel,
bonds_panel, bonds_panel, |>
bonds_panel) mutate(
trd_exctn_tm = str_c(
sample(0:24, n(), replace = TRUE), ":",
sample(0:60, n(), replace = TRUE), ":",
sample(0:60, n(), replace = TRUE)
),rptd_pr = runif(n(), 10, 200),
entrd_vol_qt = sample(1:20, n(), replace = TRUE) * 1000,
yld_pt = runif(n(), -10, 10),
rpt_side_cd = sample(c("B", "S"), n(), replace = TRUE),
cntra_mp_id = sample(c("C", "D"), n(), replace = TRUE)
)
dbWriteTable(
tidy_finance, "trace_enhanced",
trace_enhanced_dummy, overwrite = TRUE
)
As stated in the introduction, the data does not contain any samples of the original data. We merely generate random numbers for all columns of the tables that we use throughout the books. You can find the database with the dummy data here.