Accessing and Managing Financial Data

Note

You are reading Tidy Finance with R. The equivalent chapter is also available in the sibling Tidy Finance with Python.

In this chapter, we suggest a way to organize your financial data. Everybody who has experience with data is also familiar with storing data in various formats like CSV, XLS, XLSX, or other delimited value storage. Reading and saving data can become very cumbersome when you use different data formats, both across different projects and across different programming languages. Moreover, storing data in delimited files often leads to problems with respect to column type consistency. For instance, date-type columns frequently lead to inconsistencies across different data formats and programming languages.
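To make the column type problem concrete, the following minimal sketch writes a small data frame with a date column to a temporary CSV file and reads it back with base R. The data frame and its values are made up for illustration only.

```r
# Toy data with a proper Date column (values are illustrative only)
returns <- data.frame(
  date = as.Date(c("2023-01-31", "2023-02-28")),
  ret = c(0.01, -0.02)
)

# Round trip through a delimited text file
path <- tempfile(fileext = ".csv")
write.csv(returns, path, row.names = FALSE)
returns_csv <- read.csv(path)

class(returns$date)     # the original column is a Date
class(returns_csv$date) # base R reads the column back as plain text
```

The type information is not part of the file, so every reader has to guess or be told how to parse the column.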

This chapter shows how to import different open source data sets. Specifically, our data comes from Prof. Kenneth French’s finance data library, a standard CSV file hosted on the q-factor authors’ homepage, an XLSX file stored in a public Google Drive repository, and macroeconomic time series that can be downloaded directly from the FRED website. We show how to process these raw data, as well as how to take a shortcut using the tidyfinance package, which provides a consistent interface to tidy financial data. We store all the data in a single database, which serves as the only source of data in subsequent chapters. We conclude the chapter by providing some tips on managing databases.

First, we load the global R packages that we use throughout this chapter. Later on, we load more packages in the sections where we need them.

library(tidyverse)
library(tidyfinance)
library(scales)

Moreover, we initially define the date range for which we fetch and store the financial data, making future data updates tractable. In case you need another time frame, you can adjust the dates below. Our data starts with 1960 since most asset pricing studies use data from 1962 on.

start_date <- ymd("1960-01-01")
end_date <- ymd("2023-12-31")

Fama-French Data

We start by downloading some famous Fama-French factors (e.g., Fama and French 1993) and portfolio returns commonly used in empirical asset pricing. Fortunately, there is a neat package by Nelson Areal that allows us to access the data easily: the frenchdata package provides functions to download and read data sets from Prof. Kenneth French’s finance data library (Areal 2021).

library(frenchdata)

We can use the download_french_data() function of the package to download monthly Fama-French factors. The set Fama/French 3 Factors contains the return time series of the market (mkt_excess), size (smb), and value (hml) factors alongside the risk-free rates (rf). Note that we have to do some manual work to correctly parse all the columns and scale them appropriately, as the raw Fama-French data comes in a rather impractical data format. For precise descriptions of the variables, we suggest consulting Prof. Kenneth French’s finance data library directly. If you are on the website, check the raw data files to appreciate the time you can save thanks to frenchdata.

factors_ff3_monthly_raw <- download_french_data("Fama/French 3 Factors")
factors_ff3_monthly <- factors_ff3_monthly_raw$subsets$data[[1]] |>
  mutate(
    date = floor_date(ymd(str_c(date, "01")), "month"),
    across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),
    .keep = "none"
  ) |>
  rename_with(str_to_lower) |>
  rename(mkt_excess = `mkt-rf`) |> 
  filter(date >= start_date & date <= end_date)
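The two manual steps above, date parsing and rescaling, can be illustrated on a tiny stand-in for the raw table. This is only a sketch: the tibble below is hypothetical and its numbers are made up; it merely assumes that dates arrive in yyyymm form and returns as percentages stored as text.

```r
library(dplyr)
library(lubridate)
library(stringr)

# Hypothetical raw rows: yyyymm dates and percent returns stored as text
raw <- tibble(
  date = c(196001L, 196002L),
  `Mkt-RF` = c("-1.50", "2.25"),
  RF = c("0.33", "0.29")
)

parsed <- raw |>
  mutate(
    # Append a day so that 196001 becomes "19600101", then parse it
    date = ymd(str_c(date, "01")),
    # Convert text to numbers and percentages to decimals
    across(c(RF, `Mkt-RF`), ~ as.numeric(.) / 100),
    .keep = "none"
  )

parsed
```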

We also download the set 5 Factors (2x3), which additionally includes the return time series of the profitability rmw and investment cma factors. We demonstrate how the monthly factors are constructed in the chapter Replicating Fama and French Factors.

factors_ff5_monthly_raw <- download_french_data("Fama/French 5 Factors (2x3)")

factors_ff5_monthly <- factors_ff5_monthly_raw$subsets$data[[1]] |>
  mutate(
    date = floor_date(ymd(str_c(date, "01")), "month"),
    across(c(RF, `Mkt-RF`, SMB, HML, RMW, CMA), ~as.numeric(.) / 100),
    .keep = "none"
  ) |>
  rename_with(str_to_lower) |>
  rename(mkt_excess = `mkt-rf`) |> 
  filter(date >= start_date & date <= end_date)

It is straightforward to download the corresponding daily Fama-French factors with the same function.

factors_ff3_daily_raw <- download_french_data("Fama/French 3 Factors [Daily]")

factors_ff3_daily <- factors_ff3_daily_raw$subsets$data[[1]] |>
  mutate(
    date = ymd(date),
    across(c(RF, `Mkt-RF`, SMB, HML), ~as.numeric(.) / 100),
    .keep = "none"
  ) |>
  rename_with(str_to_lower) |>
  rename(mkt_excess = `mkt-rf`) |>
  filter(date >= start_date & date <= end_date)

In a subsequent chapter, we also use the 10 monthly industry portfolios, so let us fetch that data, too.

industries_ff_monthly_raw <- download_french_data("10 Industry Portfolios")

industries_ff_monthly <- industries_ff_monthly_raw$subsets$data[[1]] |>
  mutate(date = floor_date(ymd(str_c(date, "01")), "month")) |>
  mutate(across(where(is.numeric), ~ . / 100)) |>
  select(date, everything()) |>
  filter(date >= start_date & date <= end_date) |> 
  rename_with(str_to_lower)

It is worth taking a look at all available portfolio return time series from Kenneth French’s homepage. You should check out the other sets by calling get_french_data_list().

To automatically download and process Fama-French data, you can also use the tidyfinance package with type = "factors_ff_3_monthly" or similar, e.g.:

download_data(
  type = "factors_ff_3_monthly", 
  start_date = start_date, 
  end_date = end_date
)

The tidyfinance package implements the processing steps as above and returns the same cleaned data frame. The list of supported Fama-French data types can be called as follows:

list_supported_types(domain = "Fama-French")

q-Factors

In recent years, the academic discourse experienced the rise of alternative factor models, e.g., in the form of the Hou, Xue, and Zhang (2014) q-factor model. We refer to the extended background information provided by the original authors for further information. The q-factors can be downloaded directly from the authors’ homepage by passing the link to read_csv().

We also need to adjust this data. First, we discard information we will not use in the remainder of the book. Then, we remove the “R_” prefix from the column names using regular expressions and write all column names in lowercase. You should always try sticking to a consistent style for naming objects, which we try to illustrate here (the emphasis is on try). You can check out style guides available online, e.g., Hadley Wickham’s tidyverse style guide.

factors_q_monthly_link <-
  "https://global-q.org/uploads/1/2/2/6/122679606/q5_factors_monthly_2023.csv"

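factors_q_monthly <- read_csv(factors_q_monthly_link) |>
  # Build the date and drop the year and month columns used to construct it
  mutate(date = ymd(str_c(year, month, "01", sep = "-")), .keep = "unused") |>
  # Strip the "R_" prefix and switch to lowercase column names
  rename_with(~str_remove(., "^R_")) |>
  rename_with(str_to_lower) |>
  # Convert percentages to decimals
  mutate(across(-date, ~. / 100)) |>
  select(date, risk_free = f, mkt_excess = mkt, everything()) |>
  filter(date >= start_date & date <= end_date)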
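The renaming logic is easiest to follow on a small example. The sketch below uses a hypothetical one-row excerpt with made-up values; it only assumes that the raw columns carry an “R_” prefix as described above.

```r
library(dplyr)
library(stringr)

# Hypothetical excerpt with the "R_" prefix (values are illustrative only)
toy_q <- tibble(
  date = as.Date("2023-01-01"),
  R_F = 0.35,
  R_MKT = 6.5,
  R_ME = 4.4
)

toy_q_clean <- toy_q |>
  rename_with(~ str_remove(., "^R_")) |> # "^" anchors the match at the start
  rename_with(str_to_lower) |>
  mutate(across(-date, ~ . / 100)) |>    # percent to decimal
  select(date, risk_free = f, mkt_excess = mkt, everything())

names(toy_q_clean)
```

Anchoring the pattern with "^" is a small safeguard: it prevents the expression from touching an “R_” that appears in the middle of a column name.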

Again, you can use the tidyfinance package for a shortcut:

download_data(
  type = "factors_q5_monthly", 
  start_date = start_date, 
  end_date = end_date
)

Macroeconomic Predictors

Our next data source is a set of macroeconomic variables often used as predictors for the equity premium. Welch and Goyal (2008) comprehensively reexamine the performance of variables suggested by the academic literature to be good predictors of the equity premium. The authors host the data updated to 2022 on Amit Goyal’s website. The data is an XLSX file stored in a public Google Drive location, which we directly export as a CSV file.

sheet_id <- "1bM7vCWd3WOt95Sf9qjLPZjoiafgF_8EG"
sheet_name <- "Monthly"
macro_predictors_url <- paste0(
  "https://docs.google.com/spreadsheets/d/", sheet_id,
  "/gviz/tq?tqx=out:csv&sheet=", sheet_name
)
macro_predictors_raw <- read_csv(macro_predictors_url)

Next, we transform the columns into the variables that we later use:

The dividend price ratio (dp), the difference between the log of dividends and the log of prices, where dividends are 12-month moving sums of dividends paid on the S&P 500 index, and prices are monthly averages of daily closing prices (Campbell and Shiller 1988; Campbell and Yogo 2006).
Dividend yield (dy), the difference between the log of dividends and the log of lagged prices (Ball 1978).
Earnings price ratio (ep), the difference between the log of earnings and the log of prices, where earnings are 12-month moving sums of earnings on the S&P 500 index (Campbell and Shiller 1988).
Dividend payout ratio (de), the difference between the log of dividends and the log of earnings (Lamont 1998).
Stock variance (svar), the sum of squared daily returns on the S&P 500 index (Guo 2006).
Book-to-market ratio (bm), the ratio of book value to market value for the Dow Jones Industrial Average (Kothari and Shanken 1997).
Net equity expansion (ntis), the ratio of 12-month moving sums of net issues by NYSE listed stocks to the total end-of-year market capitalization of NYSE stocks (Campbell, Hilscher, and Szilagyi 2008).
Treasury bills (tbl), the 3-Month Treasury Bill: Secondary Market Rate from the economic research database at the Federal Reserve Bank of St. Louis (Campbell 1987).
Long-term yield (lty), the long-term government bond yield from Ibbotson’s Stocks, Bonds, Bills, and Inflation Yearbook (Welch and Goyal 2008).
Long-term rate of returns (ltr), the long-term government bond returns from Ibbotson’s Stocks, Bonds, Bills, and Inflation Yearbook (Welch and Goyal 2008).
Term spread (tms), the difference between the long-term yield on government bonds and the Treasury bill (Campbell 1987).
Default yield spread (dfy), the difference between BAA and AAA-rated corporate bond yields (Fama and French 1989).
Inflation (infl), the Consumer Price Index (All Urban Consumers) from the Bureau of Labor Statistics (Campbell and Vuolteenaho 2004).

For variable definitions and the required data transformations, you can consult the material on Amit Goyal’s website.

macro_predictors <- macro_predictors_raw |>
  mutate(date = ym(yyyymm)) |>
  mutate(across(where(is.character), as.numeric)) |>
  mutate(
    IndexDiv = Index + D12,
    logret = log(IndexDiv) - log(lag(IndexDiv)),
    Rfree = log(Rfree + 1),
    rp_div = lead(logret - Rfree, 1), # Future excess market return
    dp = log(D12) - log(Index), # Dividend Price ratio
    dy = log(D12) - log(lag(Index)), # Dividend yield
    ep = log(E12) - log(Index), # Earnings price ratio
    de = log(D12) - log(E12), # Dividend payout ratio
    tms = lty - tbl, # Term spread
    dfy = BAA - AAA # Default yield spread
  ) |>
  select(
    date, rp_div, dp, dy, ep, de, svar,
    bm = `b/m`, ntis, tbl, lty, ltr,
    tms, dfy, infl
  ) |>
  filter(date >= start_date & date <= end_date) |>
  drop_na()
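The role of lag() and lead() in these transformations can be seen on a three-row toy series. This is a sketch under simplifying assumptions: the numbers are made up, the risk-free rate is omitted, and the column rp_next is a hypothetical stand-in for rp_div.

```r
library(dplyr)

# Made-up index levels with 12-month dividend and earnings sums
toy_macro <- tibble(
  Index = c(100, 110, 121),
  D12 = c(2, 2.2, 2.42),
  E12 = c(5, 5.5, 6.05)
) |>
  mutate(
    dp = log(D12) - log(Index),      # uses the current price
    dy = log(D12) - log(lag(Index)), # uses last period's price
    de = log(D12) - log(E12),
    logret = log(Index + D12) - log(lag(Index + D12)),
    rp_next = lead(logret)           # next period's return, aligned to today
  )

toy_macro
```

The first row has no lagged price and the last row has no future return, so both contain missing values. This is why the pipeline above ends with drop_na().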

To get the equivalent data through tidyfinance, you can call:

download_data(
  type = "macro_predictors_monthly",
  start_date = start_date,
  end_date = end_date
)

Other Macroeconomic Data

The Federal Reserve Bank of St. Louis provides the Federal Reserve Economic Data (FRED), an extensive database for macroeconomic data. In total, there are 817,000 US and international time series from 108 different sources. The data can be downloaded directly from FRED by constructing the appropriate URL. For instance, let us consider the consumer price index (CPI) data that can be found under the CPIAUCNS key:

series <- "CPIAUCNS"
cpi_url <- paste0(
  "https://fred.stlouisfed.org/graph/fredgraph.csv?id=", series
)

We can then use the httr2 (Wickham 2024) package to request the CSV, extract the data from the response body, and convert the columns to a tidy format:

library(httr2)

cpi_daily <- request(cpi_url) |>
  req_perform() |>
  resp_body_string() |>
  read_csv() |>
  mutate(
    date = as.Date(observation_date),
    value = as.numeric(.data[[series]]),
    series = series,
    .keep = "none"
  )

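Although the object is called cpi_daily, FRED reports CPIAUCNS at a monthly frequency. We therefore only align the dates to the first day of each month and express the index relative to its most recent value, because we use this normalized monthly series in later chapters.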

cpi_monthly <- cpi_daily |>
  mutate(
    date = floor_date(date, "month"),
    cpi = value / value[date == max(date)],
    .keep = "none"
  )
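The normalization step divides every observation by the value at the latest date, so the most recent observation equals one. A minimal sketch with made-up index levels:

```r
library(dplyr)
library(lubridate)

# Made-up index levels for three consecutive months
toy_cpi <- tibble(
  date = as.Date(c("2023-10-01", "2023-11-01", "2023-12-01")),
  value = c(250, 255, 260)
)

toy_cpi_monthly <- toy_cpi |>
  mutate(
    date = floor_date(date, "month"),
    # value[date == max(date)] picks the most recent observation
    cpi = value / value[date == max(date)],
    .keep = "none"
  )

toy_cpi_monthly
```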

The tidyfinance package can, of course, also fetch the same raw series and many more data series:

download_data(
  type = "fred",
  series = "CPIAUCNS",
  start_date = start_date,
  end_date = end_date
)

# A tibble: 768 × 3
  date       value series  
  <date>     <dbl> <chr>   
1 1960-01-01  29.3 CPIAUCNS
2 1960-02-01  29.4 CPIAUCNS
3 1960-03-01  29.4 CPIAUCNS
4 1960-04-01  29.5 CPIAUCNS
5 1960-05-01  29.5 CPIAUCNS
# ℹ 763 more rows

To download other time series, we just have to look them up on the FRED website and extract the corresponding key from the address. For instance, the producer price index for gold ores can be found under the PCU2122212122210 key. If your desired time series is not supported through tidyfinance, we recommend working with the fredr package (Boysel and Vaughan 2021). Note that you need to get an API key to use its functionality. We refer to the package documentation for details.
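Since the download address only differs in the series key, the URL construction from above can be wrapped in a small helper. The function name fred_csv_url() is hypothetical and not part of any package; the sketch only builds the address and does not perform a request.

```r
# Hypothetical helper: build the CSV download address for a FRED series key
fred_csv_url <- function(series) {
  paste0("https://fred.stlouisfed.org/graph/fredgraph.csv?id=", series)
}

gold_url <- fred_csv_url("PCU2122212122210")
gold_url
```

The resulting address can be passed to the same httr2 pipeline that we used for the CPI series.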

Setting Up a Database

Now that we have downloaded some (freely available) data from the web into the memory of our R session, let us set up a database to store that information for future use. We will use the data stored in this database throughout the following chapters, but you could alternatively implement a different strategy and replace the respective code.

There are many ways to set up and organize a database, depending on the use case. For our purpose, the most efficient way is to use an SQLite database. SQLite is a C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. Note that SQL (Structured Query Language) is a standard language for accessing and manipulating databases and heavily inspired the dplyr functions. We refer to this tutorial for more information on SQL.

There are two packages that make working with SQLite in R very simple: RSQLite (Müller et al. 2022) embeds the SQLite database engine in R, and dbplyr (Wickham, Girlich, and Ruiz 2022) is the database back-end for dplyr. These packages allow us to set up a database to remotely store tables and use these remote database tables as if they were in-memory data frames by automatically translating dplyr code into SQL. Check out the RSQLite and dbplyr vignettes for more information.

library(RSQLite)
library(dbplyr)

An SQLite database is easily created - the code below is really all there is. You do not need any external software. Note that we use the extended_types = TRUE option to enable date types when storing and fetching data. Otherwise, date columns are stored and retrieved as integers. We will use the file tidy_finance_r.sqlite, located in the data subfolder, to retrieve data for all subsequent chapters. The initial part of the code ensures that the directory is created if it does not already exist.

if (!dir.exists("data")) {
  dir.create("data")
}

tidy_finance <- dbConnect(
  SQLite(),
  "data/tidy_finance_r.sqlite",
  extended_types = TRUE
)
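To see what extended_types changes, the following sketch writes the same toy table to two temporary in-memory databases, one with the option and one without, and compares the column class after reading the table back. The table and its values are made up for illustration.

```r
library(DBI)
library(RSQLite)

# Toy table with a Date column (values are illustrative only)
toy <- data.frame(date = as.Date("2023-12-01"), rf = 0.004)

# With extended_types = TRUE, the date survives the round trip
con_ext <- dbConnect(SQLite(), ":memory:", extended_types = TRUE)
dbWriteTable(con_ext, "toy", toy)
toy_ext <- dbReadTable(con_ext, "toy")
dbDisconnect(con_ext)

# Without the option, the date comes back as a plain number
con_plain <- dbConnect(SQLite(), ":memory:")
dbWriteTable(con_plain, "toy", toy)
toy_plain <- dbReadTable(con_plain, "toy")
dbDisconnect(con_plain)

class(toy_ext$date)
class(toy_plain$date)
```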

Next, we create a remote table with the monthly Fama-French factor data. We do so with the function dbWriteTable(), which copies the data to our SQLite database.

dbWriteTable(
  tidy_finance,
  "factors_ff3_monthly",
  value = factors_ff3_monthly,
  overwrite = TRUE
)

We can work with the remote table as if it were an in-memory data frame by creating a reference to it via tbl().

factors_ff3_monthly_db <- tbl(tidy_finance, "factors_ff3_monthly")

All dplyr calls are evaluated lazily, i.e., the data is not in our R session’s memory, and the database does most of the work. You can see that by noticing that the output below does not show the number of rows. In fact, the following code chunk only fetches the first few rows from the database for printing.

factors_ff3_monthly_db |>
  select(date, rf)

# Source:   SQL [?? x 2]
# Database: sqlite 3.47.1 [data/tidy_finance_r.sqlite]
  date           rf
  <date>      <dbl>
1 1960-01-01 0.0033
2 1960-02-01 0.0029
3 1960-03-01 0.0035
4 1960-04-01 0.0019
5 1960-05-01 0.0027
# ℹ more rows

If we want to have the whole table in memory, we need to collect() it. You will see that we regularly load the data into the memory in the next chapters.

factors_ff3_monthly_db |>
  select(date, rf) |>
  collect()

# A tibble: 768 × 2
  date           rf
  <date>      <dbl>
1 1960-01-01 0.0033
2 1960-02-01 0.0029
3 1960-03-01 0.0035
4 1960-04-01 0.0019
5 1960-05-01 0.0027
# ℹ 763 more rows
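The difference between a lazy query and collected data can also be inspected directly. The sketch below uses a temporary in-memory database and a made-up table (toy_factors) rather than our actual data, and it assumes that dbplyr is installed. show_query() prints the SQL that dplyr generates without running it.

```r
library(dplyr)
library(DBI)
library(RSQLite)

# Temporary in-memory database with a made-up table
con <- dbConnect(SQLite(), ":memory:")
dbWriteTable(
  con, "toy_factors",
  data.frame(id = 1:3, rf = c(0.001, 0.002, 0.003))
)

# Nothing is fetched here: this object only describes a query
lazy_query <- tbl(con, "toy_factors") |>
  filter(rf > 0.0015) |>
  select(id, rf)

show_query(lazy_query)           # prints the generated SQL
in_memory <- collect(lazy_query) # runs the query and returns a tibble

dbDisconnect(con)
```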

The last couple of code chunks are really all there is to organizing a simple database! You can also share the SQLite database across devices and programming languages.

Before we move on to the next data source, let us also store the other six tables in our new SQLite database.

dbWriteTable(
  tidy_finance,
  "factors_ff5_monthly",
  value = factors_ff5_monthly,
  overwrite = TRUE
)

dbWriteTable(
  tidy_finance,
  "factors_ff3_daily",
  value = factors_ff3_daily,
  overwrite = TRUE
)

dbWriteTable(
  tidy_finance,
  "industries_ff_monthly",
  value = industries_ff_monthly,
  overwrite = TRUE
)

dbWriteTable(
  tidy_finance,
  "factors_q_monthly",
  value = factors_q_monthly,
  overwrite = TRUE
)

dbWriteTable(
  tidy_finance,
  "macro_predictors",
  value = macro_predictors,
  overwrite = TRUE
)

dbWriteTable(
  tidy_finance,
  "cpi_monthly",
  value = cpi_monthly,
  overwrite = TRUE
)
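If you prefer less repetition, the same pattern can be written as a loop over a named list of data frames. This is only a sketch of an alternative design: it uses a temporary in-memory database and two made-up tables, so it does not touch the tidy_finance connection.

```r
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), ":memory:")

# Made-up tables; the list names become the table names in the database
toy_tables <- list(
  table_a = data.frame(x = 1:2),
  table_b = data.frame(y = c("u", "v"))
)

for (table_name in names(toy_tables)) {
  dbWriteTable(
    con,
    table_name,
    value = toy_tables[[table_name]],
    overwrite = TRUE
  )
}

written <- dbListTables(con)
dbDisconnect(con)
written
```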

From now on, all you need to do to access data that is stored in the database is to follow three steps: (i) Establish the connection to the SQLite database, (ii) call the table you want to extract, and (iii) collect the data. For your convenience, the following steps show all you need in a compact fashion.

library(tidyverse)
library(RSQLite)

tidy_finance <- dbConnect(
  SQLite(),
  "data/tidy_finance_r.sqlite",
  extended_types = TRUE
)

factors_q_monthly <- tbl(tidy_finance, "factors_q_monthly")
factors_q_monthly <- factors_q_monthly |> collect()

Managing SQLite Databases

Finally, at the end of our data chapter, we revisit the SQLite database itself. When you drop database objects such as tables or delete data from tables, the database file size remains unchanged because SQLite just marks the deleted objects as free and reserves their space for future uses. As a result, the database file never shrinks on its own and tends to grow over time.

To optimize the database file, you can run the VACUUM command in the database, which rebuilds the database and frees up unused space. You can execute the command in the database using the dbSendQuery() function.

res <- dbSendQuery(tidy_finance, "VACUUM")
res

<SQLiteResult>
  SQL  VACUUM
  ROWS Fetched: 0 [complete]
       Changed: 0

The VACUUM command actually performs a couple of additional cleaning steps, which you can read about in this tutorial.

We store the result of the above query in res because the database keeps the result set open. To close open results and avoid warnings going forward, we can use dbClearResult().

dbClearResult(res)
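The effect of VACUUM on the file size can be checked with a temporary database. The sketch below is independent of our tidy_finance connection; the file and the table big_table are made up for illustration. It uses DBI's dbExecute(), which runs a statement without leaving a result set open.

```r
library(DBI)
library(RSQLite)

# Temporary database file, independent of the main database
db_path <- tempfile(fileext = ".sqlite")
con <- dbConnect(SQLite(), db_path)

# Write a larger table and drop it again: the file keeps its size
dbWriteTable(con, "big_table", data.frame(x = runif(1e5)))
dbRemoveTable(con, "big_table")
size_before <- file.size(db_path)

# Rebuild the file and release the unused space
dbExecute(con, "VACUUM")
size_after <- file.size(db_path)

dbDisconnect(con)
c(before = size_before, after = size_after)
```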

Apart from cleaning up, you might be interested in listing all the tables that are currently in your database. You can do this via the dbListTables() function.

dbListTables(tidy_finance)

 [1] "beta"                  "compustat"            
 [3] "cpi_monthly"           "crsp_daily"           
 [5] "crsp_monthly"          "factors_ff3_daily"    
 [7] "factors_ff3_monthly"   "factors_ff5_monthly"  
 [9] "factors_q_monthly"     "fisd"                 
[11] "industries_ff_monthly" "macro_predictors"     
[13] "trace_enhanced"

This function comes in handy if you are unsure about the correct naming of the tables in your database.

Key Takeaways

Importing Fama-French factors, q-factors, macroeconomic indicators, and CPI data is simplified through API calls, CSV parsing, and web scraping techniques.
The tidyfinance R package offers pre-processed access to financial datasets, reducing manual data cleaning and saving valuable time.
Creating a centralized SQLite database helps manage and organize data efficiently across projects, while maintaining reproducibility.
Structured database storage supports scalable data access, which is essential for long-term academic projects and collaborative work in finance.

Exercises

Download the monthly Fama-French factors manually from Ken French’s data library and read them in via read_csv(). Validate that you get the same data as via the frenchdata package.
Download the daily Fama-French 5 factors using the frenchdata package. Use get_french_data_list() to find the corresponding table name. After the successful download and conversion to the column format that we used above, compare the rf, mkt_excess, smb, and hml columns of factors_ff3_daily to factors_ff5_daily. Discuss any differences you might find.

References

Areal, Nelson. 2021. frenchdata: Download data sets from Kenneth’s French finance data library site. https://CRAN.R-project.org/package=frenchdata.

Ball, Ray. 1978. “Anomalies in relationships between securities’ yields and yield-surrogates.” Journal of Financial Economics 6 (2–3): 103–26. https://doi.org/10.1016/0304-405X(78)90026-0.

Boysel, Sam, and Davis Vaughan. 2021. fredr: An R client for the ’FRED’ API. https://CRAN.R-project.org/package=fredr.

Campbell, John Y. 1987. “Stock returns and the term structure.” Journal of Financial Economics 18 (2): 373–99. https://doi.org/10.1016/0304-405X(87)90045-6.

Campbell, John Y., Jens Hilscher, and Jan Szilagyi. 2008. “In search of distress risk.” The Journal of Finance 63 (6): 2899–939. https://doi.org/10.1111/j.1540-6261.2008.01416.x.

Campbell, John Y., and Robert J. Shiller. 1988. “Stock prices, earnings, and expected dividends.” The Journal of Finance 43 (3): 661–76. https://doi.org/10.1111/j.1540-6261.1988.tb04598.x.

Campbell, John Y., and Tuomo Vuolteenaho. 2004. “Inflation illusion and stock prices.” American Economic Review 94 (2): 19–23. https://www.aeaweb.org/articles?id=10.1257/0002828041301533.

Campbell, John Y., and Motohiro Yogo. 2006. “Efficient tests of stock return predictability.” Journal of Financial Economics 81 (1): 27–60. https://doi.org/10.1016/j.jfineco.2005.05.008.

Fama, Eugene F., and Kenneth R. French. 1989. “Business conditions and expected returns on stocks and bonds.” Journal of Financial Economics 25 (1): 23–49. https://doi.org/10.1016/0304-405X(89)90095-0.

———. 1993. “Common risk factors in the returns on stocks and bonds.” Journal of Financial Economics 33 (1): 3–56. https://doi.org/10.1016/0304-405X(93)90023-5.

Guo, Hui. 2006. “On the out-of-sample predictability of stock market returns.” The Journal of Business 79 (2): 645–70. https://doi.org/10.1086/499134.

Hou, Kewei, Chen Xue, and Lu Zhang. 2014. “Digesting anomalies: An investment approach.” Review of Financial Studies 28 (3): 650–705. https://doi.org/10.1093/rfs/hhu068.

Kothari, S. P., and Jay A. Shanken. 1997. “Book-to-market, dividend yield, and expected market returns: A time-series analysis.” Journal of Financial Economics 44 (2): 169–203. https://doi.org/10.1016/S0304-405X(97)00002-0.

Lamont, Owen. 1998. “Earnings and expected returns.” The Journal of Finance 53 (5): 1563–87. https://doi.org/10.1111/0022-1082.00065.

Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2022. RSQLite: SQLite interface for R. https://CRAN.R-project.org/package=RSQLite.

Welch, Ivo, and Amit Goyal. 2008. “A comprehensive look at the empirical performance of equity premium prediction.” Review of Financial Studies 21 (4): 1455–1508. https://doi.org/10.1093/rfs/hhm014.

Wickham, Hadley. 2024. httr2: Perform HTTP requests and process the responses. https://httr2.r-lib.org.

Wickham, Hadley, Maximilian Girlich, and Edgar Ruiz. 2022. dbplyr: A ’dplyr’ back end for databases. https://CRAN.R-project.org/package=dbplyr.