CRSP 2.0 Update

The highlights of the recent switch to CRSP 2.0 data
Data
R
Python
Authors
Affiliations

Reykjavik University

WU Vienna University of Economics and Business

wikifolio Financial Technologies AG

University of Copenhagen

Danish Finance Institute

Centre for Financial Econometrics, Asset Markets and Macroeconomic Policy at Lancaster University

Published

March 13, 2024

With commit 6acb50b, Tidy Finance has transitioned to a new version of the CRSP tables distributed by WRDS. In this blog post, we will review the changes that affect our routines.

Overview: CRSP and WRDS

Tidy Finance shows how to download the most used data in empirical research in finance from the Wharton Research Data Services (WRDS). A main component of this data is the stock return history provided by the Center for Research in Security Prices (CRSP). CRSP is the de-facto gold standard for stock return data covering historical returns from the 1920s until now.

You can read all about how to connect to WRDS and download the CRSP data in our chapter WRDS, CRSP, and Compustat in the R version and the Python version.

CRSP format 2.0 (CIZ)

In 2022, CRSP rolled out a new format for its data, which is called CRSP’s Stock and Indexes Flat File Format 2.0 (CIZ). You can find detailed information in the official CRSP documentation. This new format is now also available via WRDS. So, what is new in the second version? The most notable items from CRSP’s announcement include:

  • They renamed many variables in the daily and monthly data. Not all of these changes make it easier to understand what the variables are, and some names are hard to read.
  • They added additional variables that are sometimes redundant but make querying simpler (e.g., you do not have to merge different data to perform common tasks).
  • Delisting returns are now included in the main return time series.
  • They introduce issuer-level data, which ensures unique observations for each issuer (e.g., industry codes).
  • They reworked some flag items by giving them alphanumeric values instead of the previous numeric codes. Moreover, these codes are now also documented in metadata files to facilitate access.

Many of these changes mean that you will not be able to simply use the new data with your existing code. Do not worry. We have you covered and will show you how to use the new data in your previous routines easily.

You can access the new data via WRDS using the commands below. You can find more information about the changes at WRDS in SIZ to CIZ Overview and FAQ. For the main data, WRDS basically just added a “_v2” postfix. However, you also see we need fewer tables, as we do not load information on delisting returns separately.

msf_db <- tbl(wrds, in_schema("crsp", "msf_v2"))
stksecurityinfohist_db <- tbl(wrds, in_schema("crsp", "stksecurityinfohist"))

We do not repost the new routines for downloading the data in this blog. Instead, we ask you to check out the new code in the respective chapter on WRDS, CRSP, and Compustat in the R version and the Python version.

A small glimpse at the differences

To take a look at the new table and compare it to the old one, we use the new tidyfinance R package (see our recent blog post on the initial release here) to download the data. If you do not have the package yet, please install it from CRAN using install.packages("tidyfinance"). For ease of use, we also load the full tidyverse as in all our chapters of the R book.

library(tidyfinance)
library(tidyverse)

Then, we can download the two data formats using the function download_data_wrds_crsp().

crsp_v1 <- download_data_wrds_crsp("crsp_monthly", 
                                    version = "v1",
                                    start_date = "1970-01-01",
                                    end_date = "2022-12-01")

crsp_v2 <- download_data_wrds_crsp("crsp_monthly", 
                                    version = "v2",
                                    start_date = "1970-01-01",
                                    end_date = "2022-12-01")

Now, do we find any significant differences between the data? Let us look at the number of observations identified by permno and date.

nrow(crsp_v2)
[1] 3105707
nrow(crsp_v1)
[1] 3103640

We see that the new data contains more rows. However, we are not sure where these observations come from or where we lost them in the old data. Let us see if the observations are actually matched for the identifiers with a anti_join().

anti_join(crsp_v2, crsp_v1,
          by = join_by("permno", "date")) |> 
  nrow()
[1] 2650

We find that roughly 4,000 observations are not matched. However, this is a very minor difference.

Next, we check if the monthly returns are all the same. We compare the new returns to the adjusted returns from version 1.0 to account for delisting in both versions. In fact, these returns should be slightly different, as CRSP states that they have updated the way to compound distributions. They also state this change in compounding as an example, among other things. However, they do not yet provide more information about these other things.

inner_join(crsp_v2 |> 
             select(permno, date, ret_new = ret_excess), 
           crsp_v1 |> 
             select(permno, date, ret_old = ret_excess),
          by = join_by("permno", "date")) |> 
  mutate(ret_differences = !near(ret_new, ret_old, tol = 10^-6)) |> 
  summarize(share_all_ret_differences = sum(ret_differences, na.rm = TRUE) / n())
# A tibble: 1 × 1
  share_all_ret_differences
                      <dbl>
1                     0.123

Indeed, we see that not all returns are exactly matched.

Can we keep using the old CRSP data?

Unfortunately, the old format (also called 1.0 (SIZ)) will not receive updates after the end of the year 2024. Nevertheless, WRDS said that the old data will remain in its current place, so we can continue replicating studies with their original data (ignoring changes to the actual data points). Thus, we have included the option to download the legacy data in our tidyfinance R package.