import pandas as pd
import numpy as np
import sqlite3
import pyfixest as pf
from plotnine import *
from scipy.stats import norm
from mizani.breaks import date_breaks
from mizani.formatters import date_format
Difference in Differences
You are reading Tidy Finance with Python. You can find the equivalent chapter for the sibling Tidy Finance with R here.
In this chapter, we illustrate the concept of difference in differences (DD) estimators by evaluating the effects of climate change regulation on the pricing of bonds across firms. DD estimators are typically used to recover the treatment effects of natural or quasi-natural experiments that trigger sharp changes in the environment of a specific group. Instead of looking at differences in just one group (e.g., the effect in the treated group), DD investigates the treatment effects by looking at the difference between differences in two groups. Such experiments are usually exploited to address endogeneity concerns (e.g., Roberts and Whited 2013). The identifying assumption is that the outcome variable would change equally in both groups without the treatment. This assumption is also often referred to as the assumption of parallel trends. Moreover, we would ideally also want a random assignment to the treatment and control groups. Due to lobbying or other activities, this randomness is often violated in (financial) economics.
In the context of our setting, we investigate the impact of the Paris Agreement (PA), signed on December 12, 2015, on the bond yields of polluting firms. We first estimate the treatment effect of the agreement using panel regression techniques that we discuss in Fixed Effects and Clustered Standard Errors. We then present two methods to illustrate the treatment effect over time graphically. Although we demonstrate that the treatment effect of the agreement is anticipated by bond market participants well in advance, the techniques we present below can also be applied to many other settings.
The approach we use here replicates the results of Seltzer, Starks, and Zhu (2022) partly. Specifically, we borrow their industry definitions for grouping firms into green and brown types. Overall, the literature on environmental, social, and governance (ESG) effects in corporate bond markets is already large but continues to grow (for recent examples, see, e.g., Halling, Yu, and Zechner (2021), Handler, Jankowitsch, and Pasler (2022), Huynh and Xia (2021), among many others).
The current chapter relies on this set of Python packages.
Compared to previous chapters, we introduce the scipy.stats
module from the scipy
(Virtanen et al. 2020) for simple retrieval of quantiles of the standard normal distribution.
Data Preparation
We use TRACE and Mergent FISD as data sources from our SQLite database introduced in Accessing and Managing Financial Data and TRACE and FISD.
= sqlite3.connect(database="data/tidy_finance_python.sqlite")
tidy_finance
= (pd.read_sql_query(
fisd ="SELECT complete_cusip, maturity, offering_amt, sic_code FROM fisd",
sql=tidy_finance,
con={"maturity"})
parse_dates
.dropna()
)
= (pd.read_sql_query(
trace_enhanced =("SELECT cusip_id, trd_exctn_dt, rptd_pr, entrd_vol_qt, yld_pt "
sql"FROM trace_enhanced"),
=tidy_finance,
con={"trd_exctn_dt"})
parse_dates
.dropna() )
We start our analysis by preparing the sample of bonds. We only consider bonds with a time to maturity of more than one year to the signing of the PA, so that we have sufficient data to analyze the yield behavior after the treatment date. This restriction also excludes all bonds issued after the agreement. We also consider only the first two digits of the SIC industry code to identify the polluting industries (in line with Seltzer, Starks, and Zhu 2022).
= pd.to_datetime("2015-12-12")
treatment_date = [
polluting_industries 49, 13, 45, 29, 28, 33, 40, 20, 26, 42, 10, 53, 32, 99, 37
]
= (fisd
bonds "offering_amt > 0 & sic_code != 'None'")
.query(
.assign(=lambda x: (x["maturity"]-treatment_date).dt.days / 365,
time_to_maturity=lambda x: x["sic_code"].astype(str).str[:2].astype(int),
sic_code=lambda x: np.log(x["offering_amt"])
log_offering_amt
)"time_to_maturity >= 1")
.query(={"complete_cusip": "cusip_id"})
.rename(columns"cusip_id", "time_to_maturity", "log_offering_amt", "sic_code"])
.get([=lambda x: x["sic_code"].isin(polluting_industries))
.assign(polluter=True)
.reset_index(drop )
Next, we aggregate the individual transactions as reported in TRACE to a monthly panel of bond yields. We consider bond yields for a bond’s last trading day in a month. Therefore, we first aggregate bond data to daily frequency and apply common restrictions from the literature (see, e.g., Bessembinder et al. 2008). We weigh each transaction by volume to reflect a trade’s relative importance and avoid emphasizing small trades. Moreover, we only consider transactions with reported prices rptd_pr
larger than 25 (to exclude bonds that are close to default) and only bond-day observations with more than five trades on a corresponding day (to exclude prices based on too few, potentially non-representative transactions).
= (trace_enhanced
trace_enhanced "rptd_pr > 25")
.query(=lambda x: x["entrd_vol_qt"]*x["rptd_pr"])
.assign(weight=lambda x: x["weight"]*x["yld_pt"])
.assign(weighted_yield
)
= (trace_enhanced
trace_aggregated "cusip_id", "trd_exctn_dt"])
.groupby([
.aggregate(=("weighted_yield", "sum"),
weighted_yield_sum=("weight", "sum"),
weight_sum=("rptd_pr", "count")
trades
)
.reset_index()=lambda x: x["weighted_yield_sum"]/x["weight_sum"])
.assign(avg_yield=["avg_yield"])
.dropna(subset"trades >= 5")
.query(=lambda x: pd.to_datetime(x["trd_exctn_dt"]))
.assign(trd_exctn_dt=lambda x: x["trd_exctn_dt"]-pd.offsets.MonthBegin())
.assign(date
)
= (trace_aggregated
date_index "cusip_id", "date"])["trd_exctn_dt"]
.groupby([
.idxmax()
)
= (trace_aggregated
trace_aggregated
.loc[date_index]"cusip_id", "date", "avg_yield"])
.get([ )
By combining the bond-specific information from Mergent FISD for our bond sample with the aggregated TRACE data, we arrive at the main sample for our analysis.
= (bonds
bonds_panel ="inner", on="cusip_id")
.merge(trace_aggregated, how
.dropna() )
Before we can run the first regression, we need to define the treated
indicator,1 which is the product of the post_period
(i.e., all months after the signing of the PA) and the polluter
indicator defined above.
= (bonds_panel
bonds_panel
.assign(=lambda x: (
post_period"date"] >= (treatment_date-pd.offsets.MonthBegin())
x[
)
)=lambda x: x["polluter"] & x["post_period"])
.assign(treated=lambda x: pd.Categorical(x["date"], ordered=True))
.assign(date_cat )
As usual, we tabulate summary statistics of the variables that enter the regression to check the validity of our variable definitions.
= (bonds_panel
bonds_panel_summary ="measure",
.melt(var_name=["avg_yield", "time_to_maturity", "log_offering_amt"])
value_vars"measure")
.groupby(=[0.05, 0.5, 0.95])
.describe(percentiles
)round(bonds_panel_summary, 2) np.
value | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 5% | 50% | 95% | max | |
measure | ||||||||
avg_yield | 127546.0 | 4.08 | 4.21 | 0.06 | 1.27 | 3.38 | 8.11 | 127.97 |
log_offering_amt | 127546.0 | 13.27 | 0.82 | 4.64 | 12.21 | 13.22 | 14.51 | 16.52 |
time_to_maturity | 127546.0 | 8.55 | 8.41 | 1.01 | 1.50 | 5.81 | 27.41 | 100.70 |
Panel Regressions
The PA is a legally binding international treaty on climate change. It was adopted by 196 parties at COP 21 in Paris on December 12, 2015 and entered into force on November 4, 2016. The PA obliges developed countries to support efforts to build clean, climate-resilient futures. One may thus hypothesize that adopting climate-related policies may affect financial markets. To measure the magnitude of this effect, we first run an ordinary least square (OLS) regression without fixed effects where we include the treated
, post_period
, and polluter
dummies, as well as the bond-specific characteristics log_offering_amt
and time_to_maturity
. This simple model assumes that there are essentially two periods (before and after the PA) and two groups (polluters and non-polluters). Nonetheless, it should indicate whether polluters have higher yields following the PA compared to non-polluters.
The second model follows the typical DD regression approach by including individual (cusip_id
) and time (date
) fixed effects. In this model, we do not include any other variables from the simple model because the fixed effects subsume them, and we observe the coefficient of our main variable of interest: treated
.
= pf.feols(
model_without_fe "avg_yield ~ treated + post_period + polluter + log_offering_amt + time_to_maturity",
= "iid",
vcov = bonds_panel
data
)
= pf.feols(
model_with_fe "avg_yield ~ treated | cusip_id + date",
= "iid",
vcov = bonds_panel
data
)
= "b (t)") pf.etable([model_without_fe, model_with_fe], coef_fmt
est1 est2
---------------- ------------------- -----------------
depvar avg_yield avg_yield
--------------------------------------------------------
Intercept 10.733*** (57.058)
treated 0.453*** (9.136) 0.975*** (30.123)
post_period -0.178*** (-6.041)
polluter 0.486*** (15.427)
log_offering_amt -0.550*** (-38.991)
time_to_maturity 0.058*** (41.527)
--------------------------------------------------------
cusip_id - x
date - x
--------------------------------------------------------
R2 0.032 0.648
S.E. type iid iid
Observations 127546 127546
--------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (t-stats)
Both models indicate that polluters have significantly higher yields after the PA than non-polluting firms. Note that the magnitude of the treated
coefficient varies considerably across models.
Visualizing Parallel Trends
Even though the regressions above indicate that there is an impact of the PA on bond yields of polluters, the tables do not tell us anything about the dynamics of the treatment effect. In particular, the models provide no indication about whether the crucial parallel trends assumption is valid. This assumption requires that in the absence of treatment, the difference between the two groups is constant over time. Although there is no well-defined statistical test for this assumption, visual inspection typically provides a good indication.
To provide such visual evidence, we revisit the simple OLS model and replace the treated
and post_period
indicators with month dummies for each group. This approach estimates the average yield change of both groups for each period and provides corresponding confidence intervals. Plotting the coefficient estimates for both groups around the treatment date shows us the dynamics of our panel data.
= pf.feols(
model_without_fe_time "avg_yield ~ polluter + date*polluter + time_to_maturity + log_offering_amt",
= "iid",
vcov = bonds_panel.assign(date = lambda x: x["date"].astype(str))
data
)
= (pd.DataFrame({
model_without_fe_coefs "estimate": model_without_fe_time.coef(),
"std_error": model_without_fe_time.se()
})="term")
.reset_index(names"term.str.contains('date')")
.query(
.assign(=lambda x: x["term"].str.contains(":polluter"),
treatment=lambda x: x["term"].str.extract(r"date\[T\.(\d{4}-\d{2}-\d{2})\]"),
date=lambda x: x["estimate"]+norm.ppf(0.975)*x["std_error"],
ci_up=lambda x: x["estimate"]+norm.ppf(0.025)*x["std_error"]
ci_low
)
)
= (
polluters_plot
ggplot(model_without_fe_coefs, ="date", y="estimate",
aes(x="treatment", linetype="treatment", shape="treatment")) +
color=pd.to_datetime(treatment_date) -
geom_vline(xintercept="dashed") +
pd.offsets.MonthBegin(), linetype=0, linetype="dashed") +
geom_hline(yintercept="ci_low", ymax="ci_up"), alpha=0.5) +
geom_errorbar(aes(ymin+
geom_point() =None) +
guides(linetype="", y="Yield", shape="Polluter?", color="Polluter?",
labs(x="Polluters respond stronger than green firms") +
title=["solid", "dashed"]) +
scale_linetype_manual(values=date_breaks("1 year"), labels=date_format("%Y"))
scale_x_datetime(breaks
) polluters_plot.draw()
Figure 1 shows that throughout most of 2014, the yields of the two groups changed in unison. However, starting at the end of 2014, the yields start to diverge, reaching the highest difference around the signing of the PA. Afterward, the yields for both groups fall again, and the polluters arrive at the same level as at the beginning of 2014. The non-polluters, on the other hand, even experience significantly lower yields than polluters after the signing of the agreement.
Instead of plotting both groups using the simple model approach, we can also use the fixed-effects model and focus on the polluter’s yield response to the signing relative to the non-polluters. To perform this estimation, we need to replace the treated
indicator with separate time dummies for the polluters, each marking a one-month period relative to the treatment date.
= (bonds_panel
bonds_panel_alt
.assign(=lambda x: (
diff_to_treatmentround(
np."date"]-(treatment_date-
((x[/365)*12, 0
pd.offsets.MonthBegin())).dt.daysint)
).astype(
)
)
)
= (bonds_panel_alt
variables "diff_to_treatment", "date"])
.get([
.drop_duplicates()"date")
.sort_values(
.copy()=np.nan)
.assign(variable_name=True)
.reset_index(drop )
In the next code chunk, we assemble the model formula and regress the monthly yields on the set of time dummies and cusip_id
and date
fixed effects.
= "avg_yield ~ 1 + "
formula
for j in range(variables.shape[0]):
if variables["diff_to_treatment"].iloc[j] != 0:
=list(bonds_panel_alt.columns)
old_names
"new_var"] = (
bonds_panel_alt["diff_to_treatment"] ==
bonds_panel_alt["diff_to_treatment"].iloc[j]
variables[& bonds_panel_alt["polluter"]
)
=variables["diff_to_treatment"].iloc[j]
diff_to_treatment_value="lag" if diff_to_treatment_value < 0 else "lead"
direction=int(abs(diff_to_treatment_value))
abs_diff_to_treatment=f"{direction}{abs_diff_to_treatment}"
new_var_name"variable_name"]=new_var_name
variables.at[j, =bonds_panel_alt["new_var"]
bonds_panel_alt[new_var_name]+= (f" + {new_var_name}" if j > 0 else new_var_name)
formula
= formula + " | cusip_id + date "
formula
= pf.feols(
model_with_fe_time
formula,= "iid",
vcov = bonds_panel_alt
data )
We then collect the regression results into a dataframe that contains the estimates and corresponding 95 percent confidence intervals. Note that we also add a row with zeros for the (omitted) reference point of the time dummies.
= pd.DataFrame({
lag0_row "term": ["lag0"],
"estimate": [0],
"std_error": [0],
"ci_up": [0],
"ci_low": [0],
"diff_to_treatment": [0],
"date": [treatment_date - pd.offsets.MonthBegin()]
})
= (pd.DataFrame({
model_with_fe_time_coefs "estimate": model_with_fe_time.coef(),
"std_error": model_with_fe_time.se()
})="term")
.reset_index(names
.assign(=lambda x: x["estimate"]+norm.ppf(0.975)*x["std_error"],
ci_up=lambda x: x["estimate"]+norm.ppf(0.025)*x["std_error"]
ci_low
)="left", left_on="term", right_on="variable_name")
.merge(variables, how="variable_name")
.drop(columns
)
= pd.concat(
model_with_fe_time_coefs
[model_with_fe_time_coefs, lag0_row], =True
ignore_index )
Figure 2 shows the resulting coefficient estimates.
= (
polluter_plot
ggplot(model_with_fe_time_coefs, ="date", y="estimate")) +
aes(x=treatment_date - pd.offsets.MonthBegin()),
geom_vline(aes(xintercept="dashed") +
linetype=0), linetype="dashed") +
geom_hline(aes(yintercept="ci_low", ymax="ci_up"), alpha=0.5) +
geom_errorbar(aes(ymin="estimate")) +
geom_point(aes(y="", y="Yield",
labs(x="Polluters' yield patterns around Paris Agreement signing") +
title=date_breaks("1 year"), labels=date_format("%Y"))
scale_x_datetime(breaks
) polluter_plot.draw()
The resulting graph shown in Figure 2 confirms the main conclusion of the previous image: polluters’ yield patterns show a considerable anticipation effect starting toward the end of 2014. Yields only marginally increase after the signing of the agreement. However, as opposed to the simple model, we do not see a complete reversal back to the pre-agreement level. Yields of polluters stay at a significantly higher level even one year after the signing.
Notice that during the year after the PA was signed, the 45th president of the United States was elected (on November 8, 2016). During his campaign there were some indications of intentions to withdraw the US from the PA, which ultimately happened on November 4, 2020. Hence, reversal effects are potentially driven by these actions.
Exercises
- The 46th president of the US rejoined the Paris Agreement on February 19, 2021. Repeat the difference in differences analysis for the day of his election victory. Note that you will also have to download new TRACE data. How did polluters’ yields react to this action?
- Based on the exercise on ratings in TRACE and FISD, include ratings as a control variable in the analysis above. Do the results change?
References
Footnotes
Note that by using a generic name here, everybody can replace ours with their sample data and run the code to produce standard regression tables and illustrations.↩︎