Introduction to Tidy Finance

Note

You are reading Tidy Finance with Python. You can find the equivalent chapter for the sibling Tidy Finance with R here.

The main aim of this chapter is to familiarize yourself with pandas (McKinney 2010) and numpy (Harris et al. 2020), the main workhorses for data analysis in Python. We start by downloading and visualizing stock data from Yahoo!Finance. Then, we move to a simple portfolio choice problem and construct the efficient frontier. These examples introduce you to our approach of Tidy Finance.

Working with Stock Market Data

At the start of each session, we load the required Python packages. Throughout the entire book, we always use pandas and numpy to perform a number of data manipulations. In this chapter, we also load the convenient yfinance (Aroussi 2023) package to download price data.

import pandas as pd
import numpy as np
import yfinance as yf

Note that import pandas as pd implies that we can call all pandas functions later with a simple pd.function(). Instead, utilizing from pandas import * is generally discouraged, as it leads to namespace pollution. This statement imports all functions and classes from pandas into your current namespace, potentially causing conflicts with functions you define or those from other imported libraries. Using the pd abbreviation is a very convenient way to prevent this.

We first download daily prices for one stock symbol, e.g., the Apple stock, AAPL, directly from the data provider Yahoo!Finance. To download the data, you can use the function yf.download(). The data from Yahoo!Finance comes as a dataframe, a two-dimensional, tabular data structure in which each row is indexed, and each column has a name. In this case, since only one ticker symbol is requested we specify that the data be returned with a single-level index. After the download, we apply a set of functions directly on the dataframe. First, we put the date index into a separate column. Second, we add the column symbol that stores the symbol information, and finally, we rename all columns to lowercase names. Dataframes allow for chaining all these operations sequentially through using ..

prices = (yf.download(
    tickers="AAPL", 
    start="2000-01-01", 
    end="2023-12-31", 
    progress=False,
    auto_adjust=False,
    multi_level_index=False,
  )
  .reset_index()
  .assign(symbol="AAPL")
  .rename(columns={
    "Date": "date", 
    "Open": "open", 
    "High": "high",
    "Low": "low",
    "Close": "close", 
    "Adj Close": "adjusted", 
    "Volume": "volume"}
  )
)
prices.head().round(3)
date adjusted close high low open volume symbol
0 2000-01-03 0.843 0.999 1.004 0.908 0.936 535796800 AAPL
1 2000-01-04 0.772 0.915 0.988 0.903 0.967 512377600 AAPL
2 2000-01-05 0.783 0.929 0.987 0.920 0.926 778321600 AAPL
3 2000-01-06 0.716 0.848 0.955 0.848 0.948 767972800 AAPL
4 2000-01-07 0.749 0.888 0.902 0.853 0.862 460734400 AAPL

yf.download() downloads stock market data from Yahoo!Finance. The above code chunk returns a dataframe with eight quite self-explanatory columns: date, the market prices at the open, high, low, and close, the adjusted price in USD, the daily volume (in the number of traded shares), and the symbol. The adjusted prices are corrected for anything that might affect the stock price after the market closes, e.g., stock splits and dividends. These actions affect the quoted prices but have no direct impact on the investors who hold the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns an investor would have earned by holding the stock continuously.

Next, we use the plotnine (Kibirige 2023b) package to visualize the time series of adjusted prices in Figure 1. This package takes care of visualization tasks based on the principles of the Grammar of Graphics (Wilkinson 2012). Note that generally, we do not recommend using the * import style. However, we use it here only for the plotting functions, which are distinct to plotnine and have very plotting-related names. So, the risk of misuse through a polluted namespace is marginal.

from plotnine import *

Creating figures becomes very intuitive with the Grammar of Graphics, as the following code chunk demonstrates.

prices_figure = (
  ggplot(prices, 
         aes(y="adjusted", x="date")) +
  geom_line() +
  labs(x="", y="",
       title="Apple stock prices from 2000 to 2023")
)
prices_figure.draw()
Title: Apple stock prices between the beginning of 2000 and the end of 2023. The figure shows that the stock price of Apple increased dramatically from about 1 USD to around 125 USD.
Figure 1: The figure shows Apple stock prices between the beginning of 2000 and the end of 2023. Prices are in USD, adjusted for dividend payments and stock splits.

Instead of analyzing prices, we compute daily returns defined as

\[r_t = p_t / p_{t-1} - 1, \tag{1}\]

where \(p_t\) is the adjusted price on day \(t\). In that context, the function pct_change() is helpful because it computes this percentage change.

returns = (prices
  .sort_values("date")
  .assign(ret=lambda x: x["adjusted"].pct_change())
  .get(["symbol", "date", "ret"])
)
symbol date ret
0 AAPL 2000-01-03 NaN
1 AAPL 2000-01-04 -0.084
2 AAPL 2000-01-05 0.015
3 AAPL 2000-01-06 -0.087
4 AAPL 2000-01-07 0.047

The resulting dataframe contains three columns, where the last contains the daily returns (ret). Note that the first entry naturally contains a missing value (NaN) because there is no previous price. Obviously, the use of pct_change() would be meaningless if the time series is not ordered by ascending dates. The function sort_values() provides a convenient way to order observations in the correct way for our application. In case you want to order observations by descending dates, you can use the parameter ascending=False.

For the upcoming examples, we remove missing values, as these would require separate treatment when computing, e.g., sample averages. In general, however, make sure you understand why NA values occur and carefully examine if you can simply get rid of these observations. The dataframe dropna() method kicks out all rows that contain a missing value in any column.

returns = returns.dropna() 

Next, we visualize the distribution of daily returns in a histogram in Figure 2, where we also introduce the mizani (Kibirige 2023a) package for formatting functions. Additionally, we add a dashed line that indicates the five percent quantile of the daily returns to the histogram, which is a (crude) proxy for the worst return of the stock with a probability of at most five percent. The five percent quantile is closely connected to the (historical) value-at-risk, a risk measure commonly monitored by regulators. We refer to Tsay (2010) for a more thorough introduction to stylized facts of returns.

from mizani.formatters import percent_format

quantile_05 = returns["ret"].quantile(0.05)

returns_figure = (
  ggplot(returns, aes(x="ret")) +
  geom_histogram(bins=100) +
  geom_vline(aes(xintercept=quantile_05), 
                 linetype="dashed") +
  labs(x="", y="",
       title="Distribution of daily Apple stock returns") +
  scale_x_continuous(labels=percent_format())
)
returns_figure.draw()
Title: Distribution of daily Apple stock returns in percent. The figure shows a histogram of daily returns. The range indicates a few large negative values, while the remaining returns are distributed around 0. The vertical line indicates that the historical five percent quantile of daily returns was around negative three percent.
Figure 2: The figure shows a histogram of daily Apple stock returns in percent. The dotted vertical line indicates the historical five percent quantile.

Here, bins=100 determines the number of bins used in the illustration and, hence, implicitly the width of the bins. Before proceeding, make sure you understand how to use the geom geom_vline() to add a dashed line that indicates the five percent quantile of the daily returns. A typical task before proceeding with any data is to compute summary statistics for the main variables of interest.

pd.DataFrame(returns["ret"].describe()).round(3).T
count mean std min 25% 50% 75% max
ret 6036.0 0.001 0.025 -0.519 -0.01 0.001 0.013 0.139

We see that the maximum daily return was 13.9 percent. Perhaps not surprisingly, the average daily return is close to but slightly above 0. In line with the illustration above, the large losses on the day with the minimum returns indicate a strong asymmetry in the distribution of returns.

You can also compute these summary statistics for each year individually by imposing .groupby(returns["date"].dt.year), where the call .dt.year returns the year of a date variable. More specifically, the few lines of code below compute the summary statistics from above for individual groups of data defined by year. The summary statistics, therefore, allow an eyeball analysis of the time-series dynamics of the return distribution.

(returns["ret"]
  .groupby(returns["date"].dt.year)
  .describe()
  .round(3)
)
count mean std min 25% 50% 75% max
date
2000 251.0 -0.003 0.055 -0.519 -0.034 -0.002 0.027 0.137
2001 248.0 0.002 0.039 -0.172 -0.023 -0.001 0.027 0.129
2002 252.0 -0.001 0.031 -0.150 -0.019 -0.003 0.018 0.085
2003 252.0 0.002 0.023 -0.081 -0.012 0.002 0.015 0.113
2004 252.0 0.005 0.025 -0.056 -0.009 0.003 0.016 0.132
2005 252.0 0.003 0.024 -0.092 -0.010 0.003 0.017 0.091
2006 251.0 0.001 0.024 -0.063 -0.014 -0.002 0.014 0.118
2007 251.0 0.004 0.024 -0.070 -0.009 0.003 0.018 0.105
2008 253.0 -0.003 0.037 -0.179 -0.024 -0.001 0.019 0.139
2009 252.0 0.004 0.021 -0.050 -0.009 0.002 0.015 0.068
2010 252.0 0.002 0.017 -0.050 -0.006 0.002 0.011 0.077
2011 252.0 0.001 0.017 -0.056 -0.009 0.001 0.011 0.059
2012 250.0 0.001 0.019 -0.064 -0.008 0.000 0.012 0.089
2013 252.0 0.000 0.018 -0.124 -0.009 -0.000 0.011 0.051
2014 252.0 0.001 0.014 -0.080 -0.006 0.001 0.010 0.082
2015 252.0 0.000 0.017 -0.061 -0.009 -0.001 0.009 0.057
2016 252.0 0.001 0.015 -0.066 -0.006 0.001 0.008 0.065
2017 251.0 0.002 0.011 -0.039 -0.004 0.001 0.007 0.061
2018 251.0 -0.000 0.018 -0.066 -0.009 0.001 0.009 0.070
2019 252.0 0.003 0.016 -0.100 -0.005 0.003 0.012 0.068
2020 253.0 0.003 0.029 -0.129 -0.010 0.002 0.017 0.120
2021 252.0 0.001 0.016 -0.042 -0.008 0.001 0.012 0.054
2022 251.0 -0.001 0.022 -0.059 -0.016 -0.001 0.014 0.089
2023 250.0 0.002 0.013 -0.048 -0.006 0.002 0.009 0.047

Scaling Up the Analysis

As a next step, we generalize the code from before such that all the computations can handle an arbitrary vector of stock symbols (e.g., all constituents of an index). Following tidy principles, it is quite easy to download the data, plot the price time series, and tabulate the summary statistics for an arbitrary number of assets.

This is where the magic starts: tidy data makes it extremely easy to generalize the computations from before to as many assets as you like. The following code takes any vector of symbols, e.g., symbol=["AAPL", "MMM", "BA"], and automates the download as well as the plot of the price time series. In the end, we create the table of summary statistics for an arbitrary number of assets. We perform the analysis with data from all current constituents of the Dow Jones Industrial Average index.

We first download a table with DOW Jones constituents from an external website. Note that you need to temporarily modify the behavior of handling SSL certificates in Python’s ssl module when reading the constituents directly from the web. This approach should be used with caution, which is why we revert the setting to the default behavior after the successful data download.

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

url = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
       "products/fund-data/etfs/us/holdings-daily-us-en-dia.xlsx")

symbols = (pd.read_excel(url, skiprows=4, nrows=30)
  .get("Ticker")
  .tolist()
)

ssl._create_default_https_context = ssl.create_default_context

Next, we can use yf.download() to download prices for all stock symbols in the above list and again chain a couple of pandas dataframe functions to create a tidy dataset.

prices_daily = (yf.download(
    tickers=symbols, 
    start="2000-01-01", 
    end="2023-12-31", 
    progress=False,
    auto_adjust=False,
    multi_level_index=False
  ))

prices_daily = (prices_daily
  .stack()
  .reset_index(level=1, drop=False)
  .reset_index()
  .rename(columns={
    "Date": "date",
    "Ticker": "symbol",
    "Open": "open",
    "High": "high",
    "Low": "low",
    "Close": "close",
    "Adj Close": "adjusted",
    "Volume": "volume"}
  )
)

The resulting dataframe contains 177,925 daily observations for 30 different stock symbols. Figure 3 illustrates the time series of downloaded adjusted prices for each of the constituents of the Dow Jones index. We again draw on the mizani package, but this time we use its useful date formatting function to get nicer axis labels. Make sure you understand every single line of code! What are the arguments of aes()? Which alternative geoms could you use to visualize the time series? Hint: If you do not know the answers, try to change the code to see what difference your intervention causes.

from mizani.breaks import date_breaks
from mizani.formatters import date_format

prices_daily_figure = (
  ggplot(prices_daily, 
         aes(y="adjusted", x="date", color="symbol")) +
 geom_line() +
 labs(x="", y="", color="",
      title="Stock prices of DOW index constituents") +
 theme(legend_position="none") +
 scale_x_datetime(date_breaks="5 years", date_labels="%Y")
)
prices_daily_figure.draw()
Title: Stock prices of DOW index constituents. The figure shows many time series with daily prices. The general trend seems positive for most stocks in the DOW index.
Figure 3: The figure shows the stock prices of DOW index constituents. Prices are in USD, adjusted for dividend payments and stock splits.

Do you notice the small differences relative to the code we used before? yf.download(symbols) returns a dataframe for several symbols as well. All we need to do to illustrate all symbols simultaneously is to include color="symbol" in the ggplot aesthetics. In this way, we generate a separate line for each symbol. Of course, there are simply too many lines on this graph to identify the individual stocks properly, but it illustrates the point well.

The same holds for stock returns. Before computing the returns, we use groupby("symbol") such that the assign() command is performed to calculate the returns for each symbol individually and assign it to the variable ret in the dataframe. The same logic also applies to the computation of summary statistics: groupby("symbol") is the key to aggregating the time series into symbol-specific variables of interest.

returns_daily = (prices_daily
  .assign(ret=lambda x: x.groupby("symbol")["adjusted"].pct_change())
  .get(["symbol", "date", "ret"])
  .dropna(subset="ret")
)

(returns_daily
  .groupby("symbol")["ret"]
  .describe()
  .round(3)
)
count mean std min 25% 50% 75% max
symbol
AAPL 6036.0 0.001 0.025 -0.519 -0.010 0.001 0.013 0.139
AMGN 6036.0 0.000 0.019 -0.134 -0.009 0.000 0.009 0.151
AMZN 6036.0 0.001 0.032 -0.248 -0.012 0.000 0.014 0.345
AXP 6036.0 0.001 0.023 -0.176 -0.009 0.000 0.010 0.219
BA 6036.0 0.001 0.022 -0.238 -0.010 0.001 0.011 0.243
CAT 6036.0 0.001 0.020 -0.145 -0.010 0.001 0.011 0.147
CRM 4914.0 0.001 0.027 -0.271 -0.012 0.001 0.014 0.260
CSCO 6036.0 0.000 0.023 -0.162 -0.009 0.000 0.010 0.244
CVX 6036.0 0.001 0.018 -0.221 -0.008 0.001 0.009 0.227
DIS 6036.0 0.000 0.019 -0.184 -0.009 0.000 0.009 0.160
GS 6036.0 0.001 0.023 -0.190 -0.010 0.000 0.011 0.265
HD 6036.0 0.001 0.019 -0.287 -0.008 0.001 0.009 0.141
HON 6036.0 0.000 0.019 -0.174 -0.008 0.001 0.009 0.282
IBM 6036.0 0.000 0.016 -0.155 -0.007 0.000 0.008 0.120
JNJ 6036.0 0.000 0.012 -0.158 -0.005 0.000 0.006 0.122
JPM 6036.0 0.001 0.024 -0.207 -0.009 0.000 0.010 0.251
KO 6036.0 0.000 0.013 -0.101 -0.005 0.000 0.006 0.139
MCD 6036.0 0.001 0.015 -0.159 -0.006 0.001 0.007 0.181
MMM 6036.0 0.000 0.015 -0.129 -0.007 0.000 0.008 0.126
MRK 6036.0 0.000 0.017 -0.268 -0.007 0.000 0.008 0.130
MSFT 6036.0 0.001 0.019 -0.156 -0.008 0.000 0.009 0.196
NKE 6036.0 0.001 0.019 -0.198 -0.008 0.001 0.009 0.155
NVDA 6036.0 0.002 0.038 -0.352 -0.016 0.001 0.018 0.424
PG 6036.0 0.000 0.013 -0.302 -0.005 0.000 0.006 0.120
SHW 6036.0 0.001 0.018 -0.208 -0.008 0.001 0.009 0.153
TRV 6036.0 0.001 0.018 -0.208 -0.007 0.001 0.008 0.256
UNH 6036.0 0.001 0.020 -0.186 -0.008 0.001 0.010 0.348
V 3973.0 0.001 0.019 -0.136 -0.008 0.001 0.009 0.150
VZ 6036.0 0.000 0.015 -0.118 -0.007 0.000 0.007 0.146
WMT 6036.0 0.000 0.015 -0.114 -0.007 0.000 0.007 0.117

Other Forms of Data Aggregation

Of course, aggregation across variables other than symbol can also make sense. For instance, suppose you are interested in answering the question: Are days with high aggregate trading volume likely followed by days with high aggregate trading volume? To provide some initial analysis on this question, we take the downloaded data and compute aggregate daily trading volume for all Dow Jones constituents in USD. Recall that the column volume is denoted in the number of traded shares. Thus, we multiply the trading volume with the daily adjusted closing price to get a proxy for the aggregate trading volume in USD. Scaling by 1e9 (Python can handle scientific notation) denotes daily trading volume in billion USD.

trading_volume = (prices_daily
  .assign(trading_volume=lambda x: (x["volume"]*x["adjusted"])/1e9)
  .groupby("date")["trading_volume"]
  .sum()
  .reset_index()
  .assign(trading_volume_lag=lambda x: x["trading_volume"].shift(periods=1))
)

trading_volume_figure = (
  ggplot(trading_volume, 
          aes(x="date", y="trading_volume")) +
  geom_line() +
  labs(x="", y="",
       title=("Aggregate daily trading volume of DOW index constituents "
              "in billion USD"))  +
 scale_x_datetime(date_breaks="5 years", date_labels="%Y")
)
trading_volume_figure.draw()
Title: Aggregate daily trading volume. The figure shows a volatile time series of daily trading volume, ranging from 15 in 2000 to 20.5 in 2023, with a maximum of more than 100.
Figure 4: The figure shows the total daily trading volume in billion USD.

Figure 4 indicates a clear upward trend in aggregated daily trading volume. In particular, since the outbreak of the COVID-19 pandemic, markets have processed substantial trading volumes, as analyzed, for instance, by Goldstein, Koijen, and Mueller (2021). One way to illustrate the persistence of trading volume would be to plot volume on day \(t\) against volume on day \(t-1\) as in the example below. In Figure 5, we add a dotted 45°-line to indicate a hypothetical one-to-one relation by geom_abline(), addressing potential differences in the axes’ scales.

trading_volume_figure = (
  ggplot(trading_volume, 
         aes(x="trading_volume_lag", y="trading_volume")) +
  geom_point() +
  geom_abline(aes(intercept=0, slope=1), linetype="dashed") +
  labs(x="Previous day aggregate trading volume",
       y="Aggregate trading volume",
       title=("Persistence in daily trading volume of DOW constituents "
              "in billion USD"))
)
trading_volume_figure.draw()
Title: Persistence in daily trading volume of DOW index constituents. The figure shows a scatterplot where aggregate trading volume and previous-day aggregate trading volume neatly line up along a 45-degree line.
Figure 5: The figure a scatterplot of aggregate trading volume against previous-day aggregate trading volume.

Portfolio Choice Problems

In the previous part, we show how to download stock market data and inspect it with graphs and summary statistics. Now, we move to a typical question in Finance: How to allocate wealth across different assets optimally. The standard framework for optimal portfolio selection considers investors that prefer higher future returns but dislike future return volatility (defined as the square root of the return variance, i.e., the risk): the mean-variance investor (Markowitz 1952).

An essential tool to evaluate portfolios in the mean-variance context is the efficient frontier, the set of portfolios that satisfies the condition that no other portfolio exists with a higher expected return but with the same volatility, see, e.g., Merton (1972). We compute and visualize the efficient frontier for several stocks. First, we extract each asset’s ly returns. In order to keep things simple, we work with a balanced panel and exclude DOW constituents for which we do not observe a price on every single trading day since the year 2000.

prices_monthly = (prices_daily
  .groupby("symbol")
  .apply(lambda x: x.assign(counts=x["adjusted"].dropna().count()))
  .reset_index(drop=True)
  .query("counts == counts.max()")
)

Next, we transform the returns from a tidy dataframe into a \((T \times N)\) matrix with one column for each of the \(N\) symbols and one row for each of the \(T\) trading days to compute the sample average return vector \[\hat\mu = \frac{1}{T}\sum\limits_{t=1}^T r_t, \tag{2}\] where \(r_t\) is the \(N\) vector of returns on date \(t\) and the sample covariance matrix \[\hat\Sigma = \frac{1}{T-1}\sum\limits_{t=1}^T (r_t - \hat\mu)(r_t - \hat\mu)'. \tag{3}\] We achieve this by using pivot() with the new column names from the column symbol and setting the values to adjusted.

In financial econometrics, a core focus falls on problems that arise if the investor has to rely on estimates \(\hat\mu\) and \(\hat\Sigma\) instead of using the vector of expected returns \(\mu\) and the variance-covariance matrix \(\Sigma\). We highlight the impact of estimation uncertainty on the portfolio performance in various backtesting applications in Parametric Portfolio Policies and Constrained Optimization and Backtesting.

For now, we focus on a much more restricted set of assumptions: The \(N\) assets are fixed, and the first two moments of the distribution of the returns are determined by the parameters \(\mu\) and \(\Sigma\). Thus, even though we proceed with the vector of sample average returns and the sample variance-covariance matrix, those will be handled as the true parameters of the return distribution for the rest of this chapter. We, therefore, refer to \(\Sigma\) and \(\mu\) instead of explicitly highlighting that the sample moments are estimates.

returns_matrix = (prices_monthly
  .pivot(columns="symbol", values="adjusted", index="date")
  .resample("m")
  .last()
  .pct_change()
  .dropna()
)
mu = np.array(returns_matrix.mean()).T
sigma = np.array(returns_matrix.cov())

Then, we compute the minimum variance portfolio weights \(\omega_\text{mvp}\) as well as the expected return \(\omega_\text{mvp}'\mu\) and volatility \(\sqrt{\omega_\text{mvp}'\Sigma\omega_\text{mvp}}\) of this portfolio. Recall that the minimum variance portfolio is the vector of portfolio weights that are the solution to \[\omega_\text{mvp} = \arg\min \omega'\Sigma \omega \text{ s.t. } \sum\limits_{i=1}^N\omega_i = 1. \tag{4}\] The constraint that weights sum up to one simply implies that all funds are distributed across the available asset universe, i.e., there is no possibility to retain cash. It is easy to show analytically that \(\omega_\text{mvp} = \frac{\Sigma^{-1}\iota}{\iota'\Sigma^{-1}\iota}\), where \(\iota\) is a vector of ones and \(\Sigma^{-1}\) is the inverse of \(\Sigma\). We provide the proof of the analytical solution in Proofs.

N = returns_matrix.shape[1]
iota = np.ones(N)
sigma_inv = np.linalg.inv(sigma) 

mvp_weights = sigma_inv @ iota
mvp_weights = mvp_weights/mvp_weights.sum()
mvp_return = mu.T @ mvp_weights
mvp_volatility = np.sqrt(mvp_weights.T @ sigma @ mvp_weights)
mvp_moments = pd.DataFrame({"value": [mvp_return, mvp_volatility]},
                           index=["average_ret", "volatility"])
mvp_moments.round(3)
value
average_ret 0.008
volatility 0.032

The command np.linalg.inv() returns the inverse of a matrix such that np.linalg.inv(sigma) delivers \(\Sigma^{-1}\) (if a unique solution exists).

Note that the monthly volatility of the minimum variance portfolio is of the same order of magnitude as the daily standard deviation of the individual components. Thus, the diversification benefits in terms of risk reduction are tremendous!

Next, we set out to find the weights for a portfolio that achieves, as an example, three times the expected return of the minimum variance portfolio. However, mean-variance investors are not interested in any portfolio that achieves the required return but rather in the efficient portfolio, i.e., the portfolio with the lowest standard deviation. If you wonder where the solution \(\omega_\text{eff}\) comes from: The efficient portfolio is chosen by an investor who aims to achieve minimum variance given a minimum acceptable expected return \(\bar{\mu}\). Hence, their objective function is to choose \(\omega_\text{eff}\) as the solution to \[\omega_\text{eff}(\bar{\mu}) = \arg\min \omega'\Sigma \omega \text{ s.t. } \omega'\iota = 1 \text{ and } \omega'\mu \geq \bar{\mu}. \tag{5}\]

In Proofs, we show that the efficient portfolio takes the form (for \(\bar{\mu} \geq D/C = \mu'\omega_\text{mvp}\)) \[\omega_\text{eff}\left(\bar\mu\right) = \omega_\text{mvp} + \frac{\tilde\lambda}{2}\left(\Sigma^{-1}\mu -\frac{D}{C}\Sigma^{-1}\iota \right)\] where \(C:= \iota'\Sigma^{-1}\iota\), \(D:= \iota'\Sigma^{-1}\mu\), \(E:= \mu'\Sigma^{-1}\mu\), and \(\tilde\lambda = 2\frac{\bar\mu - D/C}{E-D^2/C}\).

The code below implements the analytic solution to this optimization problem for a benchmark return \(\bar\mu\), which we set to 3 times the expected return of the minimum variance portfolio. We encourage you to verify that it is correct.

benchmark_multiple = 3
mu_bar = benchmark_multiple*mvp_return
C = iota.T @ sigma_inv @ iota
D = iota.T @ sigma_inv @ mu
E = mu.T @ sigma_inv @ mu
lambda_tilde = 2*(mu_bar-D/C)/(E-D**2/C)
efp_weights = mvp_weights+lambda_tilde/2*(sigma_inv @ mu-D*mvp_weights)

The Efficient Frontier

The mutual fund separation theorem states that as soon as we have two efficient portfolios (such as the minimum variance portfolio \(\omega_\text{mvp}\) and the efficient portfolio for a higher required level of expected returns \(\omega_\text{eff}(\bar{\mu})\), we can characterize the entire efficient frontier by combining these two portfolios. That is, any linear combination of the two portfolio weights will again represent an efficient portfolio. The code below implements the construction of the efficient frontier, which characterizes the highest expected return achievable at each level of risk. To understand the code better, make sure to familiarize yourself with the inner workings of the for loop.

length_year = 12
a = np.arange(-0.4, 2.0, 0.01)
results = pd.DataFrame(columns=["mu", "sd"], index=a).astype(float)

for i in a:
    w = (1-i)*mvp_weights+i*efp_weights
    results.loc[i, "mu"] = (w.T @ mu)*length_year
    results.loc[i, "sd"] = np.sqrt(w.T @ sigma @ w)*np.sqrt(length_year)

The code above proceeds in two steps: First, we compute a vector of combination weights \(a\), and then we evaluate the resulting linear combination with \(a\in\mathbb{R}\):
\[\omega^* = a\omega_\text{eff}(\bar\mu) + (1-a)\omega_\text{mvp} = \omega_\text{mvp} + \frac{\lambda^*}{2}\left(\Sigma^{-1}\mu -\frac{D}{C}\Sigma^{-1}\iota \right) \tag{6}\] with \(\lambda^* = 2\frac{a\bar\mu + (1-a)\tilde\mu - D/C}{E-D^2/C}\). It follows that \(\omega^* = \omega_\text{eff}\left(a\bar\mu + (1-a)\tilde\mu\right)\), in other words, \(\omega^*\) is an efficient portfolio that proofs the mutual fund separation theorem.

Finally, it is simple to visualize the efficient frontier alongside the two efficient portfolios within one powerful figure using the ggplot function from plotnine (see Figure 6). We also add the individual stocks in the same call. We compute annualized returns based on the simple assumption that monthly returns are independent and identically distributed. Thus, the average annualized return is just twelve times the expected monthly return.

mvp_return = (mu.T @ mvp_weights)*length_year
mvp_volatility = (np.sqrt(mvp_weights.T @ sigma @ mvp_weights)* 
                  np.sqrt(length_year))
efp_return = mu_bar*length_year
efp_volatility = (np.sqrt(efp_weights.T @ sigma @ efp_weights)* 
                  np.sqrt(length_year))

results_figure = (
  ggplot(results, aes(x="sd", y="mu")) +
  geom_point() +
  geom_point(
    pd.DataFrame({"mu": [mvp_return, efp_return],
                  "sd": [mvp_volatility, efp_volatility]}),
    size=4
  ) +
  geom_point(
    pd.DataFrame({"mu": mu*length_year,
                  "sd": np.sqrt(np.diag(sigma))*np.sqrt(length_year)})
  ) +
  labs(x="Annualized standard deviation",
       y="Annualized expected return",
       title="Efficient frontier for DOW index constituents") +
  scale_x_continuous(labels=percent_format()) +
  scale_y_continuous(labels=percent_format())
)
results_figure.draw()
Title: Efficient frontier for DOW index constituents. The figure shows DOW index constituents in a mean-variance diagram. A hyperbola indicates the efficient frontier of portfolios that dominate the individual holdings in the sense that they deliver higher expected returns for the same level of volatility.
Figure 6: The figure shows the efficient frontier for DOW index constituents. The big dots indicate the location of the minimum variance and the efficient portfolio that delivers three times the expected return of the minimum variance portfolio, respectively. The small dots indicate the location of the individual constituents.

The line in Figure 6 indicates the efficient frontier: the set of portfolios a mean-variance efficient investor would choose from. Compare the performance relative to the individual assets (the dots); it should become clear that diversifying yields massive performance gains (at least as long as we take the parameters \(\Sigma\) and \(\mu\) as given).

Exercises

  1. Download daily prices for another stock market symbol of your choice from Yahoo!Finance with yf.download() from the yfinance package. Plot two time series of the symbol’s unadjusted and adjusted closing prices. Explain the differences.
  2. Compute daily net returns for an asset of your choice and visualize the distribution of daily returns in a histogram using 100 bins. Also, use geom_vline() to add a dashed red vertical line that indicates the five percent quantile of the daily returns. Compute summary statistics (mean, standard deviation, minimum and maximum) for the daily returns.
  3. Take your code from before and generalize it such that you can perform all the computations for an arbitrary vector of symbols (e.g., symbol = ["AAPL", "MMM", "BA"]). Automate the download, the plot of the price time series, and create a table of return summary statistics for this arbitrary number of assets.
  4. Are days with high aggregate trading volume often also days with large absolute returns? Find an appropriate visualization to analyze the question using the symbol AAPL.
  5. Compute monthly returns from the downloaded stock market prices. Compute the vector of historical average returns and the sample variance-covariance matrix. Compute the minimum variance portfolio weights and the portfolio volatility and average returns. Visualize the mean-variance efficient frontier. Choose one of your assets and identify the portfolio which yields the same historical volatility but achieves the highest possible average return.
  6. In the portfolio choice analysis, we restricted our sample to all assets trading every day since 2000. How is such a decision a problem when you want to infer future expected portfolio performance from the results?
  7. The efficient frontier characterizes the portfolios with the highest expected return for different levels of risk. Identify the portfolio with the highest expected return per standard deviation. Which famous performance measure is close to the ratio of average returns to the standard deviation of returns?

References

Aroussi, Ran. 2023. yfinance.” https://pypi.org/project/yfinance/.
Goldstein, Itay, Ralph S J Koijen, and Holger M. Mueller. 2021. COVID-19 and its impact on financial markets and the real economy.” Review of Financial Studies 34 (11): 5135–48. https://doi.org/10.1093/rfs/hhab085.
Harris, Charles R., K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, et al. 2020. “Array Programming with NumPy.” Nature 585 (7825): 357–62. https://doi.org/10.1038/s41586-020-2649-2.
Kibirige, Hassan. 2023a. “Mizani: A Port of the Scales Package for Use with Plotnine.” https://pypi.org/project/mizani/.
———. 2023b. “Plotnine: An Implementation of the Grammar of Graphics in Python.” https://pypi.org/project/plotnine/.
Markowitz, Harry. 1952. Portfolio selection.” The Journal of Finance 7 (1): 77–91. https://doi.org/10.1111/j.1540-6261.1952.tb01525.x.
McKinney, Wes. 2010. Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 56–61. https://doi.org/10.25080/Majora-92bf1922-00a .
Merton, Robert C. 1972. An analytic derivation of the efficient portfolio frontier.” Journal of Financial and Quantitative Analysis 7 (4): 1851–72. https://doi.org/10.2307/2329621.
Tsay, Ruey S. 2010. Analysis of financial time series. John Wiley & Sons.
Wilkinson, Leland. 2012. The grammar of graphics. Springer.