Working with Stock Returns

Note

You are reading Tidy Finance with Python. You can find the equivalent chapter for the sibling Tidy Finance with R here.

The main aim of this chapter is to familiarize yourself with the tidyverse for working with stock market data. We focus on downloading and visualizing stock data from data provider Yahoo Finance.

At the start of each session, we load the required Python packages. Throughout the entire book, we always use the pandas () and numpy () packages. In this chapter, we also load the yfinance () package to download stock price data.

You typically have to install a package once before you can load it into your active R session. In case you have not done this yet, call, for instance, pip install pandas in your terminal.

import pandas as pd
import numpy as np
import yfinance as yf

Downloading Data

Note that import pandas as pd implies that we can call all pandas functions later with a simple pd.function(). Instead, utilizing from pandas import * is generally discouraged, as it leads to namespace pollution. This statement imports all functions and classes from pandas into your current namespace, potentially causing conflicts with functions you define or those from other imported libraries. Using the pd abbreviation is a very convenient way to prevent this.

We first download daily prices for one stock symbol, e.g., the Apple stock (AAPL), directly from the data provider Yahoo Finance. To download the data, you can use the function yf.download().

In the following code, we request daily data from the beginning of 2000 to the end of the last year, which is a period of more than 20 years.

prices = (yf.download(
    tickers="AAPL", 
    start="2000-01-01", 
    end="2023-12-31", 
    progress=False,
    auto_adjust=False,
    multi_level_index=False,
  )
  .reset_index()
  .assign(symbol="AAPL")
  .rename(columns={
    "Date": "date", 
    "Open": "open", 
    "High": "high",
    "Low": "low",
    "Close": "close", 
    "Adj Close": "adjusted", 
    "Volume": "volume"}
  )
)
prices.head().round(3)
date adjusted close high low open volume symbol
0 2000-01-03 0.842 0.999 1.004 0.908 0.936 535796800 AAPL
1 2000-01-04 0.771 0.915 0.988 0.903 0.967 512377600 AAPL
2 2000-01-05 0.782 0.929 0.987 0.920 0.926 778321600 AAPL
3 2000-01-06 0.715 0.848 0.955 0.848 0.948 767972800 AAPL
4 2000-01-07 0.749 0.888 0.902 0.853 0.862 460734400 AAPL

yf.download() downloads stock market data from Yahoo Finance. The above code chunk returns a data frame with eight self-explanatory columns: symbol, date, the daily volume (in the number of traded shares), the market prices at the open, high, low, close, and the adjusted price in USD. The adjusted prices are corrected for anything that might affect the stock price after the market closes, e.g., stock splits and dividends. These actions affect the quoted prices, but they have no direct impact on the investors who hold the stock. Therefore, we often rely on adjusted prices when it comes to analyzing the returns an investor would have earned by holding the stock continuously.

Next, we use the plotnine package () to visualize the time series of adjusted prices in . This package takes care of visualization tasks based on the principles of the grammar of graphics (). Note that generally, we do not recommend using the * import style. However, we use it here only for the plotting functions, which are distinct to plotnine and have very plotting-related names. So, the risk of misuse through a polluted namespace is marginal.

from plotnine import *

Creating figures becomes very intuitive with the Grammar of Graphics, as the following code chunk demonstrates.

(
  ggplot(prices, 
         aes(y="adjusted", x="date"))
  + geom_line()
  + labs(
      x="", y="",
      title="Apple stock prices from 2000 to 2023"
    )
)
Title: Apple stock prices between beginning of 2000 and end of 2023. The figure shows that the stock price of Apple increased from about 1 USD to around 125 USD.
Figure 1: Prices are in USD, adjusted for dividend payments and stock splits.

Computing Returns

Instead of analyzing prices, we compute daily returns defined as rt=pt/pt11, where pt is the adjusted price at the end of day t. In that context, the function lag() is helpful by returning the previous value.

returns = (prices
  .sort_values("date")
  .assign(ret=lambda x: x["adjusted"].pct_change())
  .get(["symbol", "date", "ret"])
)
returns
symbol date ret
0 AAPL 2000-01-03 NaN
1 AAPL 2000-01-04 -0.084310
2 AAPL 2000-01-05 0.014633
3 AAPL 2000-01-06 -0.086538
4 AAPL 2000-01-07 0.047369
... ... ... ...
6032 AAPL 2023-12-22 -0.005548
6033 AAPL 2023-12-26 -0.002841
6034 AAPL 2023-12-27 0.000518
6035 AAPL 2023-12-28 0.002226
6036 AAPL 2023-12-29 -0.005424

6037 rows × 3 columns

The resulting data frame has three columns, the last of which contains the daily returns (ret). Note that the first entry naturally contains a missing value (NaN) because there is no previous price. Obviously, the use of pct_change() would be meaningless if the time series is not ordered by ascending dates. The function sort_values() provides a convenient way to order observations in the correct way for our application. In case you want to order observations by descending dates, you can use the parameter ascending=False.

For the upcoming examples, we remove missing values as these would require separate treatment for many applications. For example, missing values can affect sums and averages by reducing the number of valid data points if not properly accounted for. In general, always ensure you understand why NaN values occur and carefully examine if you can simply get rid of these observations.

returns = returns.dropna() 

Next, we visualize the distribution of daily returns in a histogram in . Additionally, we draw a dashed line that indicates the historical five percent quantile of the daily returns to the histogram, which is a crude proxy for the worst possible return of the stock with a probability of at most five percent. This quantile is closely connected to the (historical) value-at-risk, a risk measure commonly monitored by regulators. We refer to Tsay () for a more thorough introduction to the stylized facts of financial returns.

from mizani.formatters import percent_format

quantile_05 = returns["ret"].quantile(0.05)

(
  ggplot(returns, aes(x="ret"))
  + geom_histogram(bins=100)
  + geom_vline(aes(xintercept=quantile_05), linetype="dashed")
  + labs(
      x="", y="",
      title="Distribution of daily Apple stock returns"
    )
  + scale_x_continuous(labels=percent_format())
)
Title: Distribution of daily Apple stock returns in percent. The figure shows a histogram of daily returns. The range indicates a few large negative values, while the remaining returns are distributed around zero. The vertical line indicates that the historical five percent quantile of daily returns was around negative three percent.
Figure 2: The dotted vertical line indicates the historical five percent quantile.

Here, bins=100 determines the number of bins used in the illustration and, hence, implicitly sets the width of the bins. Before proceeding, make sure you understand how to use the geom geom_vline() to add a dashed line that indicates the historical five percent quantile of the daily returns. Before proceeding with any data, a typical task is to compute and analyze the summary statistics for the main variables of interest.

pd.DataFrame(returns["ret"].describe()).round(3).T
count mean std min 25% 50% 75% max
ret 6036.0 0.001 0.025 -0.519 -0.01 0.001 0.013 0.139

We see that the maximum daily return was 13.9 percent. Perhaps not surprisingly, the average daily return is close to but slightly above 0. In line with the illustration above, the large losses on the day with the minimum returns indicate a strong asymmetry in the distribution of returns.

You can also compute these summary statistics for each year individually by imposing groupby(returns["date"].dt.year), where the call dt.year returns the year. More specifically, the few lines of code below compute the summary statistics from above for individual groups of data defined by the values of the column year. The summary statistics, therefore, allow an eyeball analysis of the time-series dynamics of the daily return distribution.

(returns["ret"]
  .groupby(returns["date"].dt.year)
  .describe()
  .round(3)
)
count mean std min 25% 50% 75% max
date
2000 251.0 -0.003 0.055 -0.519 -0.034 -0.002 0.027 0.137
2001 248.0 0.002 0.039 -0.172 -0.023 -0.001 0.027 0.129
2002 252.0 -0.001 0.031 -0.150 -0.019 -0.003 0.018 0.085
2003 252.0 0.002 0.023 -0.081 -0.012 0.002 0.015 0.113
2004 252.0 0.005 0.025 -0.056 -0.009 0.003 0.016 0.132
2005 252.0 0.003 0.024 -0.092 -0.010 0.003 0.017 0.091
2006 251.0 0.001 0.024 -0.063 -0.014 -0.002 0.014 0.118
2007 251.0 0.004 0.024 -0.070 -0.009 0.003 0.018 0.105
2008 253.0 -0.003 0.037 -0.179 -0.024 -0.001 0.019 0.139
2009 252.0 0.004 0.021 -0.050 -0.009 0.002 0.015 0.068
2010 252.0 0.002 0.017 -0.050 -0.006 0.002 0.011 0.077
2011 252.0 0.001 0.017 -0.056 -0.009 0.001 0.011 0.059
2012 250.0 0.001 0.019 -0.064 -0.008 0.000 0.012 0.089
2013 252.0 0.000 0.018 -0.124 -0.009 -0.000 0.011 0.051
2014 252.0 0.001 0.014 -0.080 -0.006 0.001 0.010 0.082
2015 252.0 0.000 0.017 -0.061 -0.009 -0.001 0.009 0.057
2016 252.0 0.001 0.015 -0.066 -0.006 0.001 0.008 0.065
2017 251.0 0.002 0.011 -0.039 -0.004 0.001 0.007 0.061
2018 251.0 -0.000 0.018 -0.066 -0.009 0.001 0.009 0.070
2019 252.0 0.003 0.016 -0.100 -0.005 0.003 0.012 0.068
2020 253.0 0.003 0.029 -0.129 -0.010 0.002 0.017 0.120
2021 252.0 0.001 0.016 -0.042 -0.008 0.001 0.012 0.054
2022 251.0 -0.001 0.022 -0.059 -0.016 -0.001 0.014 0.089
2023 250.0 0.002 0.013 -0.048 -0.006 0.002 0.009 0.047

Scaling Up the Analysis

As a next step, we generalize the previous code so that all computations can handle an arbitrary number of symbols (e.g., all constituents of an index). Following tidy principles, it is quite easy to download the data, plot the price time series, and tabulate the summary statistics for an arbitrary number of assets.

This is where the tidyverse magic starts: Tidy data makes it extremely easy to generalize the computations from before to as many assets or groups as you like. The following code takes any number of symbols, e.g., symbol = ["AAPL", "MMM", "BA"], and automates the download as well as the plot of the price time series. In the end, we create the table of summary statistics for all assets at once. For this example, we analyze data from all current constituents of the Dow Jones Industrial Average index.

We first download a table with DOW Jones constituents from an external website. Note that you need to temporarily modify the behavior of handling SSL certificates in Python’s ssl module when reading the constituents directly from the web. This approach should be used with caution, which is why we revert the setting to the default behavior after the successful data download.

import ssl
ssl._create_default_https_context = ssl._create_unverified_context

url = ("https://www.ssga.com/us/en/institutional/etfs/library-content/"
       "products/fund-data/etfs/us/holdings-daily-us-en-dia.xlsx")

symbols = (pd.read_excel(url, skiprows=4, nrows=30)
  .get("Ticker")
  .tolist()
)

ssl._create_default_https_context = ssl.create_default_context

Conveniently, yf.download() provides the functionality to get all stock prices from an index for a specific point in time with a single call.

prices_daily = (yf.download(
    tickers=symbols, 
    start="2000-01-01", 
    end="2023-12-31", 
    progress=False,
    auto_adjust=False,
    multi_level_index=False
  ))

prices_daily = (prices_daily
  .stack()
  .reset_index(level=1, drop=False)
  .reset_index()
  .rename(columns={
    "Date": "date",
    "Ticker": "symbol",
    "Open": "open",
    "High": "high",
    "Low": "low",
    "Close": "close",
    "Adj Close": "adjusted",
    "Volume": "volume"}
  )
)

The resulting data frame contains 177,925 daily observations for 30 different stocks. illustrates the time series of the downloaded adjusted prices for each of the constituents of the Dow index. Make sure you understand every single line of code! What are the arguments of aes()? Which alternative geoms could you use to visualize the time series? Hint: if you do not know the answers try to change the code to see what difference your intervention causes.

from mizani.breaks import date_breaks
from mizani.formatters import date_format

(
  ggplot(prices_daily, 
         aes(y="adjusted", x="date", color="symbol"))
  + geom_line()
  + scale_x_datetime(date_breaks="5 years", date_labels="%Y")
  + labs(
      x="", y="", color="",
      title="Stock prices of DOW index constituents"
    )
  + theme(legend_position="none")
)
Title: Stock prices of Dow Jones index constituents. The figure shows many time series with daily prices. The general trend seems positive for most stocks in the Dow Jones index.
Figure 3: Prices in USD, adjusted for dividend payments and stock splits.

Do you notice the small differences relative to the code we used before? All we needed to do to illustrate all stock symbols simultaneously is to include color = symbol in the ggplot aesthetics. In this way, we generate a separate line for each symbol. Of course, there are simply too many lines on this graph to identify the individual stocks properly, but it illustrates our point of how to generalize a specific analysis to an arbitrage number of subjects quite well.

The same holds for stock returns. Before computing the returns, we use groupby("symbol") such that the assign() command is performed for each symbol individually. The same logic also applies to the computation of summary statistics: groupby("symbol") is the key to aggregating the time series into symbol-specific variables of interest.

returns_daily = (prices_daily
  .assign(ret=lambda x: x.groupby("symbol")["adjusted"].pct_change())
  .get(["symbol", "date", "ret"])
  .dropna(subset="ret")
)

(returns_daily
  .groupby("symbol")["ret"]
  .describe()
  .round(3)
)
count mean std min 25% 50% 75% max
symbol
AAPL 6036.0 0.001 0.025 -0.519 -0.010 0.001 0.013 0.139
AMGN 6036.0 0.000 0.019 -0.134 -0.009 0.000 0.009 0.151
AMZN 6036.0 0.001 0.032 -0.248 -0.012 0.000 0.014 0.345
AXP 6036.0 0.001 0.023 -0.176 -0.009 0.000 0.010 0.219
BA 6036.0 0.001 0.022 -0.238 -0.010 0.001 0.011 0.243
CAT 6036.0 0.001 0.020 -0.145 -0.010 0.001 0.011 0.147
CRM 4914.0 0.001 0.027 -0.271 -0.012 0.001 0.014 0.260
CSCO 6036.0 0.000 0.023 -0.162 -0.009 0.000 0.010 0.244
CVX 6036.0 0.001 0.018 -0.221 -0.008 0.001 0.009 0.227
DIS 6036.0 0.000 0.019 -0.184 -0.009 0.000 0.009 0.160
GS 6036.0 0.001 0.023 -0.190 -0.010 0.000 0.011 0.265
HD 6036.0 0.001 0.019 -0.287 -0.008 0.001 0.009 0.141
HON 6036.0 0.000 0.019 -0.174 -0.008 0.001 0.009 0.282
IBM 6036.0 0.000 0.016 -0.155 -0.007 0.000 0.008 0.120
JNJ 6036.0 0.000 0.012 -0.158 -0.005 0.000 0.006 0.122
JPM 6036.0 0.001 0.024 -0.207 -0.009 0.000 0.010 0.251
KO 6036.0 0.000 0.013 -0.101 -0.005 0.000 0.006 0.139
MCD 6036.0 0.001 0.015 -0.159 -0.006 0.001 0.007 0.181
MMM 6036.0 0.000 0.015 -0.129 -0.007 0.000 0.008 0.126
MRK 6036.0 0.000 0.017 -0.268 -0.007 0.000 0.008 0.130
MSFT 6036.0 0.001 0.019 -0.156 -0.008 0.000 0.009 0.196
NKE 6036.0 0.001 0.019 -0.198 -0.008 0.001 0.009 0.155
NVDA 6036.0 0.002 0.038 -0.352 -0.016 0.001 0.018 0.424
PG 6036.0 0.000 0.013 -0.302 -0.005 0.000 0.006 0.120
SHW 6036.0 0.001 0.018 -0.208 -0.008 0.001 0.009 0.153
TRV 6036.0 0.001 0.018 -0.208 -0.007 0.001 0.008 0.256
UNH 6036.0 0.001 0.020 -0.186 -0.008 0.001 0.010 0.348
V 3973.0 0.001 0.019 -0.136 -0.008 0.001 0.009 0.150
VZ 6036.0 0.000 0.015 -0.118 -0.007 0.000 0.007 0.146
WMT 6036.0 0.000 0.015 -0.114 -0.007 0.000 0.007 0.117

Different Frequencies

Financial data often exists at different frequencies due to varying reporting schedules, trading calendars, and economic data releases. For example, stock prices are typically recorded daily, while macroeconomic indicators such as GDP or inflation are reported monthly or quarterly. Additionally, some datasets are recorded only when transactions occur, resulting in irregular timestamps. To compare data meaningfully, we have to align different frequencies appropriately. For example, to compare returns across different frequencies, we use annualization techniques.

So far, we have worked with daily returns, but we can easily convert our data to other frequencies. Let’s create monthly returns from our daily data:

returns_monthly = (returns_daily
  .assign(date=returns_daily["date"].dt.to_period("M").dt.to_timestamp())
  .groupby(["symbol", "date"], as_index=False)
  .agg(ret=("ret", lambda x: np.prod(1 + x) - 1))
)

In this code, we first group the data by symbol and month and then compute monthly returns by compounding the daily returns: (1+r1)(1+r2)(1+rn)1. To visualize how return characteristics change across different frequencies, we can compare histograms as in :

apple_daily = (returns_daily
  .query("symbol == 'AAPL'")
  .assign(frequency="Daily")
)

apple_monthly = (returns_monthly
  .query("symbol == 'AAPL'")
  .assign(frequency="Monthly")
)

apple_returns = pd.concat([apple_daily, apple_monthly], ignore_index=True)

(
  ggplot(apple_returns, aes(x="ret", fill="frequency"))
  + geom_histogram(position="identity", bins=50)
  + labs(
      x="", y="", fill="Frequency",
      title="Distribution of Apple returns across different frequencies"
  )
  + scale_x_continuous(labels=percent_format())
  + facet_wrap("frequency", scales="free")
  + theme(legend_position="none")
)
Title: Distribution of Apple returns across different frequencies. The figure shows the distribution of daily and monthly returns in two separate facets.
Figure 4: Returns are based on prices adjusted for dividend payments and stock splits.

Other Forms of Data Aggregation

Of course, aggregation across variables other than symbol or date can also make sense. For instance, suppose you are interested in answering questions like: Are days with high aggregate trading volume likely followed by days with high aggregate trading volume? To provide some initial analysis on this question, we take the downloaded data and compute aggregate daily trading volume for all Dow index constituents in USD. Recall that the column volume is denoted in the number of traded shares. Thus, we multiply the trading volume with the daily adjusted closing price to get a proxy for the aggregate trading volume in USD. Scaling by 1e-9 (Python can handle scientific notation) denotes daily trading volume in billion USD.

trading_volume = (prices_daily
  .assign(trading_volume=lambda x: (x["volume"]*x["adjusted"])/1e9)
  .groupby("date")["trading_volume"]
  .sum()
  .reset_index()
  .assign(trading_volume_lag=lambda x: x["trading_volume"].shift(periods=1))
)

(
  ggplot(trading_volume, 
          aes(x="date", y="trading_volume"))
  + geom_line()
  + scale_x_datetime(date_breaks="5 years", date_labels="%Y")
  + labs(
      x="", y="",
      title=("Aggregate daily trading volume of DOW index constituents "
              "in billion USD")
    )
)
Title: Aggregate daily trading volume of Dow Jones index constitutens. The figure shows a volatile time series of daily trading volume, ranging from 15 in 2000 to 20.5 in 2023, with a maximum of more than 100.
Figure 5: Total daily trading volume in billion USD.

indicates a clear upward trend in aggregated daily trading volume. In particular, since the outbreak of the COVID-19 pandemic, markets have processed substantial trading volumes, as analyzed, for instance, by Goldstein, Koijen, and Mueller (). One way to illustrate the persistence of trading volume would be to plot volume on day t against volume on day t1 as in the example below. In , we add a dotted 45°-line to indicate a hypothetical one-to-one relation by geom_abline(), addressing potential differences in the axes’ scales.

(
  ggplot(trading_volume, 
         aes(x="trading_volume_lag", y="trading_volume")) +
  geom_point() +
  geom_abline(aes(intercept=0, slope=1), linetype="dashed") +
  labs(x="Previous day aggregate trading volume",
       y="Aggregate trading volume",
       title=("Persistence in daily trading volume of DOW constituents "
              "in billion USD"))
)
Title: Persistence in daily trading volume of Dow Jones index constituents. The figure shows a scatterplot where aggregate trading volume and previous-day aggregate trading volume neatly line up along a 45-degree line.
Figure 6: Total daily trading volume in billion USD.

Purely eye-balling reveals that days with high trading volume are often followed by similarly high trading volume days.

Key Takeaways

In this chapter, you learned how to effectively use R to download, analyze, and visualize stock market data using tidy principles:

  • Tidy data principles enable efficient analysis of financial data.
  • Adjusted prices account for corporate actions like stock splits and dividends.
  • Summary statistics help identify key patterns in financial data.
  • Visualization techniques reveal trends and distributions in returns.
  • Data manipulations with the tidyverse scale easily to multiple assets.
  • Consistent workflows form the foundation for advanced financial analysis.

Exercises

  1. Download daily prices for another stock market symbol of your choice from Yahoo Finance using yf.download(). Plot two time series of the symbol’s un-adjusted and adjusted closing prices. Explain any visible differences.
  2. Compute daily net returns for an asset of your choice and visualize the distribution of daily returns in a histogram using 100 bins. Also, use geom_vline() to add a dashed red vertical line that indicates the 5 percent quantile of the daily returns. Compute summary statistics (mean, standard deviation, minimum, and maximum) for the daily returns.
  3. Take your code from the previous exercises and generalize it such that you can perform all the computations for an arbitrary number of symbols (e.g., symbol = ["AAPL", "MMM", "BA"]). Automate the download, the plot of the price time series, and create a table of return summary statistics for this arbitrary number of assets.
  4. To facilitate the computation of the annualization factor, write a function that takes a vector of return dates as input and determines the frequency before returning the appropriate annualization factor.
  5. Are days with high aggregate trading volume often also days with large absolute returns? Find an appropriate visualization to analyze the question using the symbol AAPL.

References

Aroussi, Ran. 2023. yfinance.” https://pypi.org/project/yfinance/.
Goldstein, Itay, Ralph S J Koijen, and Holger M. Mueller. 2021. COVID-19 and its impact on financial markets and the real economy.” Review of Financial Studies 34 (11): 5135–48. https://doi.org/10.1093/rfs/hhab085.
Harris, Charles R., K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, et al. 2020. “Array Programming with NumPy.” Nature 585 (7825): 357–62. https://doi.org/10.1038/s41586-020-2649-2.
Kibirige, Hassan. 2023. “Plotnine: An Implementation of the Grammar of Graphics in Python.” https://pypi.org/project/plotnine/.
McKinney, Wes. 2010. Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by Stéfan van der Walt and Jarrod Millman, 56–61. https://doi.org/10.25080/Majora-92bf1922-00a .
Tsay, Ruey S. 2010. Analysis of financial time series. John Wiley & Sons.
Wilkinson, Leland. 2012. The grammar of graphics. Springer.