113. Unemployment Dynamics with Asymmetric Shocks#
GPU
This lecture was built using a machine with access to a GPU — although it will also run without one.
Google Colab has a free tier with GPUs that you can access as follows:
Click on the “play” icon top right
Select Colab
Set the runtime environment to include a GPU
This lecture is a sequel to A Linear Model of Unemployment.
There we fit a linear AR(1) model to the US unemployment rate and examined persistence.
One thing we found but didn’t address is that the model’s residuals were both heavy-tailed and right-skewed.
Unemployment jumps up sharply in recessions and drifts down slowly in recoveries.
A model with symmetric Gaussian shocks cannot reproduce these features.
In this lecture we allow the innovations to be right-skewed and, to some degree, heavy-tailed.
We also use this lecture to develop a very useful Bayesian technique: model comparison by leave-one-out cross-validation.
We use this to decide whether one model really predicts better than another.
Our plan is to
build a model with a linear mean but asymmetric, occasionally-large shocks,
estimate it on the annual data, and
compare it to the original Gaussian model with cross-validation.
In addition to what’s in Anaconda, this lecture needs the following libraries.
!pip install numpyro jax pandas_datareader arviz
We’ll use the following imports.
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from pandas_datareader import data as web
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS, log_likelihood
Matplotlib is building the font cache; this may take a moment.
113.1. The data#
We use the same series as in A Linear Model of Unemployment — the US unemployment
rate (UNRATE from FRED), excluding the COVID-19 spike — and work at an annual
frequency.
start, end = dt.datetime(1948, 1, 1), dt.datetime(2024, 12, 31)
unrate = web.DataReader("UNRATE", "fred", start, end)["UNRATE"]
pre_covid = unrate[unrate.index < "2020-01-01"]
u_annual = pre_covid.resample("YE").last().to_numpy()
years = pre_covid.resample("YE").last().index.year
The shape we want to capture is plain in the annual series.
fig, ax = plt.subplots()
ax.plot(years, u_annual, lw=2, marker='o', ms=3)
ax.set_xlabel('year')
ax.set_ylabel('unemployment rate (%)')
plt.show()
Fig. 113.1 Annual US unemployment#
In Fig. 113.1 each recession is a sharp rise followed by a long, gradual decline — fast up, slow down.
113.2. A model with asymmetric shocks#
We keep the linear structure from A Linear Model of Unemployment:
Here, however, the innovation \(\eta_{t+1}\) is drawn from a mixture of two normals:
The model has six parameters:
symbol |
name |
role |
|---|---|---|
\(\bar u\) |
floor |
the low level the series reverts to |
\(\rho\) |
persistence |
speed of reversion (as in the linear lecture) |
\(p\) |
jump probability |
how often a large shock arrives |
\(\mu_J\) |
jump size |
the average upward kick of a recession |
\(\sigma_s,\ \sigma_J\) |
shock spreads |
the quiet and jumpy volatilities |
The mechanism leads to a sawtooth dynamic for the unemployment rate.
Most years are quiet, with small noise.
Occasionally a positive jump pushes unemployment up by a large amount.
Then, in most cases, the linear reversion pulls it slowly back down toward \(\bar u\).
Thus, a spike arrives in one step, while the recovery takes time — the asymmetric dynamics we observe in the data.
This is, in spirit, Milton Friedman’s “plucking model”: a floor near full employment, from which bad shocks pluck the series upward.
The next cell shows the density of the shock and its two components.
def normal_pdf(x, m, s):
return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))
η = np.linspace(-3, 6, 400)
p_, μJ_, σs_, σJ_ = 0.25, 2.0, 0.4, 1.2
quiet = (1 - p_) * normal_pdf(η, 0.0, σs_)
jump = p_ * normal_pdf(η, μJ_, σJ_)
fig, ax = plt.subplots()
ax.plot(η, quiet + jump, lw=2, label='shock density')
ax.fill_between(η, quiet, alpha=0.3, label='quiet years')
ax.fill_between(η, jump, alpha=0.3, label='recession jumps')
ax.set_xlabel('innovation $\\eta$')
ax.legend()
plt.show()
Fig. 113.2 Shock distribution as a mixture#
Fig. 113.2 shows a tall, narrow peak at zero for ordinary years, plus a low, broad bump on the positive side for the recession jumps.
113.3. Bayesian estimation#
We place weakly informative priors on the six parameters.
The mixture is written with NumPyro’s MixtureSameFamily, which sums over the two components analytically.
def jump_model(u):
# Set the priors
ubar = numpyro.sample("ubar", dist.Normal(4.5, 1.5))
ρ = numpyro.sample("rho", dist.Uniform(0.0, 1.0))
p = numpyro.sample("p", dist.Beta(2.0, 8.0))
μ_J = numpyro.sample("mu_J", dist.HalfNormal(2.0))
σ_s = numpyro.sample("sigma_s", dist.HalfNormal(0.5))
σ_J = numpyro.sample("sigma_J", dist.HalfNormal(1.5))
# Build the model
n = u.shape[0] - 1
base = ubar + ρ * (u[:-1] - ubar)
locs = jnp.stack([base, base + μ_J], axis=-1)
scales = jnp.stack([jnp.broadcast_to(σ_s, (n,)),
jnp.broadcast_to(σ_J, (n,))], axis=-1)
probs = jnp.broadcast_to(jnp.stack([1 - p, p]), (n, 2))
mix = dist.MixtureSameFamily(dist.Categorical(probs=probs),
dist.Normal(locs, scales))
numpyro.sample("u_obs", mix, obs=u[1:])
We sample with NUTS, four chains, vectorized so the code runs on a CPU or a GPU.
def run_nuts(model, data, seed=0, num_warmup=2000, num_samples=4000, num_chains=4):
"Sample a NumPyro model with the NUTS sampler."
mcmc = MCMC(NUTS(model),
num_warmup=num_warmup, num_samples=num_samples,
num_chains=num_chains, chain_method="vectorized",
progress_bar=False)
mcmc.run(random.PRNGKey(seed), jnp.asarray(data))
return mcmc
We fit the model to the annual data.
mcmc_jump = run_nuts(jump_model, u_annual)
Let us look at the posterior.
mcmc_jump.print_summary()
mean std median 5.0% 95.0% n_eff r_hat
mu_J 1.26 0.43 1.20 0.57 2.00 3583.57 1.00
p 0.35 0.09 0.35 0.19 0.49 4178.84 1.00
rho 0.83 0.04 0.84 0.76 0.90 5632.75 1.00
sigma_J 1.28 0.26 1.29 0.83 1.73 3970.18 1.00
sigma_s 0.39 0.09 0.38 0.25 0.54 3715.71 1.00
ubar 3.03 0.74 3.09 1.81 4.19 5156.56 1.00
Number of divergences: 0
The r_hat values are essentially one, so the chains have converged.
Notice the distinct jump component: the normal year spread \(\sigma_s\) is small and the jump spread \(\sigma_J\) is large, with a positive mean \(\mu_J\).
Let’s do a quick check by overlaying the fitted shock density on the model’s actual residuals, as we did in A Linear Model of Unemployment.
m = {k: np.median(np.asarray(mcmc_jump.get_samples()[k]))
for k in ("ubar", "rho", "p", "mu_J", "sigma_s", "sigma_J")}
resid = u_annual[1:] - (m["ubar"] + m["rho"] * (u_annual[:-1] - m["ubar"]))
η = np.linspace(resid.min() - 0.5, resid.max() + 0.5, 400)
quiet = (1 - m["p"]) * normal_pdf(η, 0.0, m["sigma_s"])
jump = m["p"] * normal_pdf(η, m["mu_J"], m["sigma_J"])
fig, ax = plt.subplots()
ax.hist(resid, bins=25, density=True, alpha=0.5, label='residuals')
ax.plot(η, quiet + jump, lw=2, label='fitted density')
ax.fill_between(η, jump, alpha=0.3, color='C2', label='jump component')
ax.set_xlabel('innovation $\\eta$')
ax.legend()
plt.show()
Fig. 113.3 Fitted shock density against the residuals#
The fit looks quite good (although we are still struggling to capture some of the large shocks in the right tail).
113.4. Comparing models with cross-validation#
The jump model looks reasonable, but is it actually better than the linear one?
To answer this question, we will use a form of Bayesian model comparison.
We build it up in stages: the guiding principle, the leave-one-out estimate, and how the computation is actually done.
113.4.1. The guiding principle: out-of-sample predictions#
It is tempting to compare two models by asking which one fits the observed data better.
But this isn’t the right criterion: more complex models can always be tuned to better match the data they are fitting.
The right question is instead: which model better predicts data it has not seen?
In other words, the criterion we care about is out-of-sample predictive accuracy.
We do not have new data for out-of-sample testing, so we manufacture some using cross-validation.
113.4.2. Leave-one-out cross-validation#
The idea is to leave out one observation, fit the model to the rest, and then score the held-out point using a predictive distribution constructed from the rest of the data set.
The predictive distribution for data point \(u_i\) given the rest of the data \(u_{-i}\) is
Here \(p(\theta \mid u_{-i})\) is the posterior obtained from the data in \(u_{-i}\).
We score each held-out point by the log of this density.
This rewards a model for putting high probability on what actually happened.
Adding these scores over all data points gives the expected log predictive density (elpd), here in its leave-one-out (LOO) form,
113.4.3. Computational methods#
The definition of elpd suggests that we need to refit the model \(n\) times, once for each omitted point.
This would be very time-consuming.
Here we discuss a method that removes all but the first fit, using the full posterior — which we have already sampled.
Regarding notation: let \(u = (u_1, \dots, u_n)\), and write \(u_{-i}\) for the sample with \(u_i\) deleted.
The method depends on the following independence-type assumption:
Assumption 113.1
Given \(\theta\), the observations are conditionally independent:
For our autoregressive data this assumption is false, since each \(u_i\) is dependent on earlier observations through the transition dynamics.
We adopt it nonetheless, since
it’s what makes the leave-one-out computation tractable and
it’s a very standard procedure that needs to be understood.
We discuss the consequences in the section on time-series structure below.
Conditional on our assumption, the proposition that follows is exact.
Proposition 113.1
Under Assumption 113.1, we have
Proof. By Bayes’ rule, \(p(\theta \mid u) = p(u \mid \theta)\, p(\theta) / p(u)\) and \(p(\theta \mid u_{-i}) = p(u_{-i} \mid \theta)\, p(\theta) / p(u_{-i})\).
By Assumption 113.1, \(p(u \mid \theta) = p(u_i \mid \theta)\, p(u_{-i} \mid \theta)\).
Dividing the two posteriors and substituting this factorization,
so, rearranging,
Now integrate both sides over \(\theta\).
On the right, \(\int p(\theta \mid u_{-i})\, d\theta = 1\), leaving the constant factor \(p(u_{-i}) / p(u)\), so
the last equality by \(p(u) = p(u_i \mid u_{-i})\, p(u_{-i})\).
The leave-\(i\)-out posterior has vanished; only the full posterior remains, and of it we have \(S\) draws \(\theta^1, \dots, \theta^S\).
Replacing the integral in Proposition 113.1 by the average over these draws gives
the harmonic mean of the per-draw likelihoods of \(u_i\).
The integrand has finite mean, equal to \(1/p(u_i \mid u_{-i})\) itself, so the law of large numbers applies.
The average therefore converges to that mean as \(S \to \infty\), and the approximation becomes exact, with probability one, in the limit.
Its one ingredient is the likelihood of each observation under each draw — the pointwise log-likelihood — which is why we have been careful to keep it.
For numerical stability we compute on the log scale, where the harmonic mean becomes
which is the jnp.log(S) - logsumexp(-ll) we use below.
To compare against the linear model, we first fit it on the same data.
def linear_model(u):
ubar = numpyro.sample("ubar", dist.Normal(5.5, 2.0))
ρ = numpyro.sample("rho", dist.Uniform(0.0, 1.0))
σ = numpyro.sample("sigma", dist.HalfNormal(1.0))
numpyro.sample("u_obs", dist.Normal(ubar + ρ * (u[:-1] - ubar), σ), obs=u[1:])
We sample it on the same annual data.
mcmc_lin = run_nuts(linear_model, u_annual)
Now we can do the whole leave-one-out calculation by hand in a few lines.
from jax.scipy.special import logsumexp
def pointwise_loo(mcmc, model):
"Leave-one-out log predictive density for each observation."
ll = log_likelihood(model, mcmc.get_samples(),
u=jnp.asarray(u_annual))["u_obs"] # (draws, obs)
S = ll.shape[0]
# log of the harmonic mean of the per-draw likelihoods, per observation
return jnp.log(S) - logsumexp(-ll, axis=0)
elpd_jump = pointwise_loo(mcmc_jump, jump_model)
elpd_lin = pointwise_loo(mcmc_lin, linear_model)
print(f"jump elpd_loo = {elpd_jump.sum():.1f}")
print(f"linear elpd_loo = {elpd_lin.sum():.1f}")
jump elpd_loo = -93.9
linear elpd_loo = -105.7
The jump model scores higher (closer to zero), so on out-of-sample prediction it wins.
To help understand whether the gap is real, we compute its standard error:
diff = elpd_jump - elpd_lin
n = diff.size
print(f"elpd difference = {diff.sum():.1f}")
print(f"standard error = {jnp.sqrt(n) * diff.std():.1f}")
elpd difference = 11.8
standard error = 5.7
The jump model is ahead by about twelve points, against a standard error near six — a bit over two standard errors, so a real improvement.
113.4.4. ArviZ implementation#
In practice, most people do this calculation with a library like ArviZ.
We hand it the same pointwise log-likelihoods, packaged in its data format.
import arviz as az
import xarray as xr
def to_arviz(mcmc, model, u):
"Package a NumPyro fit for ArviZ, with pointwise log-likelihoods."
idata = az.from_numpyro(mcmc)
ll = np.asarray(log_likelihood(model, mcmc.get_samples(),
u=jnp.asarray(u))["u_obs"])
ll = ll.reshape(mcmc.num_chains, -1, ll.shape[-1])
idata["log_likelihood"] = xr.Dataset(
{"u_obs": (("chain", "draw", "obs"), ll)})
return idata
az.compare({
"linear": to_arviz(mcmc_lin, linear_model, u_annual),
"jump": to_arviz(mcmc_jump, jump_model, u_annual),
})
| rank | elpd_diff | dse | p_worse | diag_diff | diag_elpd | p | elpd | se | weight | |
|---|---|---|---|---|---|---|---|---|---|---|
| jump | 0 | 0.0 | 0.0 | NaN | 6.5 | -90.0 | 9.2 | 0.87 | ||
| linear | 1 | -10.0 | 5.7 | 0.98 | N < 100 | 3.6 | -110.0 | 7.3 | 0.13 |
Here az.compare confirms the hand calculation and ranks the jump model first (its
table rounds the scores for display).
113.4.5. Accommodating the time series structure#
Leave-one-out drops one transition at a time, but it does not respect the order of time.
When it scores the step from \(u_t\) to \(u_{t+1}\), the model doing the scoring was fit on data lying on both sides of that step — including the neighboring values \(u_t\) and \(u_{t+1}\) themselves.
So the held-out point was never truly unseen.
The formally correct measure is leave-future-out cross-validation.
It only ever predicts forward: fit the model on \(u_1, \dots, u_t\), score its one-step-ahead forecast of \(u_{t+1}\), then expand the window by one step and repeat.
Now the model is genuinely blind to the future it is asked to predict, exactly as in real forecasting.
The price is computational.
Leave-one-out reused a single fit through its importance-sampling shortcut, but leave-future-out has no such trick.
For our short annual series leave-future-out cross-validation is still feasible but for a long series it can become prohibitive.
(We ran it without including it here and the conclusion holds: under leave-future-out the jump model still beats the linear one, by an even larger margin.)
113.5. Conclusion#
In A Linear Model of Unemployment we applied a linear AR(1) model with Gaussian shocks to unemployment and found high levels of persistence in monthly data.
We also argued that the model is overly simplistic.
Here we found that the feature the linear model most conspicuously misses — the asymmetry of recessions — can be addressed by modeling the distribution of the shocks.
A linear reversion with large, one-sided innovations matches the right-skewed, heavy-tailed shocks that the Gaussian model could not.
Cross-validation prefers it.
An even better model than the one considered above would let the economy switch between persistent expansion and recession regimes.
This is the Markov-switching approach, which we leave to further reading.
113.6. Exercises#
Exercise 113.1
The jump component here is symmetric within itself, \(N(\mu_J, \sigma_J^2)\).
Replace the two-component mixture with a single skew-normal (or Student-\(t\) with a skew) innovation, refit, and compare to the mixture model with az.compare.
Does the simpler skewed shock do as well as the mixture?