<p><em>A brief primer on Variational Inference</em>. Fabian Dablander, 2019-10-30. <a href="https://fabiandablander.com/r/Variational-Inference">https://fabiandablander.com/r/Variational-Inference</a></p>
<link rel="stylesheet" href="../highlight/styles/default.css" />
<script src="../highlight/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script>$('pre.stan code').each(function(i, block) {hljs.highlightBlock(block);});</script>
<p>Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let’s start unpacking!</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>Bayes’ rule states that</p>
<script type="math/tex; mode=display">\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,</script>
<p>where $\mathbf{z}$ denotes latent parameters we want to infer and $\mathbf{x}$ denotes data.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> Bayes’ rule is, in general, difficult to apply because it requires dealing with a potentially high-dimensional integral, the marginal likelihood. Optimization, which involves taking derivatives instead of integrating, is much <a href="https://xkcd.com/2117/">easier</a> and generally faster than integration, and so our goal will be to reframe this integration problem as one of optimization.</p>
<h1 id="variational-objective">Variational objective</h1>
<p>We want to get at the posterior distribution, but instead of sampling we simply try to find a density $q^\star(\mathbf{z})$ from a family of densities $\mathrm{Q}$ that best approximates the posterior distribution:</p>
<script type="math/tex; mode=display">q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,</script>
<p>where $\text{KL}(. \lvert \lvert.)$ denotes the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>:</p>
<script type="math/tex; mode=display">\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \mathrm{d}\mathbf{z} \enspace .</script>
<p>We cannot compute this Kullback-Leibler divergence because it still depends on the nasty integral $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$. To see this dependency, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace ,
\end{aligned} %]]></script>
<p>where we have expanded the expectation to more clearly behold our nemesis. In doing so, we have seen that $\text{log } p(\mathbf{x})$ is actually a constant with respect to $q(\mathbf{z})$; this means that we can ignore it in our optimization problem. Moreover, minimizing a quantity means maximizing its negative, and so we maximize the following quantity:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x}) \right) \\[.5em]
&= -\left(\mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x}) - \text{log } p(\mathbf{x})}_{\text{Nemesis perishes}}\right) \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \enspace .
\end{aligned} %]]></script>
<p>We can expand the joint probability to get more insight into this equation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \underbrace{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z})\right]}_{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]} - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z})}{q(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z})\right) \enspace .
\end{aligned} %]]></script>
<p>This is cool. It says that maximizing the ELBO finds an approximate distribution $q(\mathbf{z})$ for latent quantities $\mathbf{z}$ that allows the data to be predicted well, i.e., leads to a high expected log likelihood, but that a penalty is incurred if $q(\mathbf{z})$ strays far away from the prior $p(\mathbf{z})$. This mirrors the usual balance in Bayesian inference between likelihood and prior (Blei, Kucukelbir, &amp; McAuliffe, 2017).</p>
<p>ELBO stands for <em>evidence lower bound</em>. The marginal likelihood is sometimes called evidence, and we see that ELBO is indeed a lower bound for the evidence:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x})\right) \\[.5em]
\text{log } p(\mathbf{x}) &= \text{ELBO}(q) + \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \\[.5em]
\text{log } p(\mathbf{x}) &\geq \text{ELBO}(q) \enspace ,
\end{aligned} %]]></script>
<p>since the Kullback-Leibler divergence is non-negative. Heuristically, one might then use the ELBO as a way to select between models. For more on predictive model selection, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this</a> and <a href="https://fabiandablander.com/r/Bayes-Potter.html">this</a> blog post.</p>
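<p>The identity $\text{log } p(\mathbf{x}) = \text{ELBO}(q) + \text{KL}$ can be checked numerically in a toy model where every term has a closed form. The sketch below is not part of the derivation above. It assumes a single observation $x \sim \mathcal{N}(z, 1)$ with prior $z \sim \mathcal{N}(0, 1)$, so that the posterior is $\mathcal{N}(x / 2, 1 / 2)$ and the evidence is $\mathcal{N}(x \mid 0, 2)$, and it picks an arbitrary Gaussian $q(z) = \mathcal{N}(m, s^2)$. The names <code>kl_gauss</code>, <code>m</code>, and <code>s2</code> are illustrative.</p>
<pre class="r"><code># KL(N(m1, v1) || N(m2, v2)) for univariate Gaussians (v denotes a variance)
kl_gauss = function(m1, v1, m2, v2) {
  0.5 * (log(v2 / v1) + (v1 + (m1 - m2)^2) / v2 - 1)
}

x = 1      # the single observed data point (assumed)
m = 0.3    # mean of the variational density q (arbitrary)
s2 = 0.64  # variance of q (arbitrary)

# Expected log likelihood under q: E_q[log N(x | z, 1)]
exp_loglik = -0.5 * log(2 * pi) - 0.5 * ((x - m)^2 + s2)

# ELBO = expected log likelihood - KL(q || prior), with prior N(0, 1)
elbo = exp_loglik - kl_gauss(m, s2, 0, 1)

# KL(q || posterior), with posterior N(x / 2, 1 / 2)
kl_post = kl_gauss(m, s2, x / 2, 0.5)

# Log evidence: marginally, x follows N(0, 2)
log_evidence = dnorm(x, mean = 0, sd = sqrt(2), log = TRUE)

print(c(elbo = elbo, kl = kl_post, sum = elbo + kl_post, log_evidence = log_evidence))
</code></pre>
<p>The ELBO and the Kullback-Leibler divergence add up to the log evidence. Setting $m = x / 2$ and $s^2 = 1 / 2$ makes $q$ equal to the posterior, in which case the divergence vanishes and the bound is tight.</p>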
<h1 id="why-variational">Why variational?</h1>
<p>Our optimization problem is about finding $q^\star(\mathbf{z})$ that best approximates the posterior distribution. This is in contrast to more familiar optimization problems such as maximum likelihood estimation where one wants to find, for example, the <em>single best value</em> that maximizes the log likelihood. For such a problem, one can use standard calculus (see for example <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">this</a> blog post). In our setting, we do not want to find a single best value but rather a <em>single best function</em>. To do this, we can use <em>variational calculus</em> from which variational inference derives its name (Bishop, 2006, p. 462).</p>
<p>A function takes an input value and returns an output value. We can define a <em>functional</em> which takes a whole function and returns an output value. The <em>entropy</em> of a probability distribution is a widely used functional:</p>
<script type="math/tex; mode=display">\text{H}[p] = -\int p(x) \, \text{log } p(x) \, \mathrm{d} x \enspace ,</script>
<p>which takes as input the probability distribution $p(x)$ and returns a single value, its entropy. In variational inference, we want to find the function that maximizes the ELBO, which is a functional.</p>
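<p>To make the notion of a functional concrete, here is a minimal sketch in R: a function that accepts another function and returns a number. It uses the usual convention $\text{H}[p] = -\int p(x) \, \text{log } p(x) \, \mathrm{d}x$ and base R's <code>integrate</code>. The name <code>entropy</code> and the choice to pass the log density rather than the density are assumptions made for numerical convenience.</p>
<pre class="r"><code># A functional: takes a whole (log) density function and returns one number.
# Working with the log density avoids evaluating log(0) in the far tails.
entropy = function(log_p, lower = -Inf, upper = Inf) {
  integrand = function(x) exp(log_p(x)) * log_p(x)
  -integrate(integrand, lower, upper)$value
}

# Entropy of a standard normal, numerically and in closed form
h_numeric = entropy(function(x) dnorm(x, log = TRUE))
h_exact = 0.5 * log(2 * pi * exp(1))

print(c(numeric = h_numeric, exact = h_exact))
</code></pre>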
<p>In order to make this optimization problem more manageable, we need to constrain the functions in some way. One could, for example, assume that $q(\mathbf{z})$ is a Gaussian distribution with parameter vector $\omega$. The ELBO then becomes a function of $\omega$, and we employ standard optimization methods to solve this problem. Instead of restricting the parametric form of the variational distribution $q(\mathbf{z})$, in the next section we use an independence assumption to manage the inference problem.</p>
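<p>The parametric route can be sketched as follows. The example assumes a toy model that is not used elsewhere in this post, $x \sim \mathcal{N}(z, 1)$ with $z \sim \mathcal{N}(0, 1)$ and a single observation, and a Gaussian $q(z) = \mathcal{N}(m, s^2)$ with $\omega = (m, \text{log } s)$. The ELBO is then an ordinary function of $\omega$ that can be handed to base R's <code>optim</code>. Because the exact posterior $\mathcal{N}(x / 2, 1 / 2)$ is itself Gaussian, the optimizer should recover it.</p>
<pre class="r"><code>x = 1  # single observation from the assumed toy model x ~ N(z, 1), z ~ N(0, 1)

# ELBO as a function of omega = (m, log s) for q(z) = N(m, s^2);
# the log scale keeps the standard deviation positive during optimization
elbo = function(omega) {
  m = omega[1]
  s2 = exp(2 * omega[2])
  exp_loglik = -0.5 * log(2 * pi) - 0.5 * ((x - m)^2 + s2)  # E_q[log p(x | z)]
  exp_logprior = -0.5 * log(2 * pi) - 0.5 * (m^2 + s2)      # E_q[log p(z)]
  entropy_q = 0.5 * log(2 * pi * exp(1) * s2)               # -E_q[log q(z)]
  exp_loglik + exp_logprior + entropy_q
}

# optim() minimizes, so we pass the negative ELBO
fit = optim(c(0, 0), function(omega) -elbo(omega), method = "BFGS")

m_hat = fit$par[1]
s2_hat = exp(2 * fit$par[2])

# The exact posterior has mean x / 2 and variance 1 / 2
print(c(m_hat = m_hat, s2_hat = s2_hat))
</code></pre>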
<h1 id="mean-field-variational-family">Mean-field variational family</h1>
<p>A frequently used approximation is to assume that the latent variables $z_j$ for $j \in \{1, \ldots, m\}$ are mutually independent, each governed by its own variational density:</p>
<script type="math/tex; mode=display">q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j) \enspace .</script>
<p>Note that this <em>mean-field variational family</em> cannot model correlations in the posterior distribution; by construction, the latent parameters are mutually independent. Observe that we do not make any parametric assumption about the individual $q_j(z_j)$. Instead, their parametric form is derived for every particular inference problem.</p>
<p>We start from our definition of the ELBO and apply the mean-field assumption:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z}\enspace .
\end{aligned} %]]></script>
<p>In the following, we optimize the ELBO with respect to a single variational density $q_j(z_j)$ and assume that all others are fixed:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q_j) &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j - \underbrace{\int \prod_{i\neq j}^m q_i(z_i) \, \text{log} \prod_{i\neq j}^m q_i(z_i) \, \mathrm{d}\mathbf{z}_{-j}}_{\text{Constant with respect to } q_j(z_j)} \\[.5em]
&\propto \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \left(\int \prod_{i\neq j}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z}_{-j}\right)\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \enspace .
\end{aligned} %]]></script>
<p>One could use variational calculus to derive the optimal variational density $q_j^\star(z_j)$; instead, we follow Bishop (2006, p. 465) and define the distribution</p>
<script type="math/tex; mode=display">\text{log } \tilde{p}{(\mathbf{x}, z_j)} = \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathcal{Z} \enspace ,</script>
<p>where we need to make sure that it integrates to one by subtracting the (log) normalizing constant $\mathcal{Z}$. With this in mind, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q_j) &\propto \int q_j(z_j) \, \text{log } \tilde{p}{(\mathbf{x}, z_j)} \, \mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \text{log } \frac{\tilde{p}{(\mathbf{x}, z_j)}}{q_j(z_j)} \, \mathrm{d}z_j \\[.5em]
&= -\int q_j(z_j) \, \text{log } \frac{q_j(z_j)}{\tilde{p}{(\mathbf{x}, z_j)}} \, \mathrm{d}z_j \\[.5em]
&= -\text{KL}\left(q_j(z_j) \, \lvert\lvert \, \tilde{p}(\mathbf{x}, z_j)\right) \enspace .
\end{aligned} %]]></script>
<p>Thus, maximizing the ELBO with respect to $q_j(z_j)$ is minimizing the Kullback-Leibler divergence between $q_j(z_j)$ and $\tilde{p}(\mathbf{x}, z_j)$; this divergence is zero when the two distributions are equal. Therefore, under the mean-field assumption, the optimal variational density $q_j^\star(z_j)$ is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q_j^\star(z_j) &= \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right] - \mathcal{Z}\right) \\[.5em]
&= \frac{\text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right)}{\int \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right) \mathrm{d}z_j} \enspace ,
\end{aligned} %]]></script>
<p>see also Bishop (2006, p. 466). This is not an explicit solution, however, since each optimal variational density depends on all others. This calls for an iterative solution in which we first initialize all factors $q_j(z_j)$ and then cycle through them, updating each one conditional on the current values of the others. Such a procedure is known as <em>Coordinate Ascent Variational Inference</em> (CAVI). Further, note that</p>
<script type="math/tex; mode=display">p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) = \frac{p(z_j, \mathbf{z}_{-j}, \mathbf{x})}{p(\mathbf{z}_{-j}, \mathbf{x})} \propto p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \enspace ,</script>
<p>which allows us to write the updates in terms of the conditional posterior distribution of $z_j$ given all other factors $\mathbf{z}_{-j}$. This looks <em>a lot</em> like Gibbs sampling, which we discussed in detail in a <a href="https://fabiandablander.com/r/Spike-and-Slab.html">previous</a> blog post. In the next section, we implement CAVI for a simple linear regression problem.</p>
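<p>Before turning to the regression model, the cycling scheme can be sketched on a target for which the optimal factors are known. The example assumes a bivariate Gaussian target with mean $\mu$ and precision matrix $\Lambda$, the textbook case treated in Bishop (2006, Chapter 10); the particular values of <code>mu</code> and <code>Sigma</code> are made up. For this target each optimal factor is Gaussian with variance $1 / \Lambda_{jj}$ and a mean that depends on the current mean of the other factor.</p>
<pre class="r"><code># Assumed target: bivariate Gaussian with correlation 0.8
mu = c(1, -1)
Sigma = matrix(c(1, 0.8, 0.8, 1), nrow = 2)
Lambda = solve(Sigma)  # precision matrix

# Means of the two variational factors, initialized arbitrarily
m = c(0, 0)

# Coordinate ascent: update one factor while holding the other fixed
for (iter in 1:50) {
  m[1] = mu[1] - Lambda[1, 2] / Lambda[1, 1] * (m[2] - mu[2])
  m[2] = mu[2] - Lambda[2, 1] / Lambda[2, 2] * (m[1] - mu[1])
}

# Variances of the optimal factors do not change across iterations
q_var = 1 / diag(Lambda)

print(rbind(target_mean = mu, cavi_mean = m))
print(rbind(target_var = diag(Sigma), cavi_var = q_var))
</code></pre>
<p>The factor means converge to the target mean, but the factor variances (0.36) are smaller than the marginal variances of the target (1). This is the price of the mean-field assumption when the target is correlated.</p>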
<h1 id="application-linear-regression">Application: Linear regression</h1>
<p>In a <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">previous</a> blog post, we traced the history of least squares and applied it to the most basic problem: fitting a straight line to a number of points. Here, we study the same problem but swap optimization procedure: instead of least squares or maximum likelihood, we use variational inference. Our linear regression setup is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
y &\sim \mathcal{N}(\beta x, \sigma^2) \\[.5em]
\beta &\sim \mathcal{N}(0, \sigma^2 \tau^2) \\[.5em]
p(\sigma^2) &\propto \frac{1}{\sigma^2} \enspace ,
\end{aligned} %]]></script>
<p>where we assume that the population mean of $y$ is zero (i.e., $\beta_0 = 0$); and we assign the error variance $\sigma^2$ an improper Jeffreys’ prior and $\beta$ a Gaussian prior with variance $\sigma^2\tau^2$. We scale the prior of $\beta$ by the error variance to reason in terms of a standardized effect size $\beta / \sigma$ since with this specification:</p>
<script type="math/tex; mode=display">\text{Var}\left[\frac{\beta}{\sigma}\right] = \frac{1}{\sigma^2} \text{Var}[\beta] = \frac{\sigma^2 \tau^2}{\sigma^2} = \tau^2 \enspace .</script>
<p>As a heads up, we have to do a surprising number of calculations to implement variational inference even for this simple problem. In the next section, we start our journey by deriving the variational density for $\sigma^2$.</p>
<h2 id="variational-density-for-sigma2">Variational density for $\sigma^2$</h2>
<p>Our optimal variational density $q^\star(\sigma^2)$ is given by:</p>
<script type="math/tex; mode=display">q^\star(\sigma^2) \propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \enspace .</script>
<p>To get started, we need to derive the conditional posterior distribution $p(\sigma^2 \mid \mathbf{y}, \beta)$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\sigma^2 \mid \mathbf{y}, \beta) &\propto p(\mathbf{y} \mid \sigma^2, \beta) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= \prod_{i=1}^n (2\pi)^{-\frac{1}{2}} \left(\sigma^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2} \left(y_i - \beta x_i\right)^2\right) \underbrace{(2\pi)^{-\frac{1}{2}} \left(\sigma^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}_{p(\beta)} \underbrace{\left(\sigma^2\right)^{-1}}_{p(\sigma^2)} \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{2\sigma^2} \underbrace{\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)}_{A}\right) \enspace ,
\end{aligned} %]]></script>
<p>which is proportional to an inverse Gamma distribution. Moving on, we exploit the linearity of the expectation and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q^\star(\sigma^2) &\propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1}\right] - \mathbb{E}_{q(\beta)}\left[\frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \\[.5em]
&= \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \enspace .
\end{aligned} %]]></script>
<p>This, too, looks like an inverse Gamma distribution! Plugging in the normalizing constant, we arrive at:</p>
<script type="math/tex; mode=display">q^\star(\sigma^2)= \frac{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \underbrace{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]}_{\nu}\right) \enspace .</script>
<p>Note that this density depends on $q(\beta)$ through the expectation $\nu = \mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]$. In the next section, we derive the variational density for $\beta$.</p>
<h2 id="variational-density-for-beta">Variational density for $\beta$</h2>
<p>Our optimal variational density $q^\star(\beta)$ is given by:</p>
<script type="math/tex; mode=display">q^\star(\beta) \propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \enspace ,</script>
<p>and so we again have to derive the conditional posterior distribution $p(\beta \mid \mathbf{y}, \sigma^2)$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \sigma^2) &\propto p(\mathbf{y} \mid \beta, \sigma^2) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^ny_i^2- 2 \beta \sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{1}{2\sigma^2} \left( \beta^2 \left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) - 2 \beta \sum_{i=1}^n y_i x_i\right)\right) \\[.5em]
&=\text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta^2 - 2 \beta \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned} %]]></script>
<p>where we have “completed the square” (see also <a href="https://fabiandablander.com/statistics/Two-Properties.html">this</a> blog post) and realized that the conditional posterior is Gaussian. We continue by taking expectations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q^\star(\beta) &\propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right]\right) \\[.5em]
&= \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned} %]]></script>
<p>which is again proportional to a Gaussian distribution! Plugging in the normalizing constant yields:</p>
<script type="math/tex; mode=display">q^\star(\beta) = \left(2\pi\underbrace{\frac{\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}}_{\sigma^2_{\beta}}\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left(\beta - \underbrace{\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}}_{\mu_{\beta}}\right)^2\right) \enspace ,</script>
<p>Note that while the variance of this distribution, $\sigma^2_\beta$, depends on $q(\sigma^2)$, its mean $\mu_\beta$ does not.</p>
<p>To recap, instead of assuming a parametric form for the variational densities, we have derived the optimal densities under the mean-field assumption, that is, under the assumption that the parameters are independent: $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$. Assigning $\beta$ a Gaussian prior and $\sigma^2$ a Jeffreys’ prior, we have found that the variational density for $\sigma^2$ is an inverse Gamma distribution and that the variational density for $\beta$ is a Gaussian distribution. We noted that these variational densities depend on each other. However, this is not the end of the manipulation of symbols; both distributions still feature an expectation we need to remove. In the next section, we expand the remaining expectations.</p>
<h2 id="removing-expectations">Removing expectations</h2>
<p>Now that we know the parametric form of both variational densities, we can expand the terms that involve an expectation. In particular, for the variational density $q^\star(\sigma^2)$ we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta)}\left[A \right] &= \mathbb{E}_{q(\beta)}\left[\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right] \\[.5em]
&= \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mathbb{E}_{q(\beta)}\left[\beta\right] + \sum_{i=1}^n x_i^2 \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] + \frac{1}{\tau^2} \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] \enspace .
\end{aligned} %]]></script>
<p>Noting that $\mathbb{E}_{q(\beta)}[\beta] = \mu_{\beta}$ and using the fact that:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}[\beta^2] = \text{Var}_{q(\beta)}\left[\beta\right] + \mathbb{E}_{q(\beta)}[\beta]^2
= \sigma^2_{\beta} + \mu_{\beta}^2 \enspace ,</script>
<p>the expectation becomes:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}\left[A\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) \enspace .</script>
<p>For the expectation which features in the variational distribution for $\beta$, things are slightly less elaborate, although the result also looks unwieldy. Note that since $\sigma^2$ follows an inverse Gamma distribution, $1 / \sigma^2$ follows a Gamma distribution which has mean:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] &= \frac{n + 1}{2} \left(\frac{1}{2}\mathbb{E}_{q(\beta)}\left[A \right]\right)^{-1} \\[.5em]
&= \frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \enspace .
\end{aligned} %]]></script>
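<p>With both expectations expanded, the coordinate updates reduce to a few lines of arithmetic. The following is a minimal sketch of just these fixed-point updates, not a full implementation: it runs a fixed number of sweeps and does not monitor the ELBO. The simulated data, the value <code>tau2 = 0.25</code>, the starting value for $\mathbb{E}_{q(\sigma^2)}[1 / \sigma^2]$, and the function name <code>cavi_lm</code> are all assumptions made for illustration.</p>
<pre class="r"><code>set.seed(1)

# Simulated data (assumed): n observations with true slope 0.3 and unit error variance
n = 100
x = rnorm(n)
y = 0.3 * x + rnorm(n)
tau2 = 0.25  # prior scale for beta (assumed)

cavi_lm = function(y, x, tau2, n_iter = 25) {
  n = length(y)
  sum_x2 = sum(x^2)
  sum_xy = sum(x * y)
  sum_y2 = sum(y^2)
  prec_scale = sum_x2 + 1 / tau2

  # The mean of q(beta) does not depend on q(sigma^2), so it is computed once
  mu_beta = sum_xy / prec_scale

  # Starting value for E[1 / sigma^2] (arbitrary)
  E_inv_sigma2 = 1

  for (iter in seq_len(n_iter)) {
    # Update q(beta): variance given the current E[1 / sigma^2]
    sigma2_beta = 1 / (E_inv_sigma2 * prec_scale)

    # Update q(sigma^2): nu = E[A] / 2 given the current q(beta)
    E_A = sum_y2 - 2 * sum_xy * mu_beta + (sigma2_beta + mu_beta^2) * prec_scale
    nu = E_A / 2

    # Mean of the Gamma distribution of 1 / sigma^2: shape / rate
    E_inv_sigma2 = ((n + 1) / 2) / nu
  }

  list(mu_beta = mu_beta, sigma2_beta = sigma2_beta, shape = (n + 1) / 2, nu = nu)
}

fit = cavi_lm(y, x, tau2)
print(unlist(fit))
</code></pre>
<p>Because $\mu_\beta$ never changes, the loop is effectively a one-dimensional fixed-point iteration in $\nu$, and it settles after a handful of sweeps.</p>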
<h2 id="monitoring-convergence">Monitoring convergence</h2>
<p>The algorithm works by first specifying initial values for the parameters of the variational densities and then iteratively updating them until the ELBO does not change anymore. This requires us to compute the ELBO, which we still need to derive, on each update. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y}, \beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2) \right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2)\right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \underbrace{\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right]}_{-\text{KL}\left(q(\beta, \sigma^2) \, \lvert\lvert \, p(\beta, \sigma^2)\right)}\enspace .
\end{aligned} %]]></script>
<p>Let’s take a deep breath and tackle the second term first:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{p(\beta \mid \sigma^2)}{q(\beta)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\left(2\pi\sigma^2\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}{\left(2\pi\sigma^2_\beta\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2_\beta} (\beta - \mu_\beta)^2\right)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[-\frac{1}{2}\text{log } \frac{\sigma^2\tau^2}{\sigma^2_\beta} - \frac{\beta^2}{2\sigma^2\tau^2} + \frac{(\beta - \mu_\beta)^2}{2\sigma^2_\beta}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[-\frac{1}{2}\text{log}\frac{\sigma^2\tau^2}{\sigma^2_\beta} - \frac{\sigma^2_\beta + \mu_\beta^2}{2\sigma^2\tau^2} + \frac{1}{2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= -\frac{1}{2}\text{log}\frac{\tau^2}{\sigma^2_\beta} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= -\frac{1}{2}\text{log}\frac{\tau^2}{\sigma^2_\beta} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right] - \mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]\enspace .
\end{aligned} %]]></script>
<p>Note that there are three expectations left to evaluate, since $\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]$ was already derived above. However, we really deserve a break, and so instead of analytically deriving the expectations we compute $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$ and $\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right]$ numerically using Gaussian quadrature. This fails for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$, which we compute using Monte Carlo integration:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right] = \int q(\sigma^2) \, \text{log } q(\sigma^2) \, \mathrm{d}\sigma^2 \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\text{log } q(\sigma^2_i)}_{\sigma^2_i \, \sim \, q(\sigma^2)} \enspace .</script>
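<p>The Monte Carlo estimate can be sketched in a few lines. The values of <code>shape</code> and <code>nu</code> below are assumed for illustration and do not come from a fitted model. Draws from the inverse Gamma density are obtained by inverting Gamma draws. As a sanity check only, the sketch also evaluates the standard closed-form entropy of the inverse Gamma distribution, $a + \text{log}\left(\nu \, \Gamma(a)\right) - (1 + a) \, \psi(a)$ for shape $a$ and scale $\nu$, whose negative equals the expectation we are after.</p>
<pre class="r"><code>set.seed(1)

shape = 50.5  # (n + 1) / 2 for n = 100 (assumed)
nu = 48       # scale parameter of q(sigma^2) (assumed value)

# Log density of an inverse Gamma distribution with the given shape and scale
log_q = function(sigma2) {
  shape * log(nu) - lgamma(shape) - (shape + 1) * log(sigma2) - nu / sigma2
}

# If 1 / sigma^2 follows Gamma(shape, rate = nu), sigma^2 follows inverse Gamma(shape, nu)
N = 1e5
sigma2_draws = 1 / rgamma(N, shape = shape, rate = nu)
mc_estimate = mean(log_q(sigma2_draws))

# Closed-form value (negative entropy of the inverse Gamma) for comparison
exact = -(shape + log(nu) + lgamma(shape) - (shape + 1) * digamma(shape))

print(c(monte_carlo = mc_estimate, exact = exact))
</code></pre>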
<p>We are left with the expected log likelihood. Instead of filling this blog post with more equations, we again resort to numerical methods. However, we refactor the expression so that numerical integration is more efficient:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] &= \int \int q(\beta) \, q(\sigma^2) \, \text{log } p(\mathbf{y} \mid \beta, \sigma^2) \, \mathrm{d}\sigma^2 \, \mathrm{d}\beta \\[.5em]
&=\int q(\beta) \int q(\sigma^2) \, \text{log} \left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\text{exp}\left(-\frac{1}{2\sigma^2}
\sum_{i=1}^n (y_i - x_i\beta)^2\right)\right) \, \mathrm{d}\sigma^2 \, \mathrm{d}\beta \\[.5em]
&= -\frac{n}{2} \text{log}\left(2\pi\right) - \frac{n}{2}\int q(\sigma^2) \, \text{log} \left(\sigma^2\right) \, \mathrm{d}\sigma^2 - \frac{1}{2}\int q(\beta) \left(\sum_{i=1}^n (y_i - x_i\beta)^2\right) \, \mathrm{d}\beta\int q(\sigma^2) \, \frac{1}{\sigma^2} \, \mathrm{d}\sigma^2 \enspace .
\end{aligned} %]]></script>
<p>Since we have already solved a similar problem above, we evaluate the expectation with respect to $q(\beta)$ analytically:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right) \enspace .</script>
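<p>One way to convince oneself of this identity is to compare it against numerical integration. The sketch below does so on a small simulated data set; the data and the values of <code>mu_beta</code> and <code>sd_beta</code> are assumptions chosen purely for illustration.</p>

```r
# Sketch: check the analytic expectation under q(beta) against numerical integration.
# The data and the variational parameters are illustrative assumptions.
set.seed(1)
x <- rnorm(20)
y <- 0.3 * x + rnorm(20)
mu_beta <- 0.28
sd_beta <- 0.11

# Analytic result, using E[beta] = mu_beta and E[beta^2] = sd_beta^2 + mu_beta^2
analytic <- sum(y^2) - 2 * sum(y * x) * mu_beta + (sd_beta^2 + mu_beta^2) * sum(x^2)

# Sum of squared residuals, vectorized over beta as integrate() requires
sum_sq <- function(beta) vapply(beta, function(b) sum((y - x * b)^2), numeric(1))

# Integrate over mu_beta +/- 8 standard deviations; the mass outside is negligible
numeric_value <- integrate(
  function(beta) dnorm(beta, mu_beta, sd_beta) * sum_sq(beta),
  lower = mu_beta - 8 * sd_beta, upper = mu_beta + 8 * sd_beta
)$value

c(analytic = analytic, numeric = numeric_value)
```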
<!-- Piecing it all together, the ELBO is given by: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{ELBO}(\mu_\beta, \sigma_\beta^2, \tau^2, \tau^2) &= \frac{n}{4} \text{log}\left(2\pi\right)\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right)\right)\mathbb{E}_{q(\sigma^2)}\left[\text{log} \left(\sigma^2\right)\frac{1}{\sigma^2}\right]\\[.5em] -->
<!-- &+ \text{log}\frac{\tau^2}{\sigma^2_\beta}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] + \frac{\sigma^2_\beta + \mu_\beta^2}{\tau^2}\frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \\[.5em] -->
<!-- &+ \text{log }\frac{\Gamma\left(\frac{n + 1}{2}\right)}{\nu^{\frac{n + 1}{2}}} + \left(\frac{n + 1}{2}\right)\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right] - \frac{1}{\frac{n + 1}{2} - 1} \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<p>In the next section, we implement the algorithm for our linear regression problem in R.</p>
<h1 id="implementation-in-r">Implementation in R</h1>
<p>Now that we have derived the optimal densities, we know how they are parameterized. Therefore, the ELBO is a function of these variational parameters and the parameters of the priors, which in our case is just $\tau^2$. We write a function that computes the ELBO:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'MCMCpack'</span><span class="p">)</span><span class="w">
</span><span class="cd">#' Computes the ELBO for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
</span><span class="cd">#' @param beta_mu mean of the variational density for \beta</span><span class="w">
</span><span class="cd">#' @param beta_sd standard deviation of the variational density for \beta</span><span class="w">
</span><span class="cd">#' @param nu parameter of the variational density for \sigma^2</span><span class="w">
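</span><span class="cd">#' @param tau2 prior variance for the standardized effect size</span><span class="w">
</span><span class="cd">#' @param nr_samples number of samples for the Monte Carlo integration</span><span class="w">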
</span><span class="cd">#' @returns ELBO</span><span class="w">
</span><span class="n">compute_elbo</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e4</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\beta)</span><span class="w">
</span><span class="n">E_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="o">-</span><span class="kc">Inf</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\sigma^2)</span><span class="w">
</span><span class="n">E_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\sigma^2)</span><span class="w">
</span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\beta \mid \sigma^2)</span><span class="w">
</span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">tau2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># Compute expectations of the log variational densities q(\beta)</span><span class="w">
</span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_beta</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">log</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="c1"># E_log_q_sigma2 <- E_q_sigma2(function(x) log(dinvgamma(x, (n + 1)/2, nu))) # fails</span><span class="w">
</span><span class="c1"># Compute expectations of the log variational densities q(\sigma^2)</span><span class="w">
</span><span class="n">sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rinvgamma</span><span class="p">(</span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)))</span><span class="w">
</span><span class="c1"># Compute the expected log likelihood</span><span class="w">
</span><span class="n">E_log_y_b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="n">sum_x2</span><span class="w">
</span><span class="n">E_log_y_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_y_b</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_y_sigma2</span><span class="w">
</span><span class="c1"># Compute and return the ELBO</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_log_y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_sigma2</span><span class="w">
</span><span class="n">ELBO</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>We now implement coordinate ascent mean-field variational inference for our simple linear regression problem. Recall that the variational parameters are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nu &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right) \\[.5em]
\mu_\beta &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\sigma^2_\beta &= \frac{\left(\frac{n + 1}{2}\right) \nu^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \enspace .
\end{aligned} %]]></script>
<p>Since $\mu_\beta$ does not depend on the other variational parameters, it is computed once. The following function then iteratively updates $\nu$ and $\sigma^2_\beta$ until the ELBO has converged.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Implements CAVI for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
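</span><span class="cd">#' @param tau2 prior variance for the standardized effect size</span><span class="w">
</span><span class="cd">#' @param nr_samples number of samples for the Monte Carlo integration in the ELBO</span><span class="w">
</span><span class="cd">#' @param epsilon convergence threshold for the absolute change in the ELBO</span><span class="w">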
</span><span class="cd">#' @returns parameters for the variational densities and ELBO</span><span class="w">
</span><span class="n">lmcavi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e5</span><span class="p">,</span><span class="w"> </span><span class="n">epsilon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e-2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># beta_mu has a closed-form solution and is not updated iteratively</span><span class="w">
</span><span class="n">beta_mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_yx</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_mu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">beta_mu</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">has_converged</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">epsilon</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="c1"># while the ELBO has not converged</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">has_converged</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]][</span><span class="n">j</span><span class="p">],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">nu_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="n">beta_sd_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="c1"># used in the update of beta_sd and nu</span><span class="w">
</span><span class="n">E_qA</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd_prev</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="c1"># update the variational parameters for sigma2 and beta</span><span class="w">
</span><span class="n">nu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_qA</span><span class="w">
</span><span class="n">beta_sd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(((</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">E_qA</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">))</span><span class="w">
</span><span class="c1"># update results object</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]],</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]],</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute new ELBO</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
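<p>Because part of the ELBO is estimated by Monte Carlo, its value is slightly noisy from one iteration to the next. As a cross-check, the same updates can be iterated until the variational parameters themselves stop changing. The sketch below is one way to do this; <code>cavi_fixed_point</code>, <code>tol</code>, and <code>max_iter</code> are hypothetical names introduced for illustration, and the update equations are the ones stated above.</p>

```r
# Sketch: iterate the updates for nu and beta_sd until beta_sd stops changing.
# Function and argument names are hypothetical; the updates mirror those above.
cavi_fixed_point <- function(y, x, tau2, tol = 1e-10, max_iter = 100) {
  n <- length(y)
  prec <- sum(x^2) + 1 / tau2
  beta_mu <- sum(x * y) / prec  # closed form, not iterated
  beta_sd <- 1                  # arbitrary starting value

  for (iter in seq_len(max_iter)) {
    E_qA <- sum(y^2) - 2 * sum(x * y) * beta_mu + (beta_sd^2 + beta_mu^2) * prec
    nu <- E_qA / 2
    beta_sd_new <- sqrt(((n + 1) / 2) / nu / prec)

    converged <- abs(beta_sd_new - beta_sd) < tol
    beta_sd <- beta_sd_new
    if (converged) break
  }

  list(nu = nu, beta_mu = beta_mu, beta_sd = beta_sd, iterations = iter)
}

# Simulated data, assumed here only to make the sketch self-contained
set.seed(1)
x <- rnorm(100)
y <- 0.30 * x + rnorm(100)
fit <- cavi_fixed_point(y, x, tau2 = 0.50^2)
fit
```

<p>Monitoring the parameters avoids the sampling step entirely, at the cost of no longer tracking the objective that is being maximized.</p>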
<p>Let’s run this on a simulated data set of size $n = 100$ with a true coefficient of $\beta = 0.30$ and a true error variance of $\sigma^2 = 1$. We assign $\beta$ a Gaussian prior with variance $\tau^2 = 0.25$ so that values for $\lvert \beta \rvert$ smaller than one standard deviation ($0.50$) <a href="https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule">receive about $0.68$</a> prior probability.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gen_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="o">*</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gen_dat</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmcavi</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## $nu
## [1] 5.00000 88.17995 45.93875 46.20205 46.19892 46.19895
##
## $beta_mu
## [1] 0.2800556
##
## $beta_sd
## [1] 1.00000000 0.08205605 0.11368572 0.11336132 0.11336517 0.11336512
##
## $ELBO
## [1] 0.0000 -297980.0495 493.4807 -281.4578 -265.1289
## [6] -265.3197</code></pre></figure>
<p>From the output, we see that the ELBO and the variational parameters have converged. In the next section, we compare these results to results obtained with Stan.</p>
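<p>One simple way to make “converged” concrete is to look at how much each quantity changed in the final iteration. The sketch below copies the values from the printed output rather than recomputing them; <code>last_change</code> is a hypothetical helper and not part of <code>lmcavi</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Values copied from the printed output above (not recomputed here)
elbo    <- c(0.0000, -297980.0495, 493.4807, -281.4578, -265.1289, -265.3197)
nu      <- c(5.00000, 88.17995, 45.93875, 46.20205, 46.19892, 46.19895)
beta_sd <- c(1.00000000, 0.08205605, 0.11368572, 0.11336132, 0.11336517, 0.11336512)

# Hypothetical helper: absolute change between the last two iterations
last_change <- function(v) abs(diff(tail(v, 2)))

changes <- sapply(list(ELBO = elbo, nu = nu, beta_sd = beta_sd), last_change)
changes</code></pre></figure>
<p>The variational parameters move by less than $10^{-4}$ in the last step, while the ELBO still changes by about $0.19$ and even decreases slightly. A plausible explanation is that parts of the ELBO are evaluated numerically, so its value is noisy; under that assumption, a stopping rule based on the variational parameters may be the more stable choice.</p>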
<h2 id="comparison-with-stan">Comparison with Stan</h2>
<p>Whenever one goes down a rabbit hole of calculations, it is good to sanity check one’s results. Here, we use Stan’s variational inference scheme to check whether our results are comparable. It assumes a Gaussian variational density for each parameter after transforming them to the real line and automates inference in a “black-box” way so that no problem-specific calculations are required (see Kucukelbir, Ranganath, Gelman, & Blei, 2015). Subsequently, we compare our results to the exact posteriors obtained via Markov chain Monte Carlo. The simple linear regression model in Stan is:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">data {
int<lower=0> n;
vector[n] y;
vector[n] x;
real tau;
}
parameters {
real b;
real<lower=0> sigma;
}
model {
target += -log(sigma);
target += normal_lpdf(b | 0, sigma*tau);
target += normal_lpdf(y | b*x, sigma);
}</code></pre></figure>
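<p>To illustrate what “a Gaussian variational density after transforming to the real line” means for a positive parameter such as <code>sigma</code>, here is a minimal sketch in R. It assumes a log transform, and the values of <code>m</code> and <code>s</code> are made up for illustration; they are not taken from any Stan fit.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)

# Hypothetical variational parameters on the unconstrained scale zeta = log(sigma)
m <- log(0.99)  # mean of q(zeta), illustrative choice
s <- 0.09       # standard deviation of q(zeta), illustrative choice

zeta  <- rnorm(20000, mean = m, sd = s)  # Gaussian draws on the real line
sigma <- exp(zeta)                       # map back to the positive half-line

c(min = min(sigma), mean = mean(sigma), sd = sd(sigma))</code></pre></figure>
<p>Every draw of <code>sigma</code> is positive by construction, and the implied density on the original scale is log-normal rather than Gaussian.</p>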
<p>We use Stan’s black-box variational inference scheme:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rstan'</span><span class="p">)</span><span class="w">
</span><span class="c1"># save the above model to a file and compile it</span><span class="w">
</span><span class="c1"># model <- stan_model(file = 'regression.stan')</span><span class="w">
</span><span class="n">stan_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">dat</span><span class="p">),</span><span class="w"> </span><span class="s1">'x'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vb</span><span class="p">(</span><span class="w">
</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">output_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20000</span><span class="p">,</span><span class="w"> </span><span class="n">adapt_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w">
</span><span class="n">init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'b'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="s1">'sigma'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>This gives estimates similar to ours:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Inference for Stan model: variational-regression.
## 1 chains, each with iter=20000; warmup=0; thin=1;
## post-warmup draws per chain=20000, total post-warmup draws=20000.
##
## mean sd 2.5% 25% 50% 75% 97.5%
## b 0.28 0.13 0.02 0.19 0.28 0.37 0.54
## sigma 0.99 0.09 0.82 0.92 0.99 1.05 1.18
## lp__ 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## Approximate samples were drawn using VB(meanfield) at Mon Nov 4 00:03:20 2019.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## We recommend genuine 'sampling' from the posterior distribution for final inferences!</code></pre></figure>
<p>Their recommendation is prudent. If you run the code with different seeds, you can get quite different results. For example, the posterior mean of $\beta$ can range from $0.12$ to $0.45$, and the posterior standard deviation can be as low as $0.03$; in all these settings, Stan indicates that the ELBO has converged, but it seems that it has converged to a different local optimum for each run. (For seed = 3, Stan gives completely nonsensical results.) Stan warns that the algorithm is experimental and may be unstable, and it is probably wise not to use it in production.</p>
<p><em>Update</em>: As Ben Goodrich points out in the comments, there is some cool work on providing diagnostics for variational inference; see <a href="https://statmodeling.stat.columbia.edu/2018/06/27/yes-work-evaluating-variational-inference/">this</a> blog post and the paper by Yao, Vehtari, Simpson, & Gelman (<a href="https://arxiv.org/abs/1802.02538">2018</a>) as well as the paper by Huggins, Kasprzak, Campbell, & Broderick (<a href="https://arxiv.org/abs/1910.04102">2019</a>).</p>
<p>Although the posterior distribution for $\beta$ and $\sigma^2$ is available in closed form (see the <em>Post Scriptum</em>), we check our results against exact inference using Markov chain Monte Carlo by visual inspection.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<p>The Figure below overlays our closed-form results on the histogram of posterior samples obtained using Stan.</p>
<p><img src="/assets/img/2019-10-30-Variational-Inference.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>Note that the posterior variance of $\beta$ is slightly <em>overestimated</em> when using our variational scheme. This is in contrast to the fact that variational inference generally <em>underestimates</em> variances. Note also that Bayesian inference using Markov chain Monte Carlo is very fast on this simple problem. However, the comparative advantage of variational inference becomes clear by increasing the sample size: for sample sizes as large as $n = 100000$, our variational inference scheme takes less than a tenth of a second!</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen how to turn an integration problem into an optimization problem using variational inference. Assuming that the variational densities are independent, we have derived the optimal variational densities for a simple linear regression problem with one predictor. While using variational inference for this problem is unnecessary since everything is available in closed-form, I have focused on such a simple problem so as to not confound this introduction to variational inference by the complexity of the model. Still, the derivations were quite lengthy. They were also entirely specific to our particular problem, and thus generic “black-box” algorithms which avoid problem-specific calculations hold great promise.</p>
<p>We also implemented coordinate ascent mean-field variational inference (CAVI) in R and compared our results to results obtained via variational and exact inference using Stan. We have found that one probably should not trust Stan’s variational inference implementation, and that our results closely correspond to the exact procedure. For more on variational inference, I recommend the excellent review article by Blei, Kucukelbir, and McAuliffe (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>).</p>
<hr />
<p><em>I would like to thank Don van den Bergh for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="normal-inverse-gamma-distribution">Normal-inverse-gamma Distribution</h3>
<p>The posterior distribution is a <a href="https://en.wikipedia.org/wiki/Normal-inverse-gamma_distribution">Normal-inverse-gamma distribution</a>:</p>
<script type="math/tex; mode=display">p(\beta, \sigma^2 \mid \mathbf{y}) = \frac{\sqrt{\lambda}}{\sqrt{2\pi\sigma^2}} \, \frac{\gamma^{\alpha}}{\Gamma\left(\alpha\right)} \left(\sigma^2\right)^{-\alpha - 1} \text{exp}\left(-\frac{2\gamma + \lambda\left(\beta - \mu\right)^2}{2\sigma^2}\right) \enspace ,</script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mu &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\lambda &= \sum_{i=1}^n x_i^2 + \frac{1}{\tau^2} \\[.5em]
\alpha &= \frac{n}{2} \\[.5em]
\gamma &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i x_i\right)^2}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}\right) \enspace .
\end{aligned} %]]></script>
<p>Note that the marginal posterior distribution for $\beta$ is actually a Student-t distribution, whereas the mean-field variational density for $\beta$ is Gaussian.</p>
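<p>As a quick sanity check on the location parameter, the sketch below compares the closed-form $\mu$ with a numerical maximization of the conditional log posterior of $\beta$. Here $\lambda$ contains the sum of <em>squared</em> predictors, $\sum_i x_i^2$, which is what completing the square in $\beta$ produces. The simulated data and the predictor distribution are assumptions for illustration only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)

# Simulated data; drawing x from a standard normal is an assumption
n    <- 100
x    <- rnorm(n)
y    <- 0.30 * x + rnorm(n, 0, 1)
tau2 <- 0.50^2

# Quantities obtained by completing the square in beta
lambda <- sum(x^2) + 1 / tau2
mu     <- sum(y * x) / lambda

# Conditional log posterior of beta up to a constant; sigma^2 only rescales it,
# so it is fixed at 1 without changing the location of the maximum
log_post <- function(b) -0.5 * (sum((y - b * x)^2) + b^2 / tau2)
opt <- optimize(log_post, interval = c(-5, 5), maximum = TRUE, tol = 1e-8)

c(closed_form = mu, numerical = opt$maximum)</code></pre></figure>
<p>The two numbers should agree up to the tolerance of the optimizer.</p>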
<h2 id="references">References</h2>
<ul>
<li>Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>). Variational inference: A review for statisticians. <em>Journal of the American Statistical Association, 112</em>(518), 859-877.</li>
<li>Huggins, J. H., Kasprzak, M., Campbell, T., & Broderick, T. (<a href="https://arxiv.org/abs/1910.04102">2019</a>). Practical Posterior Error Bounds from Variational Objectives. <em>arXiv preprint</em> arXiv:1910.04102.</li>
<li>Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (<a href="http://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan">2015</a>). Automatic variational inference in Stan. In <em>Advances in Neural Information Processing Systems</em> (pp. 568-576).</li>
<li>Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (<a href="http://www.jmlr.org/papers/volume18/16-107/16-107.pdf">2017</a>). Automatic differentiation variational inference. <em>The Journal of Machine Learning Research, 18</em>(1), 430-474.</li>
<li>Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (<a href="https://arxiv.org/abs/1802.02538">2018</a>). Yes, but did it work?: Evaluating variational inference. <em>arXiv preprint</em> arXiv:1802.02538.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The first part of this blog post draws heavily on the excellent review article by Blei, Kucukelbir, and McAuliffe (2017), and so I use their (machine learning) notation. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Harry Potter and the Power of Bayesian Constrained Inference (2019-09-28T08:00:00+00:00, https://fabiandablander.com/r/Bayes-Potter)</p>
<p>If you are reading this, you are probably a Ravenclaw. Or a Hufflepuff. Certainly not a Slytherin … but maybe a Gryffindor?</p>
<p>In this blog post, we let three subjective Bayesians predict the outcome of ten coin flips. We will derive prior predictions, evaluate their accuracy, and see how fortune favours the bold. We will also discover a neat trick that allows one to easily compute Bayes factors for models with parameter restrictions compared to models without such restrictions, and use it to answer a question we truly care about: are Slytherins really the bad guys?</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>As in a <a href="https://fabiandablander.com/r/Regularization.html">previous blog post</a>, we start by studying coin flips. Let $\theta \in [0, 1]$ be the bias of the coin and let $y$ denote the number of heads out of $n$ coin flips. We use the Binomial likelihood</p>
<script type="math/tex; mode=display">p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,</script>
<p>and a Beta prior for $\theta$:</p>
<script type="math/tex; mode=display">p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .</script>
<p>This prior is <em>conjugate</em> for this likelihood which means that the posterior is again a Beta distribution. The Figure below shows two examples of this.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In this blog post, we will use a <em>prior predictive</em> perspective on model comparison by means of Bayes factors. For an extensive contrast with a perspective based on <em>posterior prediction</em>, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this blog post</a>. The Bayes factor indicates how much better a model $\mathcal{M}_1$ predicts the data $y$ <em>relative to another model</em> $\mathcal{M}_0$:</p>
<script type="math/tex; mode=display">\text{BF}_{10} = \frac{p(y \mid \mathcal{M}_1)}{p(y \mid \mathcal{M}_0)} \enspace ,</script>
<p>where we can write the <em>marginal likelihood</em> of a generic model $\mathcal{M}$ in expanded form to make its dependence on the model’s priors explicit:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .</script>
<p>After these preliminaries, in the next section, we visit Ron, Harry, and Hermione in Hogwarts.</p>
<h1 id="the-hogwarts-prediction-contest">The Hogwarts prediction contest</h1>
<p>Ron, Harry, and Hermione just came back from a straining adventure — Death Eaters and all. They deserve a break, and Hermione suggests a small prediction contest to relax. Ron is put off initially; relaxing by thinking? That’s not his style. Harry does not care either way; both are eventually convinced.</p>
<p>The goal of the contest is to accurately predict the outcome of $n = 10$ coin flips. Luckily, this is not a particularly complicated problem to model, and we can use the Binomial likelihood we have discussed above. In the next section, Ron, Harry, and Hermione — all subjective Bayesians — clearly state their prior beliefs, which is required to make predictions.</p>
<h2 id="prior-beliefs">Prior beliefs</h2>
<p>Ron is not big on thinking, and so trusts his previous intuitions that coins are usually unbiased; he specifies a point mass on $\theta = 0.50$ as his prior. Harry spreads his bets evenly, and believes that all chances governing the coin flip’s outcome are equally likely; he puts a uniform prior on $\theta$. Hermione, on the other hand, believes that the coin <em>cannot</em> be biased towards tails; instead, she believes that all values $\theta \in [0.50, 1]$ are equally likely. She thinks this because Dobby — the house elf — is the one who throws the coin, and she has previously observed him passing time by flipping coins, which strangely almost always landed up heads. To sum up, their priors are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{Ron} &: \theta = 0.50 \\[.5em]
\text{Harry} &: \theta \sim \text{Beta}(1, 1) \\[.5em]
\text{Hermione} &: \theta \sim \text{Beta}(1, 1)\mathbb{I}(0.50, 1) \enspace ,
\end{aligned} %]]></script>
<p>which are visualized in the Figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In the next section, the three use their beliefs to make probabilistic predictions.</p>
<h2 id="prior-predictions">Prior predictions</h2>
<p>Ron, Harry, and Hermione are subjective Bayesians and therefore evaluate their performance by their respective predictive accuracy. Each of the trio has a <em>prior predictive distribution</em>. For Ron, true to character, this is the easiest to derive. We associate model $\mathcal{M}_0$ with him and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_0) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_0) \, p(\theta \mid \mathcal{M}_0) \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} 0.50^y (1 - 0.50)^{n - y} \enspace ,
\end{aligned} %]]></script>
<p>where the integral — the sum! — is just over the value $\theta = 0.50$. Ron’s prior predictive distribution is simply a Binomial distribution. He is delighted by this fact, and enjoys a short rest while the others derive their predictions.</p>
<p>It is Harry’s turn, and he is a little put off by his integration problem. However, he realizes that the integrand is an unnormalized Beta distribution, and swiftly writes down its normalizing constant, the Beta function. Associating $\mathcal{M}_1$ with him, his steps are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_1) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \frac{1}{\text{B}(1, 1)} \theta^{1 - 1} (1 - \theta)^{1 - 1} \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} \text{B}(y + 1, n - y + 1) \enspace ,
\end{aligned} %]]></script>
<p>which is a <a href="https://en.wikipedia.org/wiki/Beta-binomial_distribution">Beta-Binomial distribution</a> with $\alpha = \beta = 1$.</p>
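<p>Harry’s result can be checked numerically. The following sketch evaluates his closed-form prior predictive for all possible outcomes and, for comparison, Ron’s Binomial prediction; the variable names <code>harry</code> and <code>ron</code> are chosen here purely for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">n <- 10
y <- 0:n

# Harry: Binomial likelihood averaged over a uniform Beta(1, 1) prior
harry <- choose(n, y) * beta(y + 1, n - y + 1)

# Ron: Binomial with theta fixed at 0.50, for comparison
ron <- dbinom(y, size = n, prob = 0.50)

round(rbind(y = y, Ron = ron, Harry = harry), 3)</code></pre></figure>
<p>Because ${n \choose y} \, \text{B}(y + 1, n - y + 1) = 1 / (n + 1)$, Harry assigns the same probability, $1/11$, to every outcome from $0$ to $10$ heads, whereas Ron concentrates his predictions around $5$.</p>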
<p>Hermione’s integral is the most complicated of the three, but she is also the smartest of the bunch. She is a master of the wizardry that is computer programming, which allows her to solve the integral numerically.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> We associate $\mathcal{M}_r$, which stands for <em>restricted</em> model, with her and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_r) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta \\[.5em]
&= \int_{0.50}^1 {n \choose y} \theta^y (1 - \theta)^{n - y} \, 2 \, \mathrm{d}\theta \\[.5em]
&= 2{n \choose y}\int_{0.50}^1 \theta^y (1 - \theta)^{n - y} \mathrm{d}\theta \enspace .
\end{aligned} %]]></script>
<p>We can draw from the prior predictive distributions by simulating from the prior and then making predictions through the likelihood. For Hermione, for example, this yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">nr_draws</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w">
</span><span class="n">theta_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theta_Hermione</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 10 10 10 3 7 10 8 9 6 9 9 6 8 9 8 10 6 10 5 7</code></pre></figure>
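<p>With only 20 draws, the shape of the distribution is hard to see. The sketch below uses many more draws and compares the relative frequencies with the probabilities implied by Hermione’s integral, evaluated here with <code>pbeta</code>. The seed and the number of draws are arbitrary choices made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)  # arbitrary seed, only for reproducibility
nr_draws <- 1e5
theta <- runif(nr_draws, min = 0.50, max = 1)
y_sim <- rbinom(nr_draws, size = 10, prob = theta)

# Relative frequency of each outcome y = 0, ..., 10
simulated <- as.numeric(table(factor(y_sim, levels = 0:10))) / nr_draws

# Exact prior predictive mass via the incomplete Beta function
exact <- sapply(0:10, function(y) {
  2 * choose(10, y) * beta(y + 1, 11 - y) *
    pbeta(0.50, y + 1, 11 - y, lower.tail = FALSE)
})

round(cbind(y = 0:10, simulated, exact), 3)</code></pre></figure>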
<p>Let’s visualize Ron’s, Harry’s, and Hermione’s prior predictive distributions to get a better feeling for what they believe are likely coin flip outcomes. First, we implement their prior predictions in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Ron</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="n">n</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Harry</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">int</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">theta</span><span class="o">^</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta</span><span class="p">)</span><span class="o">^</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">int</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Even though Ron believes that $\theta = 0.50$, this does not mean that his prior prediction puts all mass on $y = 5$; deviations from this value are plausible. Harry’s prior predictive distribution also makes sense: since he believes all values for $\theta$ to be equally likely, he should believe all outcomes are equally likely. Hermione, on the other hand, believes that $\theta \in [0.50, 1]$, so her prior probabilities for outcomes with few heads ($y < 5$) drastically decrease.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>After the three have clearly stated their prior beliefs and derived their prior predictions, Dobby throws a coin ten times. The coin comes up heads nine times. In the next section, we discuss the relative predictive performance of Ron, Harry, and Hermione based on these data.</p>
<h2 id="evaluating-predictions">Evaluating predictions</h2>
<p>To assess the relative predictive performance of Ron, Harry, and Hermione, we need to compute the probability mass of $y = 9$ for their respective prior predictive distributions. Compared to Ron, Hermione did roughly 19 times better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 18.50909</code></pre></figure>
<p>Harry, on the other hand, did about 9 times better than Ron:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 9.309091</code></pre></figure>
<p>With these two comparisons, we also know by how much Hermione outperformed Harry, since by transitivity we have:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_0)} \times \frac{p(y \mid \mathcal{M}_0)}{p(y \mid \mathcal{M}_1)} = \text{BF}_{r0} \times \frac{1}{\text{BF}_{10}} \approx 2 \enspace ,</script>
<p>which is indeed correct:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>Note that this is also immediately apparent from the visualizations above, where Hermione’s allocated probability mass is about twice as large as Harry’s for the case where $y = 9$.</p>
<p>Hermione was bold in her prediction, and was rewarded with being favoured by a factor of two in predictive performance. Note that if her predictions had been even bolder, say restricting her prior to $\theta \in [0.80, 1]$, she would have reaped an even higher reward than a Bayes factor of two in her favour. Contrast this with Dobby throwing the coin ten times and only one heads showing. Then Harry’s marginal likelihood is still ${10 \choose 1} \text{B}(2, 10) = \frac{1}{11}$. However, Hermione’s is not twice as large; instead, it is a mere $0.001065$, which would result in a Bayes factor of about $85$ in favour of Harry!</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Hermione</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 85.33333</code></pre></figure>
<p>This means that with bold predictions, one can also lose a lot. However, this is tremendously insightful, since Hermione would immediately realize where she went wrong. For a discussion that also points out the flexibility of Bayesian model comparison, see Etz, Haaf, Rouder, & Vandekerckhove (2018).</p>
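<p>To make the trade-off between boldness and risk concrete, the sketch below computes the prior predictive mass under a uniform prior restricted to $[\text{lower}, 1]$ and compares the restrictions $[0.50, 1]$ and $[0.80, 1]$ against Harry. The function <code>restricted</code> and its argument <code>lower</code> are illustrative names introduced only for this example.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Prior predictive mass for a uniform prior on [lower, 1]
restricted <- function(y, lower, n = 10) {
  kernel <- function(theta) theta^y * (1 - theta)^(n - y)
  prior_density <- 1 / (1 - lower)
  prior_density * choose(n, y) * integrate(kernel, lower, 1)$value
}

# Harry's prior predictive mass under the unrestricted uniform prior
harry <- function(y, n = 10) choose(n, y) * beta(y + 1, n - y + 1)

# Bayes factors relative to Harry for nine heads and for one head
bf_nine <- c(restricted(9, 0.50), restricted(9, 0.80)) / harry(9)
bf_one <- c(restricted(1, 0.50), restricted(1, 0.80)) / harry(1)
bf_nine  # the bolder restriction gains more when y = 9 ...
bf_one   # ... and loses far more when y = 1</code></pre></figure>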
<p>In the next section, we will discover a nice trick which simplifies the computation of the Bayes factor; we do not need to derive marginal likelihoods, but can simply look at the prior and the posterior distribution of the parameter of interest in the unrestricted model.</p>
<h1 id="prior--posterior-trick">Prior / Posterior trick</h1>
<p>As it turns out, the relative predictive performance of Hermione compared to Harry is given by the ratio of the purple area to the blue area in the figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>In other words, the Bayes factor in favour of the <em>restricted</em> model (i.e., Hermione) compared to the <em>unrestricted</em> or <em>encompassing</em> model (i.e., Harry) is given by the posterior probability of $\theta$ being in line with the restriction compared to the prior probability of $\theta$ being in line with the restriction. We can check this numerically:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># (1 - pbeta(0.50, 10, 2)) / 0.50 would also work</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">dbeta</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">0.50</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>This is a very cool result which, to my knowledge, was first described in Klugkist, Kato, & Hoijtink (2005). In the next section, we prove it.</p>
<h2 id="proof">Proof</h2>
<p>The proof uses two insights. First, note that we can write the priors in the restricted model, $\mathcal{M}_r$, as priors in the encompassing model, $\mathcal{M}_1$, subject to some constraints. In the Hogwarts prediction context, Hermione’s prior was a restricted version of Harry’s prior:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid \mathcal{M}_r) &= p(\theta \mid \mathcal{M}_1)\mathbb{I}(0.50, 1) \\[1em]
&= \begin{cases} \frac{p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} & \text{if} \hspace{1em} \theta \in [0.50, 1] \\[1em] 0 & \text{otherwise}\end{cases}
\end{aligned} %]]></script>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>We have to divide by the term</p>
<script type="math/tex; mode=display">K = \int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta = 0.50 \enspace ,</script>
<p>so that the restricted prior integrates to 1, as all proper probability distributions must. As a direct consequence, note that the density of a value $\theta = \theta^{\star}$ is given by:</p>
<script type="math/tex; mode=display">p(\theta^{\star} \mid \mathcal{M}_r) = p(\theta^{\star} \mid \mathcal{M}_1) \cdot \frac{1}{K} \enspace ,</script>
<p>where $K$ is the renormalization constant. This means that we can rewrite terms which include the restricted prior in terms of the unrestricted prior from the encompassing model. This also holds for the posterior!</p>
<p>To see that we can also write the restricted posterior in terms of the unrestricted posterior from the encompassing model, note that the likelihood is the same under both models and that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid y, \mathcal{M}_r) &= \frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K}}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \\[.5em]
&= \frac{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta}}{\int_{0.50}^1 \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(\theta \mid y, \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta} \enspace ,
\end{aligned} %]]></script>
<p>where we have to renormalize by</p>
<script type="math/tex; mode=display">Z = \int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta \enspace ,</script>
<p>which is</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pbeta</span><span class="p">(</span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 0.9941406</code></pre></figure>
<p>The figure below visualizes Harry’s and Hermione’s posterior. Sensibly, since Hermione excluded all $\theta \in [0, 0.50)$ in her prior, such values receive zero credence in her posterior. However, the difference between Harry’s and Hermione’s posterior distributions is small compared with the difference between their prior distributions. This is reflected in $Z$ being close to 1.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>Similar to the prior, we can write the density of a value $\theta = \theta^\star$ in terms of the encompassing model:</p>
<script type="math/tex; mode=display">p(\theta^{\star} \mid y, \mathcal{M}_r) = p(\theta^{\star} \mid y, \mathcal{M}_1) \cdot \frac{1}{Z} \enspace .</script>
<p>Now that we have established that we can write both the prior and the posterior density of parameters in the restricted model in terms of the parameters in the unrestricted model, as a second step, note that we can swap the posterior and the marginal likelihood terms in Bayes’ rule such that:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}_1) = \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)} \enspace ,</script>
<p>from which it follows that:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_1)} = \frac{\frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{p(\theta \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)}} \enspace .</script>
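<p>The rearranged form of Bayes’ rule holds for every value of $\theta$, which is easy to check numerically for Harry. With a uniform prior and nine heads in ten flips, his posterior is a $\text{Beta}(10, 2)$ distribution, so the ratio of likelihood times prior to posterior should return his marginal likelihood of $\frac{1}{11}$ wherever we evaluate it. A minimal sketch, with an arbitrary grid of evaluation points:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">theta <- c(0.10, 0.30, 0.50, 0.70, 0.90)  # arbitrary evaluation points

likelihood <- dbinom(9, size = 10, prob = theta)
prior <- dbeta(theta, 1, 1)       # Harry's uniform prior
posterior <- dbeta(theta, 10, 2)  # his posterior after nine heads in ten flips

marginal <- likelihood * prior / posterior
marginal  # the same value at every theta
all.equal(marginal, rep(1 / 11, length(theta)))</code></pre></figure>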
<p>Now suppose that we have values that are in line with the restriction, i.e., $\theta = \theta^{\star}$. Then:</p>
<script type="math/tex; mode=display">\begin{aligned}
\text{BF}_{r1} = \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star\mid \mathcal{M}_r)}{p(\theta^\star \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star \mid \mathcal{M}_1) \, \frac{1}{K}}{p(\theta^\star \mid y, \mathcal{M}_1) \, \frac{1}{Z}}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, \frac{1}{K}}{\frac{1}{Z}}}{p(y \mid \theta^\star, \mathcal{M}_1)} = \frac{\frac{1}{K}}{\frac{1}{Z}} = \frac{Z}{K} \enspace ,
\end{aligned}</script>
<p>where we have used the previous insights and the fact that the likelihood under $\mathcal{M}_r$ and $\mathcal{M}_1$ is the same. If we expand the constants for our previous problem, we have:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{Z}{K} = \frac{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace ,</script>
<p>which is, as claimed above, the posterior probability of values for $\theta$ that are in line with the restriction divided by the prior probability of values for $\theta$ that are in line with the restriction. Note that this holds for arbitrary restrictions of an arbitrary number of parameters (see Klugkist, Kato, & Hoijtink, 2005). In the limit where we take the restriction to be infinitesimally small, that is, constrain the parameter to be a point value, this results in the Savage-Dickey density ratio (Wetzels, Grasman, & Wagenmakers, 2010).</p>
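<p>For the coin example, this ratio can be computed exactly with <code>pbeta</code>, or estimated as the proportion of prior and posterior draws that satisfy the restriction. The sketch below does both; the function name <code>bf_encompassing</code>, its arguments, the seed, and the number of draws are illustrative assumptions rather than part of the original analysis.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Bayes factor for theta in [lower, upper] against an encompassing Beta(a, b) prior
bf_encompassing <- function(lower, upper, y, n, a = 1, b = 1) {
  prior_mass <- pbeta(upper, a, b) - pbeta(lower, a, b)
  posterior_mass <- pbeta(upper, a + y, b + n - y) - pbeta(lower, a + y, b + n - y)
  posterior_mass / prior_mass
}

exact <- bf_encompassing(0.50, 1, y = 9, n = 10)

# Sampling-based estimate: proportion of draws that satisfy the restriction
set.seed(1)  # arbitrary seed
nr_draws <- 1e5
prior_draws <- rbeta(nr_draws, 1, 1)
posterior_draws <- rbeta(nr_draws, 1 + 9, 1 + 10 - 9)
estimate <- mean(posterior_draws >= 0.50) / mean(prior_draws >= 0.50)

c(exact = exact, estimate = estimate)</code></pre></figure>
<p>The sampling version is the one that carries over to models without closed-form posteriors, since it only requires draws from the encompassing model.</p>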
<p>In the next section, we apply this idea to a data set that relates Hogwarts Houses to personality traits.</p>
<h1 id="hogwarts-houses-and-personality">Hogwarts Houses and personality</h1>
<p>So, are you a Slytherin, Hufflepuff, Ravenclaw, or Gryffindor? And what does this say about your personality?</p>
<p>Inspired by Crysel et al. (2015), Lea Jakob, Eduardo Garcia-Garzon, Hannes Jarke, and I analyzed self-reported personality data from 847 people as well as their self-reported Hogwarts House affiliation.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> We wanted to answer questions such as: do people who report belonging to Slytherin tend to score highest on Narcissism, Machiavellianism, and Psychopathy? Are Hufflepuffs the most agreeable, and Gryffindors the most extraverted? The figure below visualizes the raw data.</p>
<div style="text-align:center;">
<img src="../assets/img/Potter-Personality.png" align="center" style="margin-top: -10px; padding-bottom: 0px;" width="680" height="540" />
</div>
<p>We used a between-subjects ANOVA as our model and, taking Agreeableness as an example, compared the following hypotheses:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{H}_0&: \mu_H = \mu_G = \mu_R = \mu_S \\[.5em]
\mathcal{H}_r&: \mu_H > (\mu_G , \mu_R , \mu_S) \\[.5em]
\mathcal{H}_1&: \mu_H , \mu_G , \mu_R , \mu_S
\end{aligned} %]]></script>
<p>We used the BayesFactor R package to compute the Bayes factor in favour of $\mathcal{H}_1$ compared to $\mathcal{H}_0$. For the restricted hypotheses $\mathcal{H}_r$, we used the prior/posterior trick outlined above; and indeed, we found strong evidence in favour of the notion that, for example, Hufflepuffs score highest on Agreeableness. Curious about Slytherin and the other Houses? You can read the published paper with all the details <a href="https://www.collabra.org/article/10.1525/collabra.240/">here</a>.</p>
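<p>To illustrate how the prior/posterior trick extends to an order restriction such as $\mathcal{H}_r$, the sketch below uses made-up posterior draws rather than the data or the model from the paper. The matrix <code>posterior_draws</code>, the group means, and the standard deviation are all hypothetical stand-ins. Under a prior that treats the four means exchangeably, each mean is the largest with probability $\frac{1}{4}$, so only the posterior proportion has to be estimated.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)  # arbitrary seed
nr_draws <- 1e4

# Hypothetical stand-in for posterior draws of the four House means
# (in a real analysis these would come from the fitted encompassing model)
posterior_draws <- cbind(
  H = rnorm(nr_draws, mean = 0.40, sd = 0.10),
  G = rnorm(nr_draws, mean = 0.10, sd = 0.10),
  R = rnorm(nr_draws, mean = 0.00, sd = 0.10),
  S = rnorm(nr_draws, mean = -0.20, sd = 0.10)
)

# Proportion of posterior draws in which the Hufflepuff mean is the largest
posterior_prop <- mean(max.col(posterior_draws, ties.method = "first") == 1)
prior_prop <- 1 / 4  # by symmetry under an exchangeable prior on the four means

bf_r1 <- posterior_prop / prior_prop
bf_r1  # bounded above by 4 for this restriction</code></pre></figure>
<p>Because the posterior proportion cannot exceed one, the Bayes factor in favour of this restriction over the encompassing model is capped at the inverse of the prior proportion.</p>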
<h1 id="conclusion">Conclusion</h1>
<p>Participating in a relaxing prediction contest, we saw how three subjective Bayesians named Ron, Harry, and Hermione formalized their beliefs and derived their predictions about the likely outcome of ten coin flips. By restricting her prior beliefs about the bias of the coin to exclude values smaller than $\theta = 0.50$, Hermione was the boldest in her predictions and was ultimately rewarded. However, if the outcome of the coin flips had turned out differently, say $y = 2$, then Hermione would have immediately realized how wrong her beliefs were. I think we as scientists need to be more like Hermione: we need to make more precise predictions, allowing us to construct more powerful tests and “fail” in insightful ways.</p>
<p>We also saw a neat trick by which one can compute the Bayes factor in favour of a restricted model compared to an unrestricted model by estimating the proportion of prior and posterior values of the parameter that are in line with the restriction — no painstaking computation of marginal likelihoods required! We used this trick to find evidence for what we all knew deep in our hearts already: Hufflepuffs are <em>so</em> agreeable.</p>
<hr />
<p><em>I would like to thank Sophia Crüwell and Lea Jakob for helpful comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Klugkist, I., Kato, B., & Hoijtink, H. (<a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9574.2005.00279.x">2005</a>). Bayesian model selection using encompassing priors. <em>Statistica Neerlandica, 59</em>(1), 57-69.</li>
<li>Wetzels, R., Grasman, R. P., & Wagenmakers, E. J. (<a href="https://www.sciencedirect.com/science/article/pii/S0167947310001180">2010</a>). An encompassing prior generalization of the Savage–Dickey density ratio. <em>Computational Statistics & Data Analysis, 54</em>(9), 2094-2102.</li>
<li>Etz, A., Haaf, J. M., Rouder, J. N., & Vandekerckhove, J. (<a href="https://journals.sagepub.com/doi/full/10.1177/2515245918773087">2018</a>). Bayesian inference and testing any hypothesis you can specify. <em>Advances in Methods and Practices in Psychological Science, 1</em>(2), 281-295.</li>
<li>Crysel, L. C., Cook, C. L., Schember, T. O., & Webster, G. D. (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0191886915002615">2015</a>). Harry Potter and the measures of personality: Extraverted Gryffindors, agreeable Hufflepuffs, clever Ravenclaws, and manipulative Slytherins. <em>Personality and Individual Differences, 83</em>, 174-179.</li>
<li>Jakob, L., Garcia-Garzon, E., Jarke, H., & Dablander, F. (<a href="https://www.collabra.org/article/10.1525/collabra.240/">2019</a>). The Science Behind the Magic? The Relation of the Harry Potter “Sorting Hat Quiz” to Personality and Human Values. <em>Collabra: Psychology, 5</em>(1).</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The analytical solution is <a href="https://www.wolframalpha.com/input/?i=Integral%5Btheta%5Ey+*+%281+-+theta%29%5E%28n+-+y%29%2C+theta%2C+0.50%2C+1%5D">unpleasant</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You can discover your Hogwarts House affiliation at <a href="https://www.pottermore.com/">https://www.pottermore.com/</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fabian Dablander
Love affairs and linear differential equations 2019-08-29T11:00:00+00:00 https://fabiandablander.com/r/Linear-Love
<p>Differential equations are a powerful tool for modeling how systems change over time, but they can be a little hard to get into. Love, on the other hand, is humanity’s perennial topic; some even claim it is all you need. In this blog post, inspired by Strogatz (1988, 2015), I will introduce linear differential equations as a means to study the types of love affairs two people might find themselves in.</p>
<p>Do opposites attract? What happens to a relationship if lovers are out of touch with their own feelings? We will answer these and other questions using two coupled linear differential equations. On our journey, we will use graphical as well as mathematical methods to classify the types of relationships this modeling framework can accommodate. In a follow-up blog post, we will also play around with non-linear terms and add a third wheel to the mix, which can lead to chaos — in the technical sense of the term, of course. Excited? Then let’s get started!</p>
<h1 id="introducing-romeo">Introducing Romeo</h1>
<blockquote>
A lovestruck Romeo sang the streets of serenade <br />
Laying everybody low with a love song that he made <br />
Finds a streetlight, steps out of the shade <br />
Says something like, "You and me, babe, how about it?"
</blockquote>
<p>Romeo is quite the emotional type. Let $R(t)$ denote his feelings at time point $t$. Following common practice, we will usually write $R$ instead of $R(t)$, making the time dependence implicit. The process which describes how Romeo’s feelings change is rather simple: it depends only on Romeo’s current feelings. We write:</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}R}{\mathrm{d}t} = aR \enspace ,</script>
<p>which is a linear differential equation. Note that this <em>implicitly</em> encodes how Romeo’s feelings change over time, since when we know $R$ at time point $t$, we can compute the direction and speed with which $R$ will change — the derivative denotes velocity. Our goal, however, is to find an explicit, closed-form expression for Romeo’s feelings at time point $t$. In this particular case, we can do this analytically:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR \\[.5em]
\frac{1}{aR}\mathrm{d}R &= \mathrm{d}t \\[.5em]
\frac{1}{a}\int \frac{1}{R}\mathrm{d}R &= \int \mathrm{d}t \\[.5em]
\frac{1}{a} \left[\text{log} \, R + C \right] &= t \\[.5em]
\text{log} \, R &= a t - C \\[.5em]
R &= e^{at - C} \enspace .
\end{aligned} %]]></script>
<p>A differential equation describes how something changes; to kickstart the process, we need an initial condition $R_0$. This allows us to find the constant of integration $C$. In particular, assume that $R = R_0$ at $t = 0$, which leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R_0 &= e^{-C} \\[.5em]
\text{log} \, R_0 &= -C \enspace ,
\end{aligned} %]]></script>
<p>such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R &= e^{at + \text{log} \, R_0} \\[.5em]
R &= R_0 e^{at} \enspace .
\end{aligned} %]]></script>
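<p>As a quick sanity check on this closed form, the following R sketch integrates the differential equation with a simple forward Euler scheme and compares the result to $R_0 e^{at}$. The values of $a$ and $R_0$ and the step size are illustrative assumptions, not the values used for the figures in this post.</p>

```r
# Assumed illustrative values (not from the post's figures)
a  <- -0.5
R0 <- 1

# closed-form solution R(t) = R0 * exp(a * t)
R_exact <- function(t) R0 * exp(a * t)

# forward Euler integration of dR/dt = a * R with a small step size
dt <- 0.001
ts <- seq(0, 5, by = dt)
R_euler <- numeric(length(ts))
R_euler[1] <- R0
for (i in seq_along(ts)[-1]) {
  R_euler[i] <- R_euler[i - 1] + dt * a * R_euler[i - 1]
}

# largest absolute deviation between the numerical and the analytic solution
max_err <- max(abs(R_euler - R_exact(ts)))
max_err
```

<p>The deviation shrinks as the step size is reduced, which is what we would expect if the closed-form expression is indeed the solution.</p>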
<p>The left two panels of the figure below visualize how Romeo’s feelings change over time for $a > 0$ with initial condition $R_0 = 1$ (top) or $R_0 = -1$ (bottom). The right two panels show how his feelings change for $a < 0$ with $R_0 = 100$ (top) or $R_0 = -100$ (bottom).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>We conclude: Romeo is a simple guy, albeit with an emotion regulation problem. When the object of his affection is such that $a > 0$, his feelings will either grow exponentially towards mad love if he starts out with a positive first impression ($R_0 > 0$), or grow exponentially towards mad hatred if he starts out with a negative first impression ($R_0 < 0$). On the other hand, if $a < 0$, then regardless of his initial feelings, they will decay exponentially towards indifference.</p>
<p>For $R_0 = 0$, Romeo’s feelings never change. For any other initial condition, we have unhindered exponential growth when $a > 0$, which never stops, and exponential decay towards zero when $a < 0$. Thus $R = 0$ is a <em>fixed point</em> in both cases, which is <em>stable</em> for $a < 0$ but becomes <em>unstable</em> if $a > 0$. We can visualize this in <em>phase space</em> on a line. The phase space is filled with all possible trajectories because each point can serve as the initial condition.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In the next section, a wonderful new episode in Romeo’s life begins: he meets Juliet.</p>
<h1 id="introducing-juliet">Introducing Juliet</h1>
<blockquote>
Juliet says, "Hey, it's Romeo, you nearly gave me a heart attack" <br />
He's underneath the window, she's singing, "Hey, la, my boyfriend's back <br />
You shouldn't come around here singing up at people like that <br />
Anyway, what you gonna do about it?"
</blockquote>
<p>Life becomes more complicated for Romeo now that Juliet is in his life. It is their first real relationship, and they have much to learn. We start simple. Let $J$ denote Juliet’s feelings for Romeo, and let $R$ denote Romeo’s feelings for Juliet. We can extend our single linear differential equation from above to a system of two linear differential equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= dJ \enspace .
\end{aligned} %]]></script>
<p>Using the results from above, the solutions to the two differential equations are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R(t) &= R_0 e^{at} \\[.5em]
J(t) &= J_0 e^{dt} \enspace ,
\end{aligned} %]]></script>
<p>where $R(t)$ and $J(t)$ give the trajectories of love for Romeo and Juliet, respectively, and $J_0$ is Juliet’s initial feeling towards Romeo at $t = 0$. In contrast to the one-dimensional phase diagram from above, we now have a two-dimensional picture which is known as a <em>vector field</em>.</p>
<p>Analogously to the case of a single differential equation, $a < 0$ and $d < 0$ imply exponential decay for Romeo and Juliet’s love, and $a > 0$ and $d > 0$ imply exponential growth. The left figure below visualizes decay: whatever the initial state of their love, it will crash into the origin of indifference. The figure on the right visualizes growth: whatever the initial state, except for indifference, their feelings will grow exponentially and eventually consume them.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>This can result in happy, ever increasing love if they start out liking each other (top right quadrant), but can also result in an increasingly violent feud if they start out disliking each other (bottom left quadrant). For asymmetric starts, one of them will be hopelessly in love with the other, while the other’s hate grows unboundedly. The fixed point (0, 0) is <em>stable</em> on the left, as any tiny perturbation will move the system towards it. In contrast, the fixed point on the right is <em>unstable</em>, as any ounce of love or hate, no matter how small, will make the system explode. One unfortunate subtlety arises, however: if Romeo loves Juliet, but Juliet is indifferent, then Juliet will forever stay indifferent even though Romeo’s love grows without bound.</p>
<p>Another interesting case occurs when their affection is asymmetric, i.e., $a \neq d$. The figure below on the left shows one such case for negative parameters: we see that whatever feelings Juliet has for Romeo, they decay faster than the feelings Romeo has for Juliet. Moreover, since both $a < 0$ and $d < 0$, the origin is stable. The figure on the right shows a more impactful asymmetry: Romeo’s feelings decay ($a < 0$), but Juliet’s increase ($d > 0$). Regardless of what the initial feelings of Romeo are, he will always end up in a state of indifference with respect to Juliet (all arrows point to the y-axis). Juliet, on the other hand, will go increasingly mad with love or hate, depending on her initial feelings; the exception is when she starts out with indifference ($J_0 = 0$), in which case she will stay indifferent. This type of fixed point is called a <em>saddle point</em>, which occurs if there is one vector along which the system is stable (here the x-axis) and one vector along which the system is unstable (here the y-axis).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
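<p>The saddle behaviour can also be read off the closed-form solutions directly. The sketch below evaluates $R(t) = R_0 e^{at}$ and $J(t) = J_0 e^{dt}$ for $a = -1$ and $d = 1$; the initial values and the time grid are assumptions chosen purely for illustration.</p>

```r
# uncoupled system: Romeo decays (a < 0), Juliet grows (d > 0)
a <- -1
d <- 1

# assumed initial feelings, chosen for illustration only
R0 <- 2
J0 <- 0.1

# evaluate the closed-form solutions on a coarse time grid
ts <- seq(0, 5, length.out = 6)
traj <- cbind(t = ts, R = R0 * exp(a * ts), J = J0 * exp(d * ts))
round(traj, 3)

# if Juliet starts out exactly indifferent, she stays indifferent
J_indifferent <- 0 * exp(d * ts)
J_indifferent
```

<p>Romeo’s column shrinks towards zero while Juliet’s grows, except in the special case $J_0 = 0$, where her feelings stay at zero for all time points.</p>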
<p>What happens if Romeo’s feelings never change, i.e., $a = 0$? This is visualized in the figure on the left below: Romeo’s feelings will always stay at the initial point. Juliet’s feelings decrease ($d < 0$), so regardless of where she starts, the system will end up at a stable fixed point on the x-axis. A similar situation occurs if Juliet’s feelings never change, and Romeo’s feelings decay ($a < 0$), which is visualized in the figure on the right: all points on the y-axis are stable fixed points. If the moving party’s feelings were to grow rather than decay, the fixed points would be unstable. The most boring case is $a = 0$ and $d = 0$, because then every point on the plane is a fixed point: however the two lovers start, they will never change.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>Note that in all the love affairs described above, the feelings of Romeo and Juliet are actually <em>independent of each other</em>. They <em>do not communicate</em> with each other, and <em>we all know that communication is key</em>! In the next section, Romeo and Juliet’s relationship has matured, and they start taking each other seriously. Formally, we <em>couple</em> the two love birds and analyze what types of love this can set free.</p>
<h1 id="coupled-differential-equations">Coupled differential equations</h1>
<p>In the previous section, we saw that the behaviour of the system was determined entirely by the values of $a$ and $d$ — depending on whether $a$ or $d$ were positive, negative, or zero, the system would either be stable or unstable along the $R$ or $J$ dimension. Incorporating communication complicates the system, but is ultimately for the better. To model the fact that Romeo and Juliet now respond to each other’s feelings, we simply write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR + bJ\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= cR + dJ \enspace ,
\end{aligned} %]]></script>
<p>or in matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix}
\frac{\mathrm{d}R}{\mathrm{d}t} \\
\frac{\mathrm{d}J}{\mathrm{d}t}
\end{pmatrix} &= \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} R \\ J\end{pmatrix} \\[.5em]
\dot{\mathbf{x}} &= A \mathbf{x} \enspace .
\end{aligned} %]]></script>
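<p>In code, the matrix form amounts to a single matrix-vector product. The following sketch uses the coefficients of the example discussed in the next section and an assumed current state $(R, J) = (1, 1)$ to compute the direction in which the system moves.</p>

```r
# coefficient matrix with rows (a, b) and (c, d)
A <- matrix(c(-2, 1,
              -1, 2), nrow = 2, byrow = TRUE)

# assumed current feelings of Romeo and Juliet
x <- c(R = 1, J = 1)

# the derivative (dR/dt, dJ/dt) at this state is A %*% x
dx <- A %*% x
dx
```

<p>Evaluating this product on a grid of states is exactly what a vector field plot does.</p>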
<p>The classification of such a system is more difficult. In the next section, we introduce one type of relationship between a matured Romeo and Juliet that will motivate a general solution to coupled differential equations.</p>
<h2 id="the-saddle-of-love">The Saddle of love</h2>
<blockquote>
I might not be the right one <br />
It might not be the right time <br />
But there's something about us I've got to do <br />
Some kind of secret I will share with you
</blockquote>
<p>In a previous life, Juliet and Romeo did not communicate ($b = c = 0$) but listened to their own feelings in opposite ways ($a = -1$ and $d = 1$). Here we include communication:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -2R + J\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= -R + 2J \enspace .
\end{aligned} %]]></script>
<p>Specifically, Romeo dampens his feelings the more strongly he feels ($a = -2$) and listens to Juliet such that whichever way her feelings go, Romeo’s follow suit ($b = 1$). Juliet does the opposite: she increases her feelings of love or hate the more strongly she feels ($d = 2$), and responds to Romeo such that whichever way his feelings go, Juliet’s feelings move the other way ($c = -1$). In a sense, Romeo and Juliet are opposites — can any good come from this?</p>
<p>Before answering this question, we first find a general solution to systems of linear differential equations. This gives us a way to formally classify any (linear) relationships between Romeo and Juliet. The solution will involve eigenvectors and eigenvalues, so let’s roll up our sleeves and get to work!</p>
<h2 id="solving-coupled-differential-equations">Solving coupled differential equations</h2>
<p>In contrast to the first system of linear equations above where Romeo and Juliet did not communicate with each other, the system now is <em>coupled</em>: Romeo’s feelings influence Juliet’s and vice versa. If their feelings were instead independent, then the solution to the differential equations would be easy: just as above, their respective feelings would either grow or decay exponentially. The dependence between their feelings is encoded in the matrix $A$. If $A$ were diagonal, then the equations would be independent.</p>
<p>The solution to our problem thus presents itself: somehow, we must manage to make the matrix $A$ diagonal. We can do this by changing basis, a trick we have also used in deriving a <a href="https://fabiandablander.com/r/Fibonacci.html">closed-form expression of the Fibonacci numbers</a> in a previous blog post. If you are unfamiliar with these ideas, it might pay to read the previous blog post before proceeding.</p>
<p>Assuming that $A$ is <em>diagonalizable</em> (more on that later), we can write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A &= E \Lambda E^{-1} \\[.5em]
\begin{pmatrix} a & b \\ c & d \end{pmatrix} &= \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2\end{pmatrix}\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2\end{pmatrix}^{-1} \enspace ,
\end{aligned} %]]></script>
<p>where $(\lambda_1, \lambda_2)$ are the eigenvalues of $A$ and $\mathbf{v}_1$ and $\mathbf{v}_2$ are the respective eigenvectors. Conceptually, multiplying a vector with $E^{-1}$ changes its basis from the standard basis to the basis of eigenvectors. In this space, the matrix encoding the dependence between our two differential equations is the diagonal matrix of eigenvalues $\Lambda$, which means that the two differential equations are independent! We know that in this space the solutions to the differential equations are independent exponential functions. However, we have to change back to our standard basis, and we do so by multiplying with $E$.</p>
<p>With this insight, our system of differential equations becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\dot{\mathbf{x}} &= E \Lambda E^{-1} \mathbf{x} \\
E^{-1} \dot{\mathbf{x}} &= \Lambda E^{-1} \mathbf{x} \\
\dot{\mathbf{u}} &= \Lambda \mathbf{u} \enspace ,
\end{aligned} %]]></script>
<p>where we have defined $\mathbf{u} = E^{-1}\mathbf{x}$, which expresses $\mathbf{x}$ with respect to the eigenbasis. Now since:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Lambda = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \enspace , %]]></script>
<p>the solution to the two differential equations is:</p>
<script type="math/tex; mode=display">\mathbf{u} = \begin{pmatrix} C_1e^{\lambda_1 t} \\ C_2e^{\lambda_2 t} \end{pmatrix} \enspace ,</script>
<p>where $C_1$ and $C_2$ are the constants of integration, which play the role that $R_0$ and $J_0$ played in the uncoupled case. To change back to the standard basis, we multiply with $E$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x} &= E \mathbf{u} \\[.5em]
\mathbf{x} &= \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2 \end{pmatrix} \begin{pmatrix} C_1e^{\lambda_1 t} \\ C_2e^{\lambda_2 t} \end{pmatrix} \\[.5em]
\mathbf{x} &= \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace ,
\end{aligned} %]]></script>
<p>where $\mathbf{v}_1$ and $\mathbf{v}_2$ are eigenvectors and $\lambda_1$ and $\lambda_2$ are the corresponding eigenvalues. Therefore, solving a system of ordinary linear differential equations reduces to finding the eigenvalues and eigenvectors of the matrix $A$.</p>
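<p>Both steps of this argument can be checked numerically. The sketch below, which assumes the saddle-of-love matrix and the initial condition $(1, 1)^T$, first rebuilds $A$ from its eigendecomposition and then confirms by a central finite difference that $\mathbf{x}(t) = E\mathbf{u}(t)$ satisfies $\dot{\mathbf{x}} = A\mathbf{x}$. The helper name and the evaluation time $t = 1$ are hypothetical choices made for this illustration.</p>

```r
A <- matrix(c(-2, 1,
              -1, 2), nrow = 2, byrow = TRUE)

eig <- eigen(A)
E <- eig$vectors
Lambda <- diag(eig$values)

# step 1: A equals E Lambda E^{-1} up to floating point error
A_rebuilt <- E %*% Lambda %*% solve(E)
all.equal(A, A_rebuilt)

# step 2: x(t) = E (C * exp(lambda * t)) with C fixed by the initial condition
x_at <- function(t, inits = c(1, 1)) {
  C <- solve(E, inits)
  as.vector(E %*% (C * exp(eig$values * t)))
}

# compare a central finite difference of x(t) with A x(t) at t = 1
h <- 1e-6
dx_numeric <- (x_at(1 + h) - x_at(1 - h)) / (2 * h)
dx_model <- as.vector(A %*% x_at(1))
all.equal(dx_numeric, dx_model, tolerance = 1e-6)
```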
<h2 id="finding-eigenvalues-and-eigenvectors">Finding eigenvalues and eigenvectors</h2>
<p>An eigenvector of a matrix is a vector that is only stretched by the matrix by a factor of $\lambda$, such that for $\mathbf{v} \neq 0$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A\mathbf{v} &= \lambda \mathbf{v} \\[.5em]
(A - I\lambda) \mathbf{v} &= 0 \enspace ,
\end{aligned} %]]></script>
<p>which has a nonzero solution $\mathbf{v}$ only when the determinant of $(A - I\lambda)$ is zero, that is, $\left\vert A - I\lambda\right\vert = 0$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\left\vert\begin{pmatrix} a & b \\ c & d\end{pmatrix} - \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix}\right\vert &= 0 \\[1em]
\left\vert\begin{pmatrix} a - \lambda & b \\ c & d - \lambda \end{pmatrix}\right\vert &= 0 \\[1em]
(a - \lambda)(d - \lambda) - bc &= 0 \\[1em]
\lambda^2 - \lambda(a + d) + ad - bc &= 0 \enspace .
\end{aligned} %]]></script>
<p>We define:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\tau &\equiv \text{trace}(A) = a + d \\[.5em]
\Delta &\equiv \vert A\vert = ad - bc \enspace ,
\end{aligned} %]]></script>
<p>and recall the quadratic formula to find both eigenvalues:</p>
<script type="math/tex; mode=display">\lambda = \frac{\tau \pm \sqrt{\tau^2 - 4\Delta}}{2} \enspace .</script>
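<p>The formula translates into a few lines of R. The function below is a minimal sketch for $2 \times 2$ matrices; its name and the two example matrices are assumptions made for illustration. The discriminant is converted to a complex number so that the square root is also defined when $\tau^2 - 4\Delta < 0$.</p>

```r
# eigenvalues of a 2 x 2 matrix from its trace and determinant
eigen_from_trace_det <- function(A) {
  tau <- sum(diag(A))                     # trace
  Delta <- det(A)                         # determinant
  disc <- as.complex(tau^2 - 4 * Delta)   # complex so that sqrt() never fails
  c((tau + sqrt(disc)) / 2, (tau - sqrt(disc)) / 2)
}

# assumed example with two real eigenvalues
A_real <- matrix(c(1, 2,
                   3, 4), nrow = 2, byrow = TRUE)
lam_real <- eigen_from_trace_det(A_real)
lam_real
eigen(A_real)$values

# assumed example with purely imaginary eigenvalues
A_rot <- matrix(c( 0, 1,
                  -1, 0), nrow = 2, byrow = TRUE)
lam_rot <- eigen_from_trace_det(A_rot)
lam_rot
```

<p>For real eigenvalues the result agrees with what R’s eigen function returns, up to ordering and floating point error.</p>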
<p>In the next section, we apply this to the “saddle of love” differential equation in order to better understand the trajectories Romeo and Juliet’s love could take.</p>
<h2 id="solving-the-saddle-of-love">Solving the saddle of love</h2>
<p>Recall that we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -2R + J\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= -R + 2J \enspace .
\end{aligned} %]]></script>
<p>such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -2 & 1 \\ -1 & 2\end{pmatrix} \enspace . %]]></script>
<p>For our saddle of love, the eigenvalues are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda &= \frac{0 \pm \sqrt{0 - 4\cdot(-3)}}{2} = \frac{\pm \sqrt{4 \cdot 3}}{2} = \pm \sqrt{3} \enspace .
\end{aligned} %]]></script>
<p>To find the first eigenvector, we compute for the first eigenvalue $\lambda_1 = \sqrt{3}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(A - I\lambda_1)\mathbf{v}_1 &= 0 \\[.5em]
\begin{pmatrix} -2 - \sqrt{3} & 1 \\ -1 & 2 - \sqrt{3} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2\end{pmatrix} &= \begin{pmatrix} 0 \\ 0 \end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>which has solution $\mathbf{v}_1 = (1, 2 + \sqrt{3})^T$. For $\lambda_2 = -\sqrt{3}$, the eigenvector is $\mathbf{v}_2 = (1, 2 - \sqrt{3})^T$. We can verify this in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## eigen() decomposition
## $values
## [1] -1.732051 1.732051
##
## $vectors
## [,1] [,2]
## [1,] -0.9659258 -0.2588190
## [2,] -0.2588190 -0.9659258</code></pre></figure>
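<p>Note that R scales each eigenvector to unit length by dividing it by its norm, and in this case also multiplies it by $-1$; this does not matter, as eigenvectors are only defined up to a constant factor. The columns follow the order of the listed eigenvalues, so the first column corresponds to $\lambda_2 = -\sqrt{3}$ and the second to $\lambda_1 = \sqrt{3}$.</p>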
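<p>The hand-derived eigenvectors can also be checked against the defining equation $A\mathbf{v} = \lambda\mathbf{v}$ directly. The sketch below does this and then rescales them to unit length for comparison with the output above; the variable names are chosen for this illustration.</p>

```r
A <- cbind(c(-2, -1), c(1, 2))

# eigenpairs derived by hand
lambda1 <- sqrt(3)
v1 <- c(1, 2 + sqrt(3))
lambda2 <- -sqrt(3)
v2 <- c(1, 2 - sqrt(3))

# A v should equal lambda v for each pair
check1 <- all.equal(as.vector(A %*% v1), lambda1 * v1)
check2 <- all.equal(as.vector(A %*% v2), lambda2 * v2)
c(check1, check2)

# rescaling to unit length gives the columns reported by eigen(), up to sign
v1_unit <- v1 / sqrt(sum(v1^2))
v2_unit <- v2 / sqrt(sum(v2^2))
round(rbind(v1_unit, v2_unit), 4)
```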
<p>Plugging the eigenvalues and eigenvectors into our general solution form yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x} &= \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \\[.5em]
\mathbf{x} &= \begin{pmatrix} 1 \\ 2 + \sqrt{3} \end{pmatrix} C_1e^{\sqrt{3} t} + \begin{pmatrix} 1 \\ 2 - \sqrt{3} \end{pmatrix} C_2e^{-\sqrt{3}t} \enspace .
\end{aligned} %]]></script>
<p>We still need to solve for the constants $C_1$ and $C_2$. Assume that at $t = 0$, the feelings of Romeo and Juliet are $\mathbf{x} = (1, 1)^T$. Then we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 \\ 1 \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ 2 + \sqrt{3} & 2 - \sqrt{3} \end{pmatrix} \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \\[.5em]
\begin{pmatrix} 1 & 1 \\ 2 + \sqrt{3} & 2 - \sqrt{3} \end{pmatrix}^{-1}\begin{pmatrix} 1 \\ 1 \end{pmatrix} &= \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \\[.5em]
\begin{pmatrix} 0.21 \\ 0.79 \end{pmatrix} &= \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>which yields the following solutions for Romeo and Juliet:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R(t) &= 0.21 \cdot e^{\sqrt{3} t} + 0.79 \cdot e^{-\sqrt{3} t} \\[.5em]
J(t) &= 0.79 \cdot e^{\sqrt{3} t} + 0.21 \cdot e^{-\sqrt{3} t} \enspace .
\end{aligned} %]]></script>
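<p>These numbers are easy to reproduce. The sketch below solves the linear system for $C_1$ and $C_2$ using the unnormalized eigenvectors and then evaluates the two trajectories; the time grid is an arbitrary assumption.</p>

```r
# eigenvectors as columns (unnormalized, as derived by hand) and eigenvalues
E <- cbind(c(1, 2 + sqrt(3)), c(1, 2 - sqrt(3)))
lambdas <- c(sqrt(3), -sqrt(3))

# constants for the initial condition (R, J) = (1, 1) at t = 0
C <- solve(E, c(1, 1))
round(C, 2)

# closed-form trajectories on an assumed time grid
ts <- c(0, 0.5, 1)
R_t <- E[1, 1] * C[1] * exp(lambdas[1] * ts) + E[1, 2] * C[2] * exp(lambdas[2] * ts)
J_t <- E[2, 1] * C[1] * exp(lambdas[1] * ts) + E[2, 2] * C[2] * exp(lambdas[2] * ts)
cbind(t = ts, R = R_t, J = J_t)
```

<p>At $t = 0$ both trajectories start at $1$, as required by the initial condition, and the products of the second eigenvector components with the constants give the coefficients $0.79$ and $0.21$ in Juliet’s solution.</p>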
<p>Note how this result differs from when Romeo and Juliet did not communicate: the solution is a linear combination of two exponentials — the two lovebirds are clearly coupled! Now that we have seen one worked example, the code below computes the trajectory of Romeo and Juliet for an arbitrary matrix $A$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_linear</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">tmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># compute eigenvectors and eigenvalues</span><span class="w">
</span><span class="n">eig</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">vectors</span><span class="w">
</span><span class="n">lambdas</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">values</span><span class="w">
</span><span class="c1"># solve for the initial condition</span><span class="w">
</span><span class="n">C</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">inits</span><span class="w">
</span><span class="c1"># create time steps</span><span class="w">
</span><span class="n">ts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tmax</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ts</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="p">(</span><span class="n">C</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">lambdas</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">t</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Re drops the imaginary part ... more on that later!</span><span class="w">
</span><span class="nf">Re</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The code for visualizing vector fields for two coupled linear differential equations is given below.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'fields'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_vector_field</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">.50</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">.50</span><span class="p">)</span><span class="w">
</span><span class="n">RJ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w">
</span><span class="n">dRJ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">RJ</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">arrow.plot</span><span class="p">(</span><span class="w">
</span><span class="n">RJ</span><span class="p">,</span><span class="w"> </span><span class="n">dRJ</span><span class="p">,</span><span class="w">
</span><span class="n">arrow.ex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.05</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray82'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">3.9</span><span class="p">,</span><span class="w"> </span><span class="m">-.2</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.2</span><span class="p">,</span><span class="w"> </span><span class="m">3.9</span><span class="p">,</span><span class="w"> </span><span class="s1">'J'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">-4</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Before we visualize the vector field, let me again stress that the solution to a system of two coupled linear differential equations is of the form:</p>
<script type="math/tex; mode=display">\mathbf{x} = \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace .</script>
<p>The eigenvectors coincide with the standard basis vectors when the two differential equations are independent, as was the case above when Romeo and Juliet did not communicate. In such cases, exponential growth or decay is along the standard basis vectors, i.e., the x- and y-axes. For the case we are considering now, this is not true: the eigenvectors differ from the standard basis vectors. It therefore makes sense to visualize the eigenvectors, as they are in some sense more fundamental to the solution. However, we want to retain the interpretability of the standard basis, as this is our reference frame for the initial condition. In the following visualizations, therefore, we add the eigenvectors, which makes apparent exactly in which directions there is exponential growth or decay.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_eigenvectors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="o">$</span><span class="n">vectors</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">add_line</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">solve_linear</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inits</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'red'</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">plot_vector_field</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'The Saddle of Love'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_eigenvectors</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">.75</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">.75</span><span class="p">))</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.7</span><span class="p">)</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'white'</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>The figure above visualizes the resulting vector field, the standard basis (solid lines), the eigenvectors (dashed lines), and four example trajectories (red lines). The eigenvectors define different quadrants than the standard basis. If Romeo and Juliet start in the top right or top left eigenquadrant, then their love grows exponentially. If they start in the bottom left or bottom right eigenquadrant, their hate grows exponentially. Note that we again have a saddle point, as there is exponential decay along one eigenvector and exponential growth along the other; only if Romeo and Juliet’s initial feelings are exactly on the decaying eigenvector do we end up in a state of indifference.</p>
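<p>To make the eigenquadrants concrete, here is a small sketch that extracts the decaying eigendirection of the saddle matrix used above. It relies only on base R; the variable name <code>v_decay</code> is made up for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">A <- cbind(c(-2, -1), c(1, 2))
eig <- eigen(A)
eig$values  # one positive and one negative eigenvalue: a saddle

# Pick the eigenvector that belongs to the negative eigenvalue
v_decay <- eig$vectors[, which.min(eig$values)]

# Slope of the decaying eigendirection in the (R, J) plane
v_decay[2] / v_decay[1]

# On this eigenvector, A only rescales the state by the negative eigenvalue
A %*% v_decay / v_decay</code></pre></figure>
<p>The slope works out to $2 - \sqrt{3} \approx 0.27$, so at $R = 3.5$ the decaying eigenvector passes through $J \approx 0.94$. The starting points $(3.5, 1)$ and $(3.5, 0.75)$ therefore lie on opposite sides of it, which is why their trajectories end up in opposite directions.</p>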
<!-- An interesting case is if Juliet starts out positive while Romeo has initial feelings of hate, but not too much so that they are in the bottom left eigenquadrant, their love grows eternally. This makes sense: Romeo downweights his own feelings ($b = -2$) and is positively influenced by Juliet's love ($d = 1$). -->
<!-- On the other hand, if Juliet starts out negative and Romeo starts with love, they increasingly hate each other. This is again reasonable, as Romeo downplays his own positive feelings and "takes over" Juliet's negative ones. So, too, is their fate when both start out with hate. -->
<p>In the next section, we go beyond the saddle of love and study what different matrices $A$ imply for the stability landscape of love affairs.</p>
<h1 id="a-classification-of-linear-systems">A classification of linear systems</h1>
<p>All we need to know to classify the relationship between Romeo and Juliet is the trace $\tau = a + d$ and the determinant $\Delta = ad - bc$ of the matrix $A$. We can rewrite these in terms of eigenvalues:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda_1 + \lambda_2 &= \frac{1}{2}\left(\tau + \sqrt{\tau^2 - 4\Delta}\right) + \frac{1}{2}\left(\tau - \sqrt{\tau^2 - 4\Delta}\right) = \tau \\[.5em]
\lambda_1 \lambda_2 &= \frac{1}{2}\left(\tau + \sqrt{\tau^2 - 4\Delta}\right)\frac{1}{2}\left(\tau - \sqrt{\tau^2 - 4\Delta}\right) \\[.5em]
&= \frac{1}{4} \left(\tau^2 - \tau^2 + 4\Delta\right) \\[.5em]
&= \Delta \enspace ,
\end{aligned} %]]></script>
<p>which means that we can characterize a linear system solely by its eigenvalues. If $\lambda_1 < 0$ we have exponential decay and if $\lambda_1 > 0$ we have exponential growth in the direction of the first eigenvector, $\mathbf{v}_1$. The same holds for $\lambda_2$ and $\mathbf{v}_2$.</p>
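<p>As a quick numerical check of these identities, the following sketch computes the eigenvalues from the trace and determinant alone and compares their sum and product with $\tau$ and $\Delta$. The helper name <code>eigenvalues_from_trace_det</code> is hypothetical and not part of the code used elsewhere in this post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Eigenvalues of a 2x2 matrix from its trace and determinant
eigenvalues_from_trace_det <- function(A) {
  tau <- sum(diag(A))  # trace: a + d
  Delta <- det(A)      # determinant: ad - bc
  # Convert to complex so that sqrt() also works for a negative discriminant
  disc <- as.complex(tau^2 - 4 * Delta)
  c((tau + sqrt(disc)) / 2, (tau - sqrt(disc)) / 2)
}

A <- cbind(c(-2, -1), c(1, 2))  # the saddle of love from above
lambda <- eigenvalues_from_trace_det(A)

Re(sum(lambda))   # matches the trace, sum(diag(A))
Re(prod(lambda))  # matches the determinant, det(A)</code></pre></figure>
<p>The result can also be compared against <code>eigen(A)$values</code>, which should agree up to ordering and floating point error.</p>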
<h2 id="keepin-it-real-attracting-and-repelling-nodes">Keepin’ it real: Attracting and repelling nodes</h2>
<blockquote>
No it ain't no use in callin' out my name, gal <br />
Like you never done before <br />
And it ain't no use in callin' out my name, gal <br />
I can't hear ya any more.
</blockquote>
<p>If $\tau^2 - 4\Delta > 0$, both eigenvalues are real. If both are negative, then the origin is an attracting fixed point; if they are positive, the origin is a repelling fixed point. As an example, take this matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
A_{1} = \begin{pmatrix} -1 & 0.50 \\ 1 & -1\end{pmatrix} \enspace , %]]></script>
<p>which means that Romeo downplays his feelings as strongly as Juliet, but is influenced only half as strongly by Juliet’s feelings as Juliet is by his feelings.</p>
<p>The matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
A_{2} = \begin{pmatrix} 1 & 0.50 \\ 0.25 & 0.50 \end{pmatrix} \enspace , %]]></script>
<p>shows that both Romeo and Juliet reinforce each other’s feelings ($b = 0.50$ and $c = 0.25$) as well as their own ($a = 1$ and $d = 0.50$). We know from above that this cannot be mathematically stable!</p>
<p>The figure on the left below shows that indifference is the result of the relationship governed by $A_1$, regardless of the starting point. Nodes generally have a slow and a fast eigendirection; the larger the eigenvalue in absolute value, the stronger the pull in the direction of the corresponding eigenvector. For the stable node on the left, the fast eigendirection is clearly given by the negative eigenvector: all trajectories are strongly pulled in its direction; only gradually are they pulled in the other eigendirection until they end up at the origin.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>The figure on the right shows the relationship governed by $A_2$, which yields a more tumultuous love affair. In particular, Romeo and Juliet always have opposite feelings toward each other that also grow exponentially: Romeo becomes madder and madder in love with Juliet while Juliet becomes more and more hateful towards him, or the reverse — it doesn’t matter how loud one of them calls the other, there will be no positive response. The fast eigendirection is now given by the positive eigenvector; all trajectories initially go up (or down) a bit, before they get pulled heavily in the eigenvector’s direction, moving almost parallel to it.</p>
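<p>We can confirm the sign pattern of the eigenvalues for both matrices directly. This is only a sketch using base R's <code>eigen</code>; the numbers in the comments are rounded.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">A1 <- rbind(c(-1, 0.50), c(1, -1))
A2 <- rbind(c(1, 0.50), c(0.25, 0.50))

# Both eigenvalues negative: the origin attracts (roughly -1.71 and -0.29)
eigen(A1)$values

# Both eigenvalues positive: the origin repels (roughly 1.18 and 0.32)
eigen(A2)$values

# In both cases the discriminant tau^2 - 4 * Delta is positive
sum(diag(A1))^2 - 4 * det(A1)
sum(diag(A2))^2 - 4 * det(A2)</code></pre></figure>
<p>In each case, the eigenvalue with the larger absolute value belongs to the fast eigendirection.</p>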
<p>In both the above cases, the eigenvalues are distinct. This allows one eigendirection to be slow and the other fast. In the next section, we look at what happens when both eigenvalues are equal.</p>
<h2 id="one-dimensional-love">One-dimensional love</h2>
<blockquote>
Ah, now I don't hardly know her <br />
But I think I could love her <br />
Crimson and clover
</blockquote>
<p>If $\tau^2 - 4\Delta = 0$, the matrix $A$ does not have distinct eigenvalues. We can distinguish two cases. First, as in our very first example, we could have:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix} \enspace , %]]></script>
<p>which yields a <em>star node</em>: all directions point either to the origin ($\lambda < 0$) or away from it ($\lambda > 0$). We have visualized this vector field for $\lambda = -1$ and $\lambda = 1$ when Romeo met Juliet, so we do not visualize it here. In this case, $A$ is <em>diagonalizable</em>, that is, we can find matrices $\Lambda$ and $E$ such that:</p>
<script type="math/tex; mode=display">A = E \Lambda E^{-1} \enspace .</script>
<p>To see this in R, assume that $\lambda = -1$. The following code should give us $A$ back.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">check_decomposition</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">eig</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">vectors</span><span class="w">
</span><span class="n">Lambda</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="n">eig</span><span class="o">$</span><span class="n">values</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">Lambda</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="o">-</span><span class="n">diag</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">check_decomposition</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1] [,2]
## [1,] -1 0
## [2,] 0 -1</code></pre></figure>
<p>For $A$ to be diagonalizable, we require that $E$, the matrix of eigenvectors, is invertible. A matrix is invertible if it is <em>full rank</em>, which requires that the eigenvectors be independent, that is, they must span the plane. This brings us to the second case. Assume that:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -1 & -1 \\ 0 & -1 \end{pmatrix} \enspace . %]]></script>
<p>Then the two eigenvalues are again equal, but <em>the eigenvectors are not independent</em>. We can still compute the eigendecomposition in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">A</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">-1</span><span class="w">
</span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## eigen() decomposition
## $values
## [1] -1 -1
##
## $vectors
## [,1] [,2]
## [1,] 1 1.000000e+00
## [2,] 0 2.220446e-16</code></pre></figure>
<p>The only eigenvector is $\mathbf{v}_1 = (1, 0)^T$, even though R tells us that there are two distinct ones due to numerical imprecision. If we were to diagonalize the matrix, we would get an error, since $E$ is <em>singular</em>, that is, not invertible:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">check_decomposition</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Error in solve.default(E): system is computationally singular: reciprocal condition number = 1.11022e-16</code></pre></figure>
<p>We can, however, still visualize the vector field. We now have a <em>degenerate node</em> in which all trajectories are parallel to the eigenvector (which in this case is the x-axis, since $\mathbf{v}_1 = (1, 0)^T$).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-16-1.png" title="plot of chunk unnamed-chunk-16" alt="plot of chunk unnamed-chunk-16" style="display: block; margin: auto;" /></p>
<p>While we can plot the vector field, we cannot use our diagonalization trick to compute a closed-form solution, since we cannot invert $E$. We could use numerical methods to compute trajectories; I will discuss this in more detail in a follow-up post on nonlinear differential equations for which we generally cannot get a closed-form expression. However, we can get such an expression for linear systems even if $A$ is not diagonalizable by using <em>matrix exponentials</em>. Since this would take us a little too far here, I defer this treatment to the <em>Post Scriptum</em>.</p>
<p>In the next two sections, we complete our classification of linear systems by allowing Romeo and Juliet’s love to oscillate.</p>
<h2 id="spiralling-love">Spiralling love</h2>
<blockquote>
Sometimes I feel so happy <br />
Sometimes I feel so sad <br />
Sometimes I feel so happy <br />
But mostly you just make me mad <br />
Baby, you just make me mad
</blockquote>
<p>Observe that:</p>
<script type="math/tex; mode=display">\lambda = \frac{\tau}{2} \pm \frac{\sqrt{\tau^2 - 4\Delta}}{2} \enspace ,</script>
<p>which will be complex if $\tau^2 - 4\Delta < 0$. We rewrite the eigenvalues slightly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda &= \frac{\tau}{2} \pm \frac{\sqrt{\tau^2 - 4\Delta}}{2} \\[.5em]
&= \frac{\tau}{2} \pm \frac{\sqrt{-1}\sqrt{4\Delta - \tau^2}}{2} \\[.5em]
&= \alpha \pm i\omega \enspace ,
\end{aligned} %]]></script>
<p>where $\alpha = \tau / 2$ and $\omega = \sqrt{4\Delta - \tau^2} / 2$. The solution to the system of differential equations is still of the form:</p>
<script type="math/tex; mode=display">\mathbf{x} = \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace .</script>
<p>However, the $\lambda$’s are now complex which results in:</p>
<script type="math/tex; mode=display">e^{\lambda t} = e^{(\alpha \pm i \omega)t} = e^{\alpha t} e^{\pm i\omega t} = e^{\alpha t} \left[\cos(\omega t) \pm i \sin(\omega t) \right] \enspace .</script>
<p>For $\alpha < 0$ and $\omega \neq 0$ we have <em>dampened oscillations</em>: they decay exponentially. For $\alpha > 0$ and $\omega \neq 0$ we have <em>amplifying oscillations</em>: they grow exponentially. To see this visually, let’s take the matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -0.20 & -1 \\ 1 & 0\end{pmatrix} \enspace . %]]></script>
<p>This implies that Romeo dampens his own feelings slightly ($a = -0.20$) and feels more love when Juliet hates him and more hate if Juliet loves him ($b = -1$). On the other hand, Juliet does not listen to her own feelings ($d = 0$) and mimics Romeo’s feelings ($c = 1$). Where does this lead the two love birds?</p>
<p>The figure below on the left visualizes the vector field and one trajectory of love. The figure on the right visualizes Romeo and Juliet’s trajectory separately.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-17-1.png" title="plot of chunk unnamed-chunk-17" alt="plot of chunk unnamed-chunk-17" style="display: block; margin: auto;" /></p>
<p>Although both lovers start at mutual affection, over the course of their relationship, they feel happy, then sad, then happy, then sad, until they don’t feel anymore. If, on the other hand, we change $A$ to</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} 0.10 & -1 \\ 1 & 0\end{pmatrix} \enspace , %]]></script>
<p>we have $\alpha = 0.05$, which is positive. This implies slower growth than we had decay before ($\alpha = -0.10$). If we allow both lovers only an ounce of mutual affection $(0.1, 0.1)$, they will spiral forever, their feelings always growing, always changing.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-18-1.png" title="plot of chunk unnamed-chunk-18" alt="plot of chunk unnamed-chunk-18" style="display: block; margin: auto;" /></p>
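<p>The growth rate $\alpha$ and the angular frequency $\omega$ of both spirals can be read off the trace and determinant. The sketch below does this for the two matrices above; the function name <code>spiral_parameters</code> is made up for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Real part (alpha) and imaginary part (omega) of the complex eigenvalues
spiral_parameters <- function(A) {
  tau <- sum(diag(A))
  Delta <- det(A)
  # The formulas only apply when the eigenvalues are complex
  stopifnot(tau^2 - 4 * Delta < 0)
  c(alpha = tau / 2, omega = sqrt(4 * Delta - tau^2) / 2)
}

A_damped <- rbind(c(-0.20, -1), c(1, 0))
A_amplified <- rbind(c(0.10, -1), c(1, 0))

spiral_parameters(A_damped)     # alpha = -0.10: oscillations decay
spiral_parameters(A_amplified)  # alpha = 0.05: oscillations grow

# eigen() returns the same numbers as a complex conjugate pair
eigen(A_damped)$values</code></pre></figure>
<p>In both cases $\omega$ is close to one, so a full cycle of happy and sad takes roughly $2\pi$ time units.</p>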
<p>I encourage you to play around with the code a bit to get an intuition for these things. In the next section, we look at a special case of this linear system before we wrap up.</p>
<h2 id="the-circle-of-love">The circle of love</h2>
<blockquote>
Oh, so long, Marianne <br />
It's time that we began to laugh <br />
And cry and cry and laugh about it all again. <br />
</blockquote>
<p>An interesting special case of the spiral of love occurs when $\alpha = 0$ such that both eigenvalues are purely imaginary. As an example, let $a = 0$, $b = -1$, $c = 1$, and $d = 0$ such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -J \\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= R \enspace .
\end{aligned} %]]></script>
<p>Romeo and Juliet do not listen to their own feelings anymore, but only to their partner’s feelings. However, they do so in opposite ways. For Romeo, this model implies that when Juliet’s feelings for him are high ($J > 0$), Romeo’s feelings for Juliet <em>decrease</em>. If they are low ($J < 0$), then his feelings <em>increase</em>. For Juliet, it is exactly the opposite: when Romeo’s feelings are strong ($R > 0$), her feelings <em>increase</em>, while when his feelings wane ($R < 0$), her feelings <em>decrease</em>. Is this a (mathematically) stable relationship? To find out, we visualize the vector field below as well as three love trajectories.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-19-1.png" title="plot of chunk unnamed-chunk-19" alt="plot of chunk unnamed-chunk-19" style="display: block; margin: auto;" /></p>
<p>Romeo and Juliet are stuck in a never-ending circle! Regardless of the starting point, they will be prisoners to the Sisyphean circle of love which will make them laugh and cry and cry and laugh about it all again. Except, of course, when they start at the origin $(0, 0)$: if they start with indifference, they will forever stay indifferent. Note that the fixed point is now called a <em>center</em> which is <em>neutrally stable</em>, since nearby trajectories are neither attracted to nor repelled from the fixed point.</p>
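<p>One way to see why the trajectories close in on themselves is to track the squared distance from the origin, $R^2 + J^2$. Using $\frac{\mathrm{d}R}{\mathrm{d}t} = -J$ and $\frac{\mathrm{d}J}{\mathrm{d}t} = R$ from above, its derivative with respect to time vanishes:</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}}{\mathrm{d}t}\left(R^2 + J^2\right) = 2R \frac{\mathrm{d}R}{\mathrm{d}t} + 2J \frac{\mathrm{d}J}{\mathrm{d}t} = 2R(-J) + 2JR = 0 \enspace ,</script>
<p>so the distance from the origin never changes, and each trajectory stays on the circle fixed by its initial condition.</p>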
<p>We have started and ended our journey of relationships with two extremes: ignoring the other’s feelings and ignoring one’s own. Both are unhealthy. <em>Communication is key</em>. In the next section, we recap the types of linear systems we have seen in this blog post.</p>
<h1 id="classification-recap">Classification recap</h1>
<blockquote>
She took off a silver locket <br />
She said remember me by this <br />
She put her hand in my pocket <br />
I got a keepsake and a kiss
</blockquote>
<p>The figure below summarizes the classification of linear systems we have, step by step, developed in this blog post (see also Strogatz, 2015, p. 140).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto auto auto 0;" /></p>
<p>If both $\tau = 0$ and $\Delta = 0$, the eigenvalues are zero; in the simplest case, $A = 0$, the solution is a constant: Romeo and Juliet’s feelings will forever stay wherever they started, and we have a plane of fixed points. If $\Delta = 0$ but $\tau \neq 0$, one eigenvalue is zero and the other equals $\tau$: feelings stay constant along one eigendirection and exponentially grow or decay along the other, and we have a line of fixed points.</p>
<p>Saddle points occur when $\Delta < 0$, regardless of $\tau$, which implies that one eigenvalue is positive and the other is negative, that is, we have exponential growth in one eigendirection and exponential decay in the other; the fixed point $(0, 0)$ is generally unstable, except when the initial condition is exactly on the eigenvector along which there is exponential decay.</p>
<p>If $\tau = 0$ and $\Delta > 0$, all eigenvalues are purely imaginary, resulting in a <em>center</em>, the circle of love. These become <em>spirals</em> if $\tau \neq 0$ and $\tau^2 - 4\Delta < 0$, since the eigenvalues now have a real part, which results in amplifying oscillations ($\tau > 0$) or dampened oscillations ($\tau < 0$).</p>
<p>On the parabola described by $\tau^2 - 4\Delta = 0$ we have repeated eigenvalues. If the resulting eigenvectors are independent, we have a <em>star node</em> in which all directions either point towards the origin ($\lambda < 0$) or away from it ($\lambda > 0$).</p>
<p>If the resulting eigenvectors are not independent, we have a <em>degenerate node</em>; we cannot invert the matrix of eigenvectors anymore and thus need to use other methods. One such method is provided by matrix exponentials — see the <em>Post Scriptum</em>.</p>
<p>On the side of the parabola where $\tau^2 - 4\Delta > 0$ (and with $\Delta > 0$), we have either <em>stable nodes</em> for $\tau < 0$ or <em>unstable nodes</em> for $\tau > 0$.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
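<p>The classification can be sketched as a small R function that looks only at $\tau$, $\Delta$, and, for repeated eigenvalues, at whether $A$ is a multiple of the identity. The function name <code>classify_linear_system</code>, the tolerance <code>tol</code>, and the labels it returns are assumptions made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">classify_linear_system <- function(A, tol = 1e-8) {
  tau <- sum(diag(A))
  Delta <- det(A)
  disc <- tau^2 - 4 * Delta
  stability <- if (tau < 0) 'stable' else 'unstable'

  # Delta = 0: at least one zero eigenvalue (line or plane of fixed points)
  if (abs(Delta) < tol) return('non-isolated fixed points')
  if (Delta < 0) return('saddle')
  if (abs(tau) < tol) return('center')
  if (disc < -tol) return(paste(stability, 'spiral'))
  if (disc > tol) return(paste(stability, 'node'))

  # Repeated eigenvalue: a 2x2 matrix is then diagonalizable only if
  # it is a multiple of the identity matrix
  lambda <- tau / 2
  if (max(abs(A - lambda * diag(2))) < tol) {
    paste(stability, 'star node')
  } else {
    paste(stability, 'degenerate node')
  }
}

classify_linear_system(cbind(c(-2, -1), c(1, 2)))     # 'saddle'
classify_linear_system(rbind(c(0, -1), c(1, 0)))      # 'center'
classify_linear_system(rbind(c(-0.20, -1), c(1, 0)))  # 'stable spiral'
classify_linear_system(rbind(c(-1, -1), c(0, -1)))    # 'stable degenerate node'</code></pre></figure>
<p>Comparing floating point numbers against a tolerance is a pragmatic choice here; matrices that sit very close to a boundary of the diagram may be labelled differently depending on <code>tol</code>.</p>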
<h1 id="conclusion">Conclusion</h1>
<blockquote>
When you can fall for chains of silver, you can fall for chains of gold <br />
You can fall for pretty strangers and the promises they hold <br />
You promised me everything, you promised me thick and thin, yeah <br />
Now you just say "Oh, Romeo, yeah, you know I used to have a scene with him"
</blockquote>
<p>In this blog post, we have seen that linear differential equations are a powerful tool to model how systems change over time in general, and how the love affair between two lovebirds can evolve in particular. We have started out with an isolated Romeo whose feelings either exponentially grow or decay. Romeo then met Juliet, and we have extended the single differential equation to a system of two equations to accommodate this life event.</p>
<p>Love affairs can take many shapes and forms. We have classified those depending on their stability landscape, and seen that linear differential equations can be solved in closed-form by using eigenvectors and eigenvalues or matrix exponentials. In a follow-up blog post, Romeo and Juliet’s love will overcome the shackles of linearity, and we end up with nonlinear differential equations. This will make for more intriguing relationships. We will also add a third lover and study how the dynamics change — it might get chaotic!</p>
<hr />
<p>I would like to thank <a href="https://ryanoisin.github.io/">Oisín Ryan</a> for discussion as well as extensive and very helpful comments on this blog post.</p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="solving-differential-equations-using-matrix-exponentials">Solving differential equations using matrix exponentials</h3>
<!-- <blockquote> -->
<!-- There must be some kind of way outta here <br> -->
<!-- Said the joker to the thief <br> -->
<!-- There's too much confusion <br> -->
<!-- I can't get no relief -->
<!-- </blockquote> -->
<p>Recall that the solution to the single linear differential equation $\frac{\mathrm{d}x}{\mathrm{d}t} = ax$ is:</p>
<script type="math/tex; mode=display">x(t) = x_0 e^{at} \enspace .</script>
<p>The series expansion of $e^{at}$ is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{at} &= 1 + at + \frac{(at)^2}{2!} + \frac{(at)^3}{3!} + \ldots \\[.5em]
&= \sum_{k=0}^{\infty} \frac{(at)^k}{k!} \enspace .
\end{aligned} %]]></script>
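<p>To get a feeling for how quickly this series converges, we can compare a few partial sums with R's built-in <em>exp</em>. This is a small sketch; the helper <em>exp_series</em> and the values of $a$ and $t$ are chosen here purely for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># partial sum of the series for e^{at}, keeping the terms k = 0, ..., K
exp_series <- function(a, t, K = 20) {
  k <- 0:K
  sum((a * t)^k / factorial(k))
}

a <- -1
t <- 2
sapply(c(2, 5, 10, 20), function(K) exp_series(a, t, K))  # partial sums
exp(a * t)                                                # reference value</code></pre></figure>
<p>With only three terms the partial sum is still far off, but it settles on the reference value once $K$ is comfortably larger than $|at|$.</p>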
<p>The idea is to generalize this to allow for a matrix in the exponent. In particular, analogously to the one-dimensional case, we want the system</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = A\mathbf{x} \enspace ,</script>
<p>to have solutions of the form:</p>
<script type="math/tex; mode=display">\mathbf{x}(t) = \mathbf{x}_0e^{At} \enspace .</script>
<p>First, we generalize the series expansion of the exponential function to matrices:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{At} &= I + At + \frac{(At)^2}{2!} + \frac{(At)^3}{3!} + \ldots \\[.5em]
&= \sum_{k=0}^{\infty} \frac{t^k}{k!} A^k \enspace ,
\end{aligned} %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A^0 &= I \\[.5em]
A^k &= \underbrace{A \cdot A \cdot \ldots \cdot A}_{k \text{ times}} \enspace .
\end{aligned} %]]></script>
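<p>This definition translates almost literally into code. The sketch below truncates the series after $K$ terms; the function name <em>expm_series</em>, the default $K = 30$, and the test matrix are assumptions made for illustration. For serious use, a dedicated routine such as <em>expm</em> from the <em>expm</em> package is the safer choice, because truncated series can be inaccurate for matrices with large entries.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">expm_series <- function(A, t = 1, K = 30) {
  term <- diag(nrow(A))   # k = 0 term: (At)^0 / 0! = I
  total <- term

  for (k in seq_len(K)) {
    # build (At)^k / k! from the previous term (At)^(k-1) / (k-1)!
    term <- term %*% (A * t) / k
    total <- total + term
  }

  total
}

# this matrix generates rotations: e^{Jt} rotates the plane by the angle t
J <- matrix(c(0, 1, -1, 0), nrow = 2)
round(expm_series(J, t = pi / 2), 10)   # rotation by 90 degrees</code></pre></figure>
<p>Updating each term from the previous one avoids computing matrix powers and factorials from scratch in every iteration.</p>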
<p>With this definition, we assume that $\mathbf{x} = \mathbf{x}_0 e^{At}$ and check whether it is true that:</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = \frac{\mathrm{d}\mathbf{x}_0e^{At}}{\mathrm{d}t} = A \mathbf{x} = A \mathbf{x}_0 e^{At} \enspace .</script>
<p>Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}\mathbf{x}_0e^{At}}{\mathrm{d}t} &= \mathbf{x}_0 \left(0 + A + \frac{2A^2 t}{2!} + \frac{3A^3t^2}{3!} + \ldots\right) \\[.5em]
&= \mathbf{x}_0 \left(A + A^2t + \frac{A^3t^2}{2!} + \ldots\right) \\[.5em]
&= \mathbf{x}_0 A\left(I + At + \frac{A^2t^2}{2!} + \ldots \right) \\[.5em]
&= \mathbf{x}_0 A e^{At} \\[.5em]
&= A \mathbf{x}_0 e^{At} \\[.5em]
&= A \mathbf{x} \enspace ,
\end{aligned} %]]></script>
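<p>which shows that, indeed, $\mathbf{x}(t) = \mathbf{x}_0 e^{At}$ solves the system of linear differential equations! Note that, since $\mathbf{x}_0$ is a column vector, the products above are to be read as $e^{At}\mathbf{x}_0$ and $A e^{At}\mathbf{x}_0$; the scalar-style ordering merely mirrors the one-dimensional case.</p>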
<!-- Why do we care? Our motivating example was that we cannot use the eigen decomposition to solve a system of linear differential equations when the eigenvectors are not independent, since the resulting matrix is not invertible. Using the matrix exponential, however, there is no mention of eigenvectors. -->
<p>The matrix exponential solution <em>generalizes</em> the solution using eigendecomposition to non-diagonalizable matrices $A$. For a diagonalizable matrix $A$, we can connect the approach of using the <a href="https://en.wikipedia.org/wiki/Matrix_exponential">matrix exponential</a> to solve a system of linear differential equations to the eigendecomposition approach we have discussed above. Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{A} &= E e^{\Lambda} E^{-1} \\[.5em]
&= E \begin{pmatrix} e^{\lambda_1} & 0 \\ 0 & e^{\lambda_2}\end{pmatrix} E^{-1} \enspace ,
\end{aligned} %]]></script>
<p>where the first equality holds because $A = E \Lambda E^{-1}$ implies $A^k = E \Lambda^k E^{-1}$ for every term of the series, and the second because the matrix exponential of a diagonal matrix is given by simply exponentiating each diagonal element. This is then the solution in the eigenbasis, which we transform back by multiplying with $E$, as we have done earlier. For diagonalizable matrices, this is a very convenient way of computing the matrix exponential. For non-diagonalizable matrices, this is not possible and one needs to rely on other ways of computing the matrix exponential (see Moler & Van Loan, 2003).</p>
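<p>For a diagonalizable matrix, the eigendecomposition route takes only a few lines using base R's <em>eigen</em>. The sketch below is illustrative: the function name <em>expm_eigen</em> and the example matrix are assumptions, chosen because $A^2 = I$ gives the closed form $e^{A} = \cosh(1) I + \sinh(1) A$ to compare against.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">expm_eigen <- function(A, t = 1) {
  eig <- eigen(A)
  E <- eig$vectors   # columns are the eigenvectors

  # exponentiate the eigenvalues, then transform back from the eigenbasis
  E %*% diag(exp(eig$values * t)) %*% solve(E)
}

A <- matrix(c(0, 1, 1, 0), nrow = 2)
expm_eigen(A)

# closed form for this particular matrix
matrix(c(cosh(1), sinh(1), sinh(1), cosh(1)), nrow = 2)</code></pre></figure>
<p>The call to <em>solve</em> is where this approach breaks down for degenerate nodes: if the eigenvectors are not independent, $E$ is singular (or numerically close to it) and cannot be inverted reliably.</p>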
<p>To return to our initial problem: we want an expression for the solution of the system described by:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -1 & -1 \\ 0 & -1 \end{pmatrix} \enspace , %]]></script>
<p>in order to easily compute the trajectory of Romeo and Juliet’s feelings. Assuming that $\mathbf{x}_0 = (1, 1)$, the solution to the system is:</p>
<script type="math/tex; mode=display">\mathbf{x}(t) = \begin{pmatrix} 1 \\ 1\end{pmatrix} e^{At} \enspace ,</script>
<p>which we can implement straightforwardly in R.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'expm'</span><span class="p">)</span><span class="w">
</span><span class="n">solve_linear2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">tmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># create time steps</span><span class="w">
</span><span class="n">ts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tmax</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ts</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expm</span><span class="p">(</span><span class="n">A</span><span class="o">*</span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">inits</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">x</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below visualizes a few trajectories of this system that were hitherto uncomputable using the eigendecomposition.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-22-1.png" title="plot of chunk unnamed-chunk-22" alt="plot of chunk unnamed-chunk-22" style="display: block; margin: auto;" /></p>
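<p>For this particular matrix, the trajectory can also be worked out by hand, which gives a check on the numerical solution. A sketch, writing $A = -I + N$ and reading the product as $e^{At}\mathbf{x}_0$ (the symbol $N$ is introduced here for illustration):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
N &= \begin{pmatrix} 0 & -1 \\ 0 & 0 \end{pmatrix} , \quad N^2 = 0 \\[.5em]
e^{At} &= e^{-t} e^{Nt} = e^{-t} \left(I + Nt\right) = e^{-t} \begin{pmatrix} 1 & -t \\ 0 & 1 \end{pmatrix} \\[.5em]
\mathbf{x}(t) &= e^{At} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = e^{-t} \begin{pmatrix} 1 - t \\ 1 \end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>The series for $e^{Nt}$ stops after the linear term because $N^2 = 0$, and the factorization $e^{At} = e^{-t} e^{Nt}$ is valid because $-I$ commutes with $N$. The term $t e^{-t}$ in the first component is the signature of a degenerate node: it cannot be written as a combination of pure exponentials.</p>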
<!-- [Strogatz mentions](https://youtu.be/QrHRaA93Nrg?list=PLbN57C5Zdl6j_qJA-pARJnKsmROzPnO9V&t=4404) that such degenerate nodes are rather unlikely in the real world, and that fits with our story since $d = 0$ implies that Juliet does not listen to her heart, which contradicts our assumption that she has matured as a lover. -->
<hr />
<h2 id="references">References</h2>
<ul>
<li>Strogatz, S. H. (<a href="https://www.tandfonline.com/doi/abs/10.1080/0025570X.1988.11977342">1988</a>). Love affairs and differential equations. <em>Mathematics Magazine, 61</em>(1), 35.</li>
<li>Strogatz, S. H. (<a href="http://www.stevenstrogatz.com/books/nonlinear-dynamics-and-chaos-with-applications-to-physics-biology-chemistry-and-engineering">2015</a>). Nonlinear Dynamics and Chaos: With applications to Physics, Biology, Chemistry, and Engineering. Colorado, US: Westview Press.</li>
<li>Nonlinear Dynamics and Chaos Lectures by Steven Strogatz, especially <a href="https://www.youtube.com/watch?v=QrHRaA93Nrg&list=PLbN57C5Zdl6j_qJA-pARJnKsmROzPnO9V&index=5">Lecture 5</a>.</li>
<li>Ryan, O., Kuiper, R. M., & Hamaker, E. L. (<a href="https://link.springer.com/chapter/10.1007/978-3-319-77219-6_2">2018</a>). A continuous time approach to intensive longitudinal data: What, Why and How? In K. v. Montfort, J. H. L. Oud, & M. C. Voelkle (Eds.), <em>Continuous time modeling in the behavioral and related sciences</em>. New York: Springer.</li>
<li>Moler, C., & Van Loan, C. (<a href="https://epubs.siam.org/doi/abs/10.1137/S00361445024180?casa_token=ROT7WzzdP14AAAAA:qedJ1cEiWWcPbjq42eSdeKk7LhoAcJYx4eahw3txUDckZS0QCOJhCXaH2nSsuBViH_i8YwBwxQ">2003</a>). Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. <em>SIAM Review, 45</em>(1), 3-49.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We have used the love affair between Romeo and Juliet to motivate the classification of a system of two linear differential equations. This was the main goal of the blog post. With this classification in mind, however, one could now study love affairs from a more “substantive” point of view; see Strogatz (1988) and Strogatz (2015, p. 143). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
The Fibonacci sequence and linear algebra 2019-07-28T13:30:00+00:00 https://fabiandablander.com/r/Fibonacci
<p>Leonardo Bonacci, better known as Fibonacci, has influenced our lives profoundly. At the beginning of the $13^{th}$ century, he introduced the Hindu-Arabic numeral system to Europe. Instead of the Roman numbers, where <strong>I</strong> stands for one, <strong>V</strong> for five, <strong>X</strong> for ten, and so on, the Hindu-Arabic numeral system uses position to index magnitude. This leads to much shorter expressions for large numbers.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>While the history of the <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">numeral system</a> is fascinating, this blog post will look at what Fibonacci is arguably best known for: the <em>Fibonacci sequence</em>. In particular, we will use ideas from linear algebra to come up with a closed-form expression of the $n^{th}$ Fibonacci number<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. On our journey to get there, we will also gain some insights about recursion in R.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h1 id="the-rabbit-puzzle">The rabbit puzzle</h1>
<p>In Liber Abaci, Fibonacci poses the following question (paraphrasing):</p>
<blockquote>
<p>Suppose we have two newly-born rabbits, one female and one male. Suppose these rabbits produce another pair of female and male rabbits after one month. These newly-born rabbits will, in turn, also mate after one month, producing another pair, and so on. Rabbits never die. How many pairs of rabbits exist after one year?</p>
</blockquote>
<p>The Figure below illustrates this process. Every point denotes one rabbit pair over time. To indicate that every newborn rabbit pair needs to wait one month before producing new rabbits, rabbits that are not fertile yet are coloured in grey, while rabbits ready to procreate are coloured in red.</p>
<div style="text-align:center;">
<img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" />
</div>
<p>We can derive a linear recurrence relation that describes the Fibonacci sequence. In particular, note that rabbits never die. Thus, at time point $n$, all rabbit pairs from time point $n - 1$ carry over. Additionally, we know that every fertile rabbit pair will produce a new rabbit pair. However, newborn pairs have to wait one month, so that the number of fertile rabbit pairs at time point $n$ equals the total number of rabbit pairs at time point $n - 2$. As a result, the Fibonacci sequence {$F_n$}$_{n=1}^{\infty}$ satisfies:</p>
<script type="math/tex; mode=display">F_n = F_{n-1} + F_{n-2} \enspace ,</script>
<p>for $n \geq 3$ and $F_1 = F_2 = 1$. Before we derive a closed-form expression that computes the $n^{th}$ Fibonacci number directly, in the next section, we play around with alternative, more straightforward solutions in R.</p>
<h1 id="implementation-in-r">Implementation in R</h1>
<p>We can write a wholly inefficient, but beautiful program to compute the $n^{th}$ Fibonacci number:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">))</span></code></pre></figure>
<p>R takes roughly 5 seconds to compute the $30^{\text{th}}$ Fibonacci number; computing the $40^{\text{th}}$ number exhausts my patience. This recursive solution is not particularly efficient because R executes the function an unnecessary number of times. For example, the call tree for <em>fib(5)</em> is:</p>
<ul>
<li><em>fib(5)</em></li>
<li><em>fib(4)</em> + <em>fib(3)</em></li>
<li>(<em>fib(3)</em> + <em>fib(2)</em>) + (<em>fib(2)</em> + <em>fib(1)</em>)</li>
<li>((<em>fib(2)</em> + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
<li>(((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
</ul>
<p>which shows that <em>fib(2)</em> was called three times. This is not necessary, as we can store the outcome of this function call instead of recomputing it every time. This technique is called <a href="https://en.wikipedia.org/wiki/Memoization">memoization</a> (see also the R package <a href="https://github.com/r-lib/memoise">memoise</a>). Implementing this leads to:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_mem</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">inside</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">cache</span><span class="p">))</span><span class="w">
</span><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">inside</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w"> </span><span class="o"><<-</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>This computes the $1000^{th}$ Fibonacci number in a tenth of a second. We can, of course, write this sequentially, and also store all intermediate Fibonacci numbers. This also avoids memory issues brought about by the recursive implementation. Interestingly, although the sequential algorithm seems like it should be $O(n)$, it is actually $O(n^2)$ when exact, arbitrary-precision integers are used, since we are adding increasingly large numbers (for more on this, see <a href="https://catonmat.net/linear-time-fibonacci">here</a>).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
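</span><span class="n">num</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="c1"># with fewer than three terms there is nothing to add; seq(3, n) would count downwards</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="nf">return</span><span class="p">(</span><span class="n">num</span><span class="p">)</span><span class="w">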
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-2</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">num</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The first 30 Fibonacci numbers are: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040.</p>
<p>This is a rapid increase, as made apparent by the left Figure below. The Figure on the right shows that there is structure in how the sequence grows.</p>
<p><img src="/assets/img/2019-07-28-Fibonacci.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>We will return to the structure in growth at the end of the blog post. First, we need to derive a closed-form expression of the $n^{th}$ Fibonacci number. In the next section, we take a step towards that by realizing that diagonal matrices make for easier computations.</p>
<h1 id="diagonal-matrices-are-good">Diagonal matrices are good</h1>
<p>Our goal is to get a closed-form expression of the $n^{th}$ Fibonacci number. The first thing to note is that, because the recurrence is linear, we can view computing the Fibonacci numbers as repeatedly applying a linear map. In particular, define $T \in \mathcal{L}(\mathbb{R}^2)$ by:</p>
<script type="math/tex; mode=display">T(x, y) = (y, x + y) \enspace .</script>
<p>We note that:</p>
<script type="math/tex; mode=display">T^n(0, 1) = (F_n, F_{n+1}) \enspace ,</script>
<p>which we will prove by induction. In particular, note that the base case $n = 1$:</p>
<script type="math/tex; mode=display">T^1(0, 1) = (1, 0 + 1) = (1, 1) = (F_1, F_2) \enspace ,</script>
<p>does in fact give the first two Fibonacci numbers. Now for the induction step: we assume that this holds for an arbitrary $n$, and we show that it holds for $n + 1$ using the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T^n(0, 1) &= (F_n, F_{n+1}) \\[1em]
T(T^n(0, 1)) &= T(F_n, F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_n + F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_{n+2}) \enspace .
\end{aligned} %]]></script>
<p>The last equality follows from the definition of the Fibonacci sequence, i.e., the fact that any number is equal to the sum of the previous two numbers. The matrix of this linear map with respect to the standard basis is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
A \equiv \mathcal{M}(T) = \begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \enspace , %]]></script>
<p>since $T(1, 0) = (0, 1)$ and $T(0, 1) = (1, 1)$. Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y \\ x + y \end{pmatrix} \enspace . %]]></script>
<p>In the sequential R code for computing the Fibonacci numbers, we have applied the linear map $n$ times, which gave us the Fibonacci number we were interested in. We can write this in matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
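<p>The matrix form can be checked directly in R by applying $A$ repeatedly to the starting vector $(0, 1)$. This is a small sketch; the function name <em>fib_matrix</em> is a hypothetical choice.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">A <- matrix(c(0, 1, 1, 1), nrow = 2)

fib_matrix <- function(n) {
  v <- c(0, 1)
  for (i in seq_len(n)) v <- A %*% v   # apply the linear map n times
  as.vector(v)                         # (F_n, F_{n+1})
}

fib_matrix(10)   # 55 89</code></pre></figure>
<p>Each pass through the loop costs one matrix-vector product, so this is the sequential algorithm from above in different clothing.</p>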
<p>If you were to compute, say, the $3^{rd}$ Fibonacci number using this equation, you would have to compute $A^3$, that is, multiply three copies of $A$ together. Now assume you had something like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Using the above equation, the matrix powers would become trivial:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n = \begin{pmatrix} \lambda_1^n & 0 \\ 0 & \lambda_2^n \end{pmatrix} \enspace . %]]></script>
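<p>A quick numerical illustration of this shortcut, with the made-up values $\lambda_1 = 2$, $\lambda_2 = 3$, and $n = 5$ standing in for the quantities we have yet to find:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">D <- diag(c(2, 3))
n <- 5

# the long way: multiply n copies of D together
repeated <- Reduce(`%*%`, replicate(n, D, simplify = FALSE))

# the shortcut: raise each diagonal element to the n-th power
direct <- diag(c(2, 3)^n)

all.equal(repeated, direct)   # TRUE</code></pre></figure>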
<p>There would be no need to repeatedly engage in matrix multiplication; instead, we would arrive at the $n^{th}$ Fibonacci number using only scalar multiplication! Our task is thus as follows: find a new matrix for the linear map which is diagonal. To solve this, we will need eigenvalues and eigenvectors.</p>
<h1 id="finding-eigenvalues-and-eigenvectors">Finding eigenvalues and eigenvectors</h1>
<p>An eigenvector-eigenvalue pair $(v, \lambda)$ satisfies for $v \neq 0$ that:</p>
<script type="math/tex; mode=display">Tv = \lambda v \enspace ,</script>
<p>which means that for a particular vector $v$, the linear map only stretches the vector by a constant $\lambda$. Here’s the key: using the eigenvectors as basis, the matrix of the linear map is diagonal. This is because the matrix of our linear map, $A$, is defined by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= A_{11} v_1 + A_{21} v_2 \\
Tv_2 &= A_{12} v_1 + A_{22} v_2 \enspace .
\end{aligned} %]]></script>
<p>Now since the basis consists only of eigenvectors, we know that $Tv_1 = \lambda_1 v_1$ and $Tv_2 = \lambda_2 v_2$, which implies that $A_{11} = \lambda_1$ and $A_{21} = 0$, as well as $A_{12} = 0$ and $A_{22} = \lambda_2$. For a wonderful explanation of eigenvalues and eigenvectors, see <a href="https://www.youtube.com/watch?v=PFDu9oVAE-g">this video</a> by 3Blue1Brown.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
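<p>Before finding the eigenvalues and eigenvectors by hand, base R's <em>eigen</em> can illustrate the claim numerically. In the sketch below, <em>A</em> is the matrix of $T$ with respect to the standard basis from the previous section, and changing to the eigenbasis should leave only the eigenvalues on the diagonal. The variable names are illustrative.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">A <- matrix(c(0, 1, 1, 1), nrow = 2)

eig <- eigen(A)
E <- eig$vectors            # columns are (normalized) eigenvectors
D <- solve(E) %*% A %*% E   # matrix of T with respect to the eigenbasis

round(D, 10)                # off-diagonal entries vanish
eig$values                  # the diagonal entries of D</code></pre></figure>
<p>Note that <em>eigen</em> returns eigenvectors scaled to unit length, so they can differ from hand-derived eigenvectors by a constant factor; the resulting diagonal matrix is the same.</p>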
<p>In order to find the eigenvalues and eigenvectors, note that the linear map satisfies the following two equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T(x, y) &= \lambda (x, y) \\[1em]
T(x, y) &= (y, x + y) \enspace .
\end{aligned} %]]></script>
<p>This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= y \\[1em]
\lambda y &= x + y \enspace .
\end{aligned} %]]></script>
<p>We substitute the first expression into the second one, yielding:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 x &= x + y \\[1em]
(\lambda^2 - 1)x &= y \enspace ,
\end{aligned} %]]></script>
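<p>Substituting this into the first equation and dividing by $x \neq 0$ (if $x = 0$, then $y = \lambda x = 0$ as well, and the zero vector is not an eigenvector) results in:</p>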
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= (\lambda^2 - 1)x\\[1em]
0 &= \lambda^2 - \lambda - 1\enspace .
\end{aligned} %]]></script>
<p>We can now apply the <em>quadratic formula</em> or “Mitternachtsformel”, as it is called in parts of Germany because students should know the formula when they are roused from sleep at midnight. We are neither in Germany, nor is it midnight, nor can I actually remember the formula, so let’s quickly derive it for our problem:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 - \lambda - 1 &= 0 \\[1em]
\lambda^2 - \lambda &= 1 \\[1em]
4\lambda^2 - 4\lambda &= 4 \\[1em]
4\lambda^2 - 4\lambda + 1&= 4 + 1 \\[1em]
(2\lambda - 1)^2&= 4 + 1 \\[1em]
2\lambda - 1 &= \pm \sqrt{4 + 1} \\[1em]
\lambda &= \frac{1 \pm \sqrt{5}}{2} \enspace .
\end{aligned} %]]></script>
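<p>As a quick numerical cross-check of this derivation, the following sketch compares the roots of $\lambda^2 - \lambda - 1$ with the eigenvalues that R computes for the matrix of the linear map in the standard basis. The names <em>A</em>, <em>roots</em>, and <em>evals</em> are chosen here for illustration only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Matrix of the map T(x, y) = (y, x + y) in the standard basis
A <- matrix(c(0, 1, 1, 1), nrow = 2, byrow = TRUE)

# polyroot() expects coefficients in increasing order: -1 - lambda + lambda^2
roots <- sort(Re(polyroot(c(-1, -1, 1))))

# Eigenvalues computed numerically, sorted for comparison
evals <- sort(eigen(A)$values)

# Both should agree with (1 -/+ sqrt(5)) / 2 up to floating-point error
all.equal(roots, evals)
all.equal(roots, c((1 - sqrt(5)) / 2, (1 + sqrt(5)) / 2))</code></pre></figure>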
<p>Now that we have found both eigenvalues, we go hunting for the eigenvectors! We put the eigenvalue into the equations from above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{1 \pm \sqrt{5}}{2} x &= y \\[1em]
\frac{1 \pm \sqrt{5}}{2} y &= x + y \enspace .
\end{aligned} %]]></script>
<p>If we set $x = 1$, then $y = \frac{1 \pm \sqrt{5}}{2}$. Thus, two eigenvectors are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
v_1 &= \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
v_2 &= \left(1, \frac{1 - \sqrt{5}}{2}\right) \enspace .
\end{aligned} %]]></script>
<p>As a sanity check to see whether this is indeed true, we check whether $Tv_1 = \lambda_1 v_1$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2}\right) \\[1em]
\lambda_1 v_1 &= \frac{1 + \sqrt{5}}{2} \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \left(\frac{1 + \sqrt{5}}{2}\right)^2\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{1 + 2\sqrt{5} + 5}{4}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{3}{2} + \frac{\sqrt{5}}{2} \right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2} \right) \enspace ,
\end{aligned} %]]></script>
<p>which shows that the two expressions are equal. Moreover, the dot product of the two eigenvectors is zero, which means that the two eigenvectors are orthogonal and hence linearly independent (as they should be). In the next section, we will find that <a href="https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation">the same territory can be described by different maps</a>.</p>
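<p>The same sanity check can be run numerically for both eigenvectors. This is a minimal sketch with illustrative variable names; comparisons use <em>all.equal</em> because floating-point results are only approximately equal.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
A <- matrix(c(0, 1, 1, 1), nrow = 2, byrow = TRUE)  # T in the standard basis
v1 <- c(1, lambda1)
v2 <- c(1, lambda2)

# T v = lambda v should hold for both pairs
all.equal(drop(A %*% v1), lambda1 * v1)
all.equal(drop(A %*% v2), lambda2 * v2)

# The dot product is 1 + lambda1 * lambda2 = 1 - 1 = 0
all.equal(0, sum(v1 * v2))</code></pre></figure>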
<h1 id="change-of-basis">Change of basis</h1>
<p>Now that we have found the eigenvalues and eigenvectors, we can create the matrix $D$ of the linear map $T$ which is diagonal with respect to the basis of eigenvectors:</p>
<script type="math/tex; mode=display">% <![CDATA[
D = \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace . %]]></script>
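<p>One way to convince ourselves that $D$ really is the matrix of $T$ in the eigenbasis is to conjugate the standard-basis matrix with the matrix whose columns are the eigenvectors. The sketch below does this numerically; <em>A</em>, <em>E</em>, and <em>D_check</em> are names introduced for this check.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
A <- matrix(c(0, 1, 1, 1), nrow = 2, byrow = TRUE)  # T in the standard basis
E <- cbind(c(1, lambda1), c(1, lambda2))            # eigenvectors as columns

# Expressing T in the eigenbasis: off-diagonal entries vanish up to rounding
D_check <- solve(E) %*% A %*% E
round(D_check, 10)
all.equal(D_check, diag(c(lambda1, lambda2)))</code></pre></figure>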
<p>We are not done yet, however. Note that $D$ is the matrix of the linear map $T$ with respect to the basis that consists of both eigenvectors $v_1$ and $v_2$, <em>not</em> with respect to the standard basis. We have changed our coordinate system — our map — as indicated by the Figure below; the black coloured vectors are the standard basis vectors while the vectors coloured in red are our new basis vectors.</p>
<div style="text-align:center;">
<img src="../assets/img/change-of-basis.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" />
</div>
<p>To build some intuition, let’s play around with representing $\omega$ in both the standard basis and our new eigenbasis. Any vector is a linear combination of the basis vectors. Let $a_1$ and $a_2$ be the coefficients for the standard basis such that:</p>
<script type="math/tex; mode=display">\omega = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>Now because I have drawn it earlier, I know that $a_1 = -1$ and $a_2 = 0.3$. This is the representation of $\omega$ in the standard basis. How do we represent it in our eigenbasis? Well, using the eigenbasis the vector $\omega$ is still a linear combination of the basis vectors, but with different coefficients; denote them as $b_1$ and $b_2$. We thus have:</p>
<script type="math/tex; mode=display">\omega = b_1 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} + b_2 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>If we write this in matrix form, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}\\[1em]
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} \enspace .
\end{aligned} %]]></script>
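<p>Thus, given the coordinates $a$ of a vector with respect to the basis $S$, we obtain its coordinates $b$ with respect to our new basis $E$ (where the columns of $S$ and $E$ hold the respective basis vectors) by computing:</p>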
<script type="math/tex; mode=display">b = E^{-1} S \, a \enspace .</script>
<p>In our eigenbasis, the vector $\omega$ has the coordinates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lambda1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">lambda2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">a</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] -0.1422
## [2,] -0.8578</code></pre></figure>
<p>This means we have the representation:</p>
<script type="math/tex; mode=display">\omega = -0.14 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} - 0.86 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace ,</script>
<p>which makes intuitive sense when you look at the Figure above. For another beautiful linear algebra video by 3Blue1Brown, this time about changing bases, see <a href="https://www.youtube.com/watch?v=P2LTAUO1TdA&t=598s">here</a>. In the next section, we will use what we have learned above to express the $n^{th}$ Fibonacci number in closed-form.</p>
<h1 id="closed-form-fibonacci">Closed-form Fibonacci</h1>
<p>Recall from above that our solution to finding the $n^{th}$ Fibonacci number in matrix form is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Now, we have swapped the non-diagonal matrix $A$ with the diagonal matrix $D$ by changing the basis from the standard basis to the eigenbasis. However, the vector $(0, 1)^T$ is still in the standard basis! In order to change its representation to the eigenbasis, we multiply it with $E^{-1}$, as discussed above. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Let’s use this to compute, say, the $10^{th}$ Fibonacci number (which is 55) in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">D</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lambda1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55.003636
## [2,] -0.003636</code></pre></figure>
<p>Ha! This didn’t quite work, did it? We got the answer for $F_{10}$ roughly when rounding, but $F_{11}$ is completely off. What did we miss? Well, this is in fact the correct answer — it is just in the wrong basis! We have to convert this from the eigenbasis to the standard basis. To do this, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
b &= E^{-1} S \, a \\
E b &= S \, a \\
E b &= a \enspace ,
\end{aligned} %]]></script>
<p>since $S$ is the identity matrix. Thus, all we have to do is to multiply with $E$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55
## [2,] 89</code></pre></figure>
<p>which is the correct solution. To get the closed-form solution algebraically, we first invert the matrix $E$:</p>
<script type="math/tex; mode=display">% <![CDATA[
E^{-1} = -\frac{1}{\sqrt{5}} \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix} \enspace , %]]></script>
<p>and we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \left(-\frac{1}{\sqrt{5}}\right) \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} -1 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n \\ \left(\frac{1 - \sqrt{5}}{2}\right)^n \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n + \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ -\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} + \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \\[1em]
&= \frac{1}{\sqrt{5}} \begin{pmatrix} \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \enspace .
\end{aligned} %]]></script>
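<p>The algebra above can also be checked numerically. The sketch below compares the hand-derived inverse with <em>solve(E)</em>, and compares $E D^n E^{-1}$ with the result of multiplying $A$ by itself $n$ times. The helper <em>mat_pow</em> is a hypothetical name introduced for this check only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
A <- matrix(c(0, 1, 1, 1), nrow = 2, byrow = TRUE)
E <- cbind(c(1, lambda1), c(1, lambda2))

# Hand-derived inverse of E from the text
E_inv <- -1 / sqrt(5) * matrix(c(lambda2, -1, -lambda1, 1), nrow = 2, byrow = TRUE)
all.equal(E_inv, solve(E))

# Naive matrix power by repeated multiplication (illustrative helper)
mat_pow <- function(M, n) {
  out <- diag(nrow(M))
  for (i in seq_len(n)) out <- out %*% M
  out
}

n <- 10
lhs <- E %*% diag(c(lambda1, lambda2)^n) %*% E_inv
all.equal(lhs, mat_pow(A, n))
drop(lhs %*% c(0, 1))  # F_10 and F_11, up to floating-point error</code></pre></figure>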
<p>The closed-form expression of the $n^{th}$ Fibonacci number is thus given by:</p>
<script type="math/tex; mode=display">F_n = \frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right] \enspace .</script>
<p>We verify this in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_closed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fib_closed</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">30</span><span class="p">)))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1 1 2 3 5 8 13 21 34 55
## [11] 89 144 233 377 610 987 1597 2584 4181 6765
## [21] 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040</code></pre></figure>
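<p>To make this verification explicit rather than visual, one can compare the closed-form values with those obtained from the recurrence itself. The function <em>fib_iter</em> below is a hypothetical helper written for this comparison (it is not the <em>fib_mem</em> function from earlier in the post), and <em>fib_closed</em> is repeated so that the snippet runs on its own.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">fib_closed <- function(n) {
  1/sqrt(5) * (((1 + sqrt(5))/2)^n - ((1 - sqrt(5))/2)^n)
}

# Fibonacci numbers F_1, ..., F_n from the recurrence F_n = F_{n-1} + F_{n-2}
fib_iter <- function(n) {
  f <- numeric(n)
  f[seq_len(min(n, 2))] <- 1
  for (i in seq_len(n)[-(1:2)]) f[i] <- f[i - 1] + f[i - 2]
  f
}

# The closed form is only approximately an integer, so round before comparing
all(round(fib_closed(seq(30))) == fib_iter(30))</code></pre></figure>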
<h1 id="the-golden-ratio">The golden ratio</h1>
<p>In the above section, we have derived a closed-form expression of the $n^{th}$ Fibonacci number. In this section, we return to an observation we have made at the beginning: there is structure in how the Fibonacci numbers grow. Johannes Kepler, after whom the university in my home town is named, (re)discovered that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lim_{n \rightarrow \infty} \frac{F_{n+1}}{F_n} &= \lim_{n \rightarrow \infty} \frac{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \right]}{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right]} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n} \\[1em]
&= \frac{1 + \sqrt{5}}{2} \approx 1.618 \enspace ,
\end{aligned} %]]></script>
<p>which is the <a href="https://en.wikipedia.org/wiki/Golden_ratio">golden ratio</a>. Two quantities are in the golden ratio $\phi$ if their ratio equals the ratio of their sum to the larger quantity, i.e., for $a > b > 0$:</p>
<script type="math/tex; mode=display">\phi \equiv \frac{a}{b} = \frac{a + b}{a} \enspace .</script>
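<p>To connect this definition to the limit above, a short derivation helps. Dividing numerator and denominator of $\frac{a + b}{a}$ by $b$ and writing $\phi = \frac{a}{b}$ gives:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\phi &= \frac{\phi + 1}{\phi} \\[1em]
\phi^2 - \phi - 1 &= 0 \enspace ,
\end{aligned} %]]></script>
<p>which is the same quadratic equation we solved for the eigenvalues. Since $a > b > 0$ implies $\phi > 1$, the golden ratio is the positive root $\frac{1 + \sqrt{5}}{2}$.</p>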
<p>We have observed this empirically in the first Figure, which visualized the differences in the log of two consecutive Fibonacci numbers, and which yielded already for small $n$:</p>
<script type="math/tex; mode=display">\text{log} \, F_{n+1} - \text{log} \, F_n = \text{log} \, \frac{F_{n + 1}}{F_n} \approx 0.4812 \enspace ,</script>
<p>which exponentiated yields the golden ratio. Observe that $\left(\frac{1 - \sqrt{5}}{2}\right)^n$ goes to zero very quickly as $n$ grows so that we can compute the $n^{th}$ Fibonacci number by:</p>
<script type="math/tex; mode=display">F_n = \left \lfloor \frac{1}{\sqrt{5}} \phi^n \right \rceil \enspace ,</script>
<p>where we simply round to the nearest integer. To finally answer Fibonacci’s puzzle:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_golden</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="nf">round</span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w">
</span><span class="n">fib_golden</span><span class="p">(</span><span class="m">12</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 144</code></pre></figure>
<p>After a mere twelve months of incest, there are 144 rabbit pairs!<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
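<p>The two numerical claims in this section are easy to check: the ratio of consecutive Fibonacci numbers approaches $\phi$, and the term dropped in the rounding formula stays below $\frac{1}{2}$ in absolute value, which is why rounding recovers $F_n$. A small sketch with illustrative variable names:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">phi <- (1 + sqrt(5)) / 2
psi <- (1 - sqrt(5)) / 2

# First 20 Fibonacci numbers from the recurrence
f <- numeric(20)
f[1:2] <- 1
for (i in 3:20) f[i] <- f[i - 1] + f[i - 2]

# Ratios of consecutive terms approach phi; their logs approach log(phi)
ratios <- f[-1] / f[-length(f)]
tail(ratios, 3)
log(phi)

# The neglected term psi^n / sqrt(5) stays below 1/2 in absolute value
max(abs(psi^(1:20) / sqrt(5)))</code></pre></figure>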
<p>There are various <a href="https://en.wikipedia.org/wiki/Generalizations_of_Fibonacci_numbers">generalizations</a> of the Fibonacci sequence. One such generalization is to allow higher orders $k$ in the sequence, which for $k = 3$ is known as the <a href="https://www.youtube.com/watch?v=fMJflV_GUpU">Tribonacci sequence</a>. Our approach for $k = 2$ can be straightforwardly generalized to account for any order $k$ (if you want to go down a rabbit hole, see for example <a href="https://math.stackexchange.com/questions/41667/fibonacci-tribonacci-and-other-similar-sequences">this</a>).</p>
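<p>As a hedged sketch of this generalization for $k = 3$: the recurrence $T_n = T_{n-1} + T_{n-2} + T_{n-3}$ corresponds to a $3 \times 3$ matrix, which we can diagonalize numerically with <em>eigen</em> instead of by hand. The starting values $(T_0, T_1, T_2) = (0, 0, 1)$ are an assumption (conventions differ), and the names <em>A3</em>, <em>start</em>, and <em>trib_eigen</em> are made up for this example. Two of the three eigenvalues are complex, so we take the real part at the end.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Companion matrix: maps (T_n, T_{n+1}, T_{n+2}) to (T_{n+1}, T_{n+2}, T_{n+3})
A3 <- rbind(c(0, 1, 0),
            c(0, 0, 1),
            c(1, 1, 1))
start <- c(0, 0, 1)  # assumed initial values (T_0, T_1, T_2)

ev <- eigen(A3)      # eigenvalues and eigenvectors (complex in general)
V <- ev$vectors

# T_n via V D^n V^{-1}; the imaginary parts cancel up to rounding error
trib_eigen <- function(n) {
  res <- V %*% diag(ev$values^n) %*% solve(V) %*% start
  round(Re(res[1]))
}

sapply(0:10, trib_eigen)</code></pre></figure>
<p>Rounding is used here for the same reason as in <em>fib_golden</em>: the floating-point result is only approximately an integer.</p>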
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have taken a detailed look at the Fibonacci sequence. In particular, we saw that it is the answer to a puzzle about procreating rabbits, and how to speed up a recursive algorithm for finding the $n^{th}$ Fibonacci number. We then used ideas from linear algebra to arrive at a closed-form expression of the $n^{th}$ Fibonacci number. Specifically, we have noted that the Fibonacci sequence is a linear recurrence relation — it can be viewed as repeatedly applying a linear map. With this insight, we observed that the matrix of the linear map is non-diagonal, which makes repeated execution tedious; diagonal matrices, on the other hand, are easy to multiply. We arrived at a diagonal matrix by changing the basis from the standard basis to the basis of eigenvectors, which led to a diagonal matrix of eigenvalues for the linear map. With this representation, the $n^{th}$ Fibonacci number is available in closed-form. In order to get it into the standard basis, we had to change basis back from the eigenbasis. We also saw how the Fibonacci numbers relate to the golden ratio $\phi$.</p>
<hr />
<p>I would like to thank Don van den Bergh, Jonas Haslbeck, and Sophia Crüwell for helpful comments on this blog post.</p>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the main reason why the Hindu-Arabic numeral system took over. The belief that it is easier to multiply and divide using Hindu-Arabic numerals is <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">incorrect</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This blog post is inspired by exercise 16 on p. 161 in <a href="http://linear.axler.net/">Linear Algebra Done Right</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I have learned that there is already (very good) ink spilled on this topic, see for example <a href="https://bosker.wordpress.com/2011/04/29/the-worst-algorithm-in-the-world/">here</a> and <a href="https://bosker.wordpress.com/2011/07/27/computing-fibonacci-numbers-using-binet%E2%80%99s-formula/">here</a>. A nice essay is also <a href="https://opinionator.blogs.nytimes.com/2012/09/24/proportion-control/?mtrref=undefined&gwh=C0500419D79A9E5B64F17ABC970C5125&gwt=pay">this</a> piece by Steve Strogatz, who, by the way, wrote a wonderful book called <a href="https://www.goodreads.com/book/show/354421.Sync">Sync</a>. He’s also been on Sean Carroll’s Mindscape podcast, listen <a href="https://www.preposterousuniverse.com/podcast/2019/04/08/episode-41-steven-strogatz-on-synchronization-networks-and-the-emergence-of-complex-behavior/">here</a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>If you forget everything that is written in this blog post, but through it were made aware of the videos by 3Blue1Brown (or <a href="https://www.numberphile.com/podcast/3blue1brown">Grant Sanderson</a>, as he is known in the real world), then I consider this blog post a success. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The downside of the closed-form solution is that it is difficult to calculate the power of the square root with high accuracy. In fact, <em>fib_golden</em> is incorrect for $n > 70$. Our <em>fib_mem</em> implementation is also incorrect, but only for $n > 93$. (I’ve compared it against Fibonacci numbers calculated from <a href="https://www.miniwebtool.com/list-of-fibonacci-numbers/?number=100">here</a>). <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Spurious correlations and random walks (2019-06-29T10:00:00+00:00, https://fabiandablander.com/r/Spurious-Correlation)</p>
<p>The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence the correlation.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In this blog post, I discuss a more subtle case of spurious correlation, one that is not of causal but of statistical nature: <em>completely independent processes can be correlated substantially</em>.</p>
<h2 id="ar1-processes-and-random-walks">AR(1) processes and random walks</h2>
<p>Moods, stock markets, the weather: everything changes, everything is in flux. The simplest model to describe change is an auto-regressive (AR) process of order one. Let $Y_t$ be a random variable where $t = 1, \ldots, T$ indexes discrete time. We write an AR(1) process as:</p>
<script type="math/tex; mode=display">Y_t = \phi \, Y_{t-1} + \epsilon_t \enspace ,</script>
<p>where $\phi$ governs how strongly the process depends on the previous observation (for $|\phi| < 1$ it equals the lag-one autocorrelation), and where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. For $\phi = 1$ the process is called a <em>random walk</em>. We can simulate from these using the following code:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">simulate_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">phi</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">y</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The following R code simulates data from three independent random walks and an AR(1) process with $\phi = 0.5$; the Figure below visualizes them.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>As we can see from the plot, the AR(1) process seems pretty well-behaved. This is in contrast to the three random walks: all of them have an initial upwards trend, after which the red line keeps on growing, while the blue line makes a downward jump. In contrast to AR(1) processes, random walks are <em>not stationary</em> since their variance is not constant across time. For some very good lecture notes on time-series analysis, see <a href="https://www.economodel.com/time-series-analysis">here</a>.</p>
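<p>The growing variance of a random walk can be made concrete with a small simulation. The following is a minimal sketch, not the code behind the Figure above: <code>sim_path</code> is a hypothetical stand-in for the simulator used in this post, and starting each path at zero with $\sigma = 1$ is an assumption made purely for illustration. It simulates many independent paths and computes the variance across paths at a few time points.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100         # length of each path
n_paths <- 2000  # number of independent paths

# Hypothetical stand-in for the simulator used in this post;
# assumes that each path starts at zero
sim_path <- function(n, phi, sigma = 1) {
  y <- numeric(n)
  for (t in seq(2, n)) {
    y[t] <- phi * y[t - 1] + rnorm(1, 0, sigma)
  }
  y
}

# Each column is one path, each row is one time point
rw <- replicate(n_paths, sim_path(n, phi = 1))
ar <- replicate(n_paths, sim_path(n, phi = 0.5))

# Variance across paths at selected time points
tp <- c(10, 50, 100)
var_rw <- apply(rw[tp, ], 1, var)
var_ar <- apply(ar[tp, ], 1, var)
round(rbind(t = tp, random_walk = var_rw, ar1 = var_ar), 2)</code></pre></figure>
<p>Under these assumptions, the variance of the random walk at time $t$ is $(t - 1) \sigma^2$ and therefore keeps growing, whereas the variance of the AR(1) process settles near $\sigma^2 / (1 - \phi^2) = 4/3$.</p>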
<h2 id="spurious-correlations-of-random-walks">Spurious correlations of random walks</h2>
<p>If we look at the correlations of these three random walks across time points, we find that they are substantial:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="nf">round</span><span class="p">(</span><span class="n">cor</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">red</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw1</span><span class="p">,</span><span class="w"> </span><span class="n">green</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw2</span><span class="p">,</span><span class="w"> </span><span class="n">blue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw3</span><span class="p">)),</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## red green blue
## red 1.00 -0.49 -0.29
## green -0.49 1.00 0.59
## blue -0.29 0.59 1.00</code></pre></figure>
<p>I hope that this is at least a little bit of a shock. Upon reflection, however, it is clear that we are blundering: computing the correlation across time ignores the dependency between data points that is so typical of time-series data. To get more data about what is going on, we conduct a small simulation study.</p>
<p>In particular, we want to get an intuition for how this spurious correlation behaves with increasing sample size. We therefore simulate two independent random walks for sample sizes $n \in \{50, 100, 200, 500, 1000, 2000\}$ and compute their Pearson correlation, the test statistic, and whether $p < \alpha$, where we set $\alpha$ to an arbitrary value, say $\alpha = 0.05$. We repeat this 100 times and report the average absolute correlation, the average absolute test statistic, and the proportion of significant results.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">tab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">data.frame</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">tab</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## n avg_abs_corr avg_abs_tstat percent_sig
## 1 50 0.41 3.57 0.71
## 2 100 0.46 6.58 0.85
## 3 200 0.45 8.88 0.85
## 4 500 0.37 10.63 0.86
## 5 1000 0.41 17.05 0.88
## 6 2000 0.39 23.39 0.97</code></pre></figure>
<p>We observe that the average absolute correlation is very similar across $n$, but the test statistic grows with increased $n$, which naturally results in many more false rejections of the null hypothesis of no correlation between the two random walks.</p>
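<p>Why a roughly constant correlation goes together with a growing test statistic follows from the standard identity $t = r \sqrt{n - 2} / \sqrt{1 - r^2}$ for the Pearson correlation test. The sketch below evaluates it for a fixed, illustrative value of $r = 0.4$, which is close to the averages in the table; <code>t_from_r</code> is a hypothetical helper name. Since the average of $|t|$ is not the same as $t$ evaluated at the average $|r|$, the numbers are not expected to match the table exactly.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># t statistic of the Pearson correlation test for a given r and n
t_from_r <- function(r, n) r * sqrt(n - 2) / sqrt(1 - r^2)

ns <- c(50, 100, 200, 500, 1000, 2000)
round(t_from_r(0.4, ns), 2)</code></pre></figure>
<p>For fixed $r$, the statistic grows at the rate $\sqrt{n}$, so any correlation that does not shrink towards zero will eventually be declared significant.</p>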
<p>To my knowledge, Granger and Newbold (1974) were the first to point out this puzzling fact.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> They regress one random walk onto the other instead of computing the Pearson correlation (note that the test statistic is the same in both cases). In a regression setting, we write:</p>
<script type="math/tex; mode=display">Y = \beta_0 + \beta_1 X + \epsilon \enspace ,</script>
<p>where we assume that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ (see also a <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">previous</a> blog post). This is evidently violated when performing linear regression on two random walks, as demonstrated by the residual plot below.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
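<p>As a quick check of the remark that the regression and the correlation test share the same test statistic, the following sketch fits both to two freshly simulated random walks. The variable names and the use of <code>cumsum(rnorm(n))</code> with unit variance to generate the walks are assumptions made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100

# Two independent random walks as cumulative sums of Gaussian noise
x <- cumsum(rnorm(n))
y <- cumsum(rnorm(n))

fit <- lm(y ~ x)
t_reg <- coef(summary(fit))[2, 3]          # t value of the slope
t_cor <- unname(cor.test(x, y)$statistic)  # t value of the correlation test
all.equal(t_reg, t_cor)</code></pre></figure>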
<p>As above, we can posit an AR(1) process on the residuals:</p>
<script type="math/tex; mode=display">\epsilon_t = \delta \epsilon_{t-1} + \eta_t \enspace ,</script>
<p>and test whether $\delta = 0$. We can do so using the <a href="https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic">Durbin-Watson test</a>, which yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">car</span><span class="o">::</span><span class="n">durbinWatsonTest</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## lag Autocorrelation D-W Statistic p-value
## 1 0.9357562 0.08623868 0
## Alternative hypothesis: rho != 0</code></pre></figure>
<p>This indicates substantial autocorrelation, violating our modeling assumption of independent residuals. In the next section, we look at the deeper mathematical reasons for why we get such spurious correlation. In the Post Scriptum, we relax the constraint that $\phi = 1$ and look at how spurious correlation behaves for AR(1) processes.</p>
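<p>To see what the Durbin-Watson statistic measures, it can be computed by hand from the residuals. The sketch below uses a regression of two newly simulated random walks as a stand-in for the fitted model; the data and variable names are assumptions. The statistic is the sum of squared successive differences of the residuals divided by their sum of squares, which is approximately $2 (1 - r_1)$, where $r_1$ is the lag-1 autocorrelation of the residuals.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 200
x <- cumsum(rnorm(n))
y <- cumsum(rnorm(n))
e <- residuals(lm(y ~ x))

# Durbin-Watson statistic: squared successive differences
# relative to the squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals
r1 <- sum(e[-1] * e[-n]) / sum(e^2)

c(dw = dw, approx = 2 * (1 - r1), r1 = r1)</code></pre></figure>
<p>Values near 2 are what one would expect for uncorrelated residuals, while values near 0 go together with strong positive autocorrelation.</p>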
<!-- In the next section, we will look more formally into the curious fact that two independent random walks are correlated. To understand why even with large $n$ the estimation goes awry, we have to make an excursion into asymptotia. -->
<h2 id="inconsistent-estimation">Inconsistent estimation</h2>
<p>The simulation results from the random walk simulations showed that the average (absolute) correlation stays roughly constant, while the test statistic increases with $n$. This indicates a problem with our estimator for the correlation. Because it is slightly easier to study, we focus on the regression parameter $\beta_1$ instead of the Pearson correlation. <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">Recall</a> that our regression estimate is</p>
<script type="math/tex; mode=display">\hat{\beta}_1 = \frac{\sum_{t=1}^N (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^N (x_t - \bar{x})^2} \enspace ,</script>
<p>where $\bar{x}$ and $\bar{y}$ are the empirical means of the realizations $x_t$ and $y_t$ of the AR(1) processes $X_t$ and $Y_t$, respectively. The test statistic associated with the null hypothesis $\beta_1 = 0$ is</p>
<script type="math/tex; mode=display">t_{\text{statistic}} := \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\hat{\sigma} / \sqrt{\sum_{t=1}^N (x_t - \bar{x})^2}} \enspace ,</script>
<p>where $\hat{\sigma}$ is the estimated standard deviation of the error. In simple linear regression, the test statistic follows a t-distribution with $n - 2$ degrees of freedom (it takes two parameters to fit a straight line). In the case of independent random walks, however, the test statistic does not have a limiting distribution; in fact, as $n \rightarrow \infty$, the distribution of $t_{\text{statistic}}$ diverges (Phillips, 1986).</p>
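<p>The test statistic can be computed by hand and compared against <code>lm</code>. This is a minimal sketch on independent Gaussian data, where the usual regression assumptions hold; the variable names are chosen for illustration, and the slope is the ordinary least-squares slope, that is, the sum of cross-products divided by the sum of squares of $x$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 50
x <- rnorm(n)
y <- rnorm(n)

# Least-squares slope and intercept
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

# Estimated error standard deviation with n - 2 degrees of freedom
res <- y - b0 - b1 * x
sigma_hat <- sqrt(sum(res^2) / (n - 2))

# Standard error of the slope and the resulting test statistic
se_b1 <- sigma_hat / sqrt(sum((x - mean(x))^2))
t_stat <- b1 / se_b1

# Compare with the estimate, standard error, and t value reported by lm
fit <- coef(summary(lm(y ~ x)))
all.equal(c(b1, se_b1, t_stat), unname(fit[2, 1:3]))</code></pre></figure>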
<p>To get an intuition for this, we plot the simulated sampling distributions of $\hat{\beta}_1$ and $t_{\text{statistic}}$, both for the case of regressing one AR(1) process onto another, independent one, and for random walk regression.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">regress_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">)))[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bootstrap_limit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">ns</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'b1'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">regress_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">coefs</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ix</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">res_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w">
</span><span class="n">res_rw</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span></code></pre></figure>
<p>The Figure below illustrates how things go wrong when regressing one independent random walk onto the other. In contrast to the estimate for the AR(1) regression, the estimate $\hat{\beta}_1$ does not concentrate around zero as $n$ grows in the case of a random walk regression. Instead, it stays roughly within $[-0.75, 0.75]$ across all $n$. This sheds further light on the initial simulation result that the average correlation stays roughly the same. Moreover, in contrast to the AR(1) regression, for which the distribution of the test statistic does not change, the distribution of the test statistic for the random walk regression seems to diverge. This explains why the proportion of false positives increases with $n$.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
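<p>The same pattern can be summarized numerically. The following is a rough sketch rather than the code behind the Figure: <code>sim_fit</code> and <code>spread</code> are hypothetical helpers, the series are generated with <code>stats::filter</code> instead of the simulator used in this post, and the sample sizes are chosen for illustration. For each $n$, it reports the standard deviation of the slope estimate and of the test statistic across 200 simulated regressions.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)

# Hypothetical helper: slope and t value from regressing one simulated
# AR(1) series on another, independent one (phi = 1 gives a random walk)
sim_fit <- function(n, phi, sigma = 0.1) {
  x <- as.numeric(stats::filter(rnorm(n, 0, sigma), phi, method = "recursive"))
  y <- as.numeric(stats::filter(rnorm(n, 0, sigma), phi, method = "recursive"))
  coef(summary(lm(y ~ x)))[2, c(1, 3)]
}

# Spread of the slope and of the test statistic across repeated simulations
spread <- function(n, phi, times = 200) {
  sims <- replicate(times, sim_fit(n, phi))  # 2 x times matrix
  c(n = n, sd_b1 = sd(sims[1, ]), sd_tstat = sd(sims[2, ]))
}

ns <- c(100, 400, 1600)
spread_ar <- t(sapply(ns, spread, phi = 0.5))
spread_rw <- t(sapply(ns, spread, phi = 1))
round(spread_ar, 2)
round(spread_rw, 2)</code></pre></figure>
<p>If the pattern described above holds, <code>sd_b1</code> should shrink with $n$ for the AR(1) regression but not for the random walk regression, while <code>sd_tstat</code> should grow with $n$ only for the random walks.</p>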
<p>Rigorous arguments for the above statements can be found in Phillips (1986) and Hamilton (1994, pp. 577).<sup id="fnref:4"><a href="#fn:4" class="footnote">3</a></sup> The explanations feature some nice asymptotic arguments which I would love to go into in detail; however, I’m currently in Santa Fe for a summer school that has a very tightly packed programme. On that note: it is <a href="https://www.santafe.edu/engage/learn/schools/sfi-complex-systems-summer-school">very, very cool</a>. You should definitely apply next year! In addition to the stimulating lectures, wonderful people, and exciting projects, the surroundings are stunning<sup id="fnref:5"><a href="#fn:5" class="footnote">4</a></sup>.</p>
<div style="text-align:center;">
<img src="../assets/img/IAIA.jpeg" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="720" height="620" />
</div>
<!-- ### Brownian Motion -->
<!-- The type of random walk we focused on in this blog post takes place in discrete, equidistant time steps.[^3] If we take the limit of $n \rightarrow \infty$, however, we move from a discrete time random walk to a continuous time Brownian motion. The gist of the argument is to make the difference $\Delta Y_t$ between time points $Y_{t+1}$ and $Y_t$ infinitesimally small. Recall that the Gaussian distribution is [closed under addition](https://fdabl.github.io/statistics/Two-Properties.html), and that -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \sum_{i=1}^t \eta_i \sim \mathcal{N}(0, t \cdot \sigma^2) \enspace \\[1em] -->
<!-- \Delta Y_t &= Y_{t+1} - Y_{t} = \sum_{i=1}^{t+1} \eta_i - \sum_{j=1}^t \eta_j = \eta_t \sim \mathcal{N}(0, \sigma^2) \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- We may cut $\eta_t$ into $n$ pieces -->
<!-- $$ -->
<!-- \eta_t = \eta_{1t} + \eta_{2t} + \ldots + \eta_{nt} \enspace , -->
<!-- $$ -->
<!-- where $\eta_{it} \sim \mathcal{N}(0, \frac{1}{n})$. Therefore, as we increase $n$, the discrete-time process is defined at a finer and finer grid. For $n \rightarrow \infty$, this results into the continuous-time Brownian motion, which we denote as $W(t)$, where $W: t \in [0, 1] \rightarrow \mathbb{R}$. -->
<!-- ## Solutions -->
<!-- Hamilton (1994, p. 562) discusses three solutions. One of them is to *difference* the data before doing the regression, i.e., -->
<!-- $$ -->
<!-- \Delta Y_t = \beta_0 + \beta_1 \Delta X_t + \epsilon_t \enspace , -->
<!-- $$ -->
<!-- where $\Delta Y_t = Y_{t+1} - Y_t$. This does in fact work: -->
<!-- ```{r} -->
<!-- broom::tidy(lm(diff(rw1) ~ diff(rw2))) -->
<!-- ``` -->
<!-- ```{r, echo = FALSE} -->
<!-- n <- 1000 -->
<!-- dat <- matrix(0, nrow = n, ncol = 2) -->
<!-- B <- cbind( -->
<!-- c(.4, .2), -->
<!-- c(-.2, .4) -->
<!-- ) -->
<!-- for (i in seq(2, n)) { -->
<!-- z <- rnorm(1) -->
<!-- # dat[i, ] <- dat[i-1, ] %*% B + rnorm(2) -->
<!-- dat[i, ] <- c(.8, .4) * z + rnorm(2) -->
<!-- } -->
<!-- ``` -->
<!-- Why? Let $\eta_t$ and $\psi_t$ denote the errors of the two processes $Y$ and $X$, respectively, distributed according to zero-mean Gaussian with variances $\sigma_y$ and $\sigma_x$. We write -->
<!-- $$ -->
<!-- \Delta Y_t = \sum_{i=1}^{t+1} \eta_i - \sum_{i=1}^{t} \eta_i = \eta_{t+1} \sim \mathcal{N}(0, \sigma_y^2) \\[1em] -->
<!-- \Delta X_t = \sum_{i=1}^{t+1} \psi_i - \sum_{i=1}^{t} \eta_i = \psi_{t+1} \sim \mathcal{N}(0, \sigma_x^2) \enspace . -->
<!-- $$ -->
<!-- Now, since the respective differences are independent of each other, their correlation will be zero. -->
<!-- However, Hamilton notes that if the time-series are really stationary ($\vert \phi \lvert < 1$), then this can result in misspecified regression. Moreover, if $Y$ and $X$ are non-stationary but *cointegrated processes*, then this also will result in misspecification. -->
<h2 id="conclusion">Conclusion</h2>
<p>“Correlation does not imply causation” is a common response to apparently spurious correlation. The idea is that we observe spurious associations because we do not have the full causal picture, as in the example of storks and human babies. In this blog post, we have seen that spurious correlation can also arise for purely statistical reasons. In particular, we have seen that two independent random walks can be highly correlated. This can be diagnosed by looking at the residuals, which will <em>not</em> be independent and identically distributed, but will show a pronounced autocorrelation.</p>
<p>The mathematical explanation for the spurious correlation is not trivial. Using simulations, we found that the estimate of $\beta_1$ does not converge to the true value in the case of regressing one independent random walk onto another. Moreover, the test statistic diverges, meaning that with increasing sample size we are almost certain to reject the null hypothesis of no association. The spurious correlation occurs because our estimate is not consistent, which is a purely statistical explanation that does not invoke causal reasoning.</p>
<hr />
<p><em>I want to thank Toni Pichler and Andrea Bacilieri for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<!-- ### Mean and variance of AR(1) and random walk -->
<!-- To better understand the differences between AR(1) processes and random walks, we look at their respective first two moments. We write out the process for some window of length $j$, and then recursively substitute: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \phi \, Y_{t-1} + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, Y_{t-2} + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, \left(\phi \, Y_{t-3} + \epsilon_{t-2}\right) + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \vdots \\[.5em] -->
<!-- &= \phi^{j + 1} \, Y_{t - (j + 1)} + \sum_{i=t}^{t - (j + 1)} \phi^i \epsilon_{t-i} \\[.5em] -->
<!-- &= \sum_{i=0}^{t-1} \phi^i \epsilon_{t-i} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we assume that $Y_0 = 0$ is fixed. Let's compute the first two moments of this process. Exploiting linearity, we write: -->
<!-- $$ -->
<!-- \mathbb{E}[Y_t] = \mathbb{E}\left[\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \mathbb{E}\left[\phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \phi^i \mathbb{E}\left[\epsilon_{t-i}\right] = 0 \enspace . -->
<!-- $$ -->
<!-- This is also true for $\phi = 1$, i.e., a random walk. For the variance, we write: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\left(Y_t - \mathbb{E}[Y_t]\right)^2\right] -->
<!-- = \mathbb{E}\left[Y_t^2\right] -->
<!-- = \mathbb{E}\left[\left(\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right)^2\right] \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we split the quadratic into ["diagonal"](https://math.stackexchange.com/questions/125435/what-is-the-opposite-of-a-cross-term) terms and cross-terms, the latter of which have expectation zero by our assumption that the residuals are independent: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2 + \sum_{i=0}^{t - 1} \sum_{j\neq i}^{t - 1} \left(\phi^i \epsilon_{t-i}\right) \left(\phi^j \epsilon_{t-j}\right)\right] \\[.5em] -->
<!-- &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \mathbb{E}\left[\left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \left(\phi^i\right)^2 \mathbb{E}\left[\epsilon_{t-i}^2\right] \\[.5em] -->
<!-- &= \sigma^2\sum_{i=0}^{t - 1} \left(\phi^2\right)^i \\[.5em] -->
<!-- &= \sigma^2 \frac{1}{1 - \phi^2} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where the last line follows when $N \rightarrow \infty$ for $\vert\phi\vert < 1$ from a geometric series. For a random walk, however, this is not a geometric series anymore; it therefore does not converge, and the variance of a random walk does not exist. -->
<h3 id="spurious-correlation-of-ar1-processes">Spurious correlation of AR(1) processes</h3>
<p>In the main text, we have looked at how the spurious correlation behaves for a random walk. Here, we study how the spurious correlation behaves as a function of $\phi \in [0, 1]$. We focus on sample sizes of $n = 200$, and adapt the simulation code from above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">200</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">phis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.02</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phis</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'phi'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">phi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">phi</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>The figure below shows that the issue of spurious correlation gets progressively worse as the AR(1) process approaches a random walk (i.e., $\phi = 1$). Despite this, the regression estimate remains consistent.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<h2 id="references">References</h2>
<ul>
<li>Granger, C. W., & Newbold, P. (<a href="http://wolfweb.unr.edu/~zal/STAT758/Granger_Newbold_1974.pdf">1974</a>). Spurious regressions in econometrics. <em>Journal of Econometrics, 2</em>(2), 111-120.</li>
<li>Hamilton, J. D. (<a href="https://press.princeton.edu/titles/5386.html">1994</a>). <em>Time Series Analysis</em>. Princeton, US: Princeton University Press.</li>
<li>Kuiper, R. M., & Ryan, O. (<a href="https://www.tandfonline.com/doi/full/10.1080/10705511.2018.1431046">2018</a>). Drawing conclusions from cross-lagged relationships: Re-considering the role of the time-interval. <em>Structural Equation Modeling: A Multidisciplinary Journal, 25</em>(5), 809-823.</li>
<li>Phillips, P. C. (<a href="http://dido.econ.yale.edu/korora/phillips/pubs/art/a044.pdf">1986</a>). Understanding spurious regressions in econometrics. <em>Journal of Econometrics, 33</em>(3), 311-340.</li>
<li>Matthews, R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9639.00013?casa_token=cWUllTD9P14AAAAA:PRERZz-uS2z9xX3DGt0-Qize94FuZuw-35s-2ECfUDY9Oi3J1m83cZh8EBHGlGh7fwQ2WHShOQuwB-YO">2000</a>). Storks deliver babies (p = 0.008). <em>Teaching Statistics, 22</em>(2), 36–38.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There are, of course, many <a href="https://www.tylervigen.com/spurious-correlations">more</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Thanks to Toni Pichler for drawing my attention to the fact that independent random walks are correlated, and Andrea Bacilieri for providing me with the classic references. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Moreover, one way to avoid the spurious correlation is to <em>difference</em> the time-series. For other approaches, see Hamilton (1994, p. 561). <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>This awesome picture was made by Luther Seet. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>Fabian Dablander</p>
<p>Bayesian modeling using Stan: A case study. 2019-05-30T10:00:00+00:00. https://fabiandablander.com/r/Law-of-Practice</p>
<link rel="stylesheet" href="../highlight/styles/default.css" />
<script src="../highlight/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script>$('pre.stan code').each(function(i, block) {hljs.highlightBlock(block);});</script>
<p>Practice makes better. And faster. But what exactly is the relation between practice and reaction time? In this blog post, we will focus on two contenders: the <em>power law</em> and <em>exponential</em> function. We will implement these models in Stan and extend them to account for learning plateaus and the fact that, with increased practice, not only the mean reaction time but also its variance decreases. We will contrast two perspectives on predictive model comparison: a <em>(prior) predictive</em> perspective based on marginal likelihoods, and a <em>(posterior) predictive</em> perspective based on leave-one-out cross-validation. So let’s get started!<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<h1 id="two-models">Two models</h1>
<p>We can model the relation between reaction time (in seconds) and the number of practice trials as a power law function. Let $f: \mathbb{N} \rightarrow \mathbb{R}^+$ be a function that maps the number of trials to reaction times. We write</p>
<script type="math/tex; mode=display">f_p(N) = \alpha + \beta N^{-r} \enspace ,</script>
<p>where $\alpha$ is a lower bound (one cannot respond faster than that due to processing and motor control limits); $\beta$ is the learning gain from practice with respect to the first trial ($N = 1$); $N$ indexes the particular trial; and $r$ is the learning rate. Similarly, we can write</p>
<script type="math/tex; mode=display">f_e(N) = \alpha + \beta e^{-rN} \enspace ,</script>
<p>where the parameters have the same interpretation, except that $\beta$ is the learning gain from practice compared to no practice ($N = 0$).<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<p>What is the main difference between those two functions? <em>The exponential model assumes a constant learning rate, while the power model assumes diminishing returns</em>. To see this, let $\alpha = 0$, and ignore for the moment that $N$ is discrete. Taking the derivative for the power law model results in</p>
<script type="math/tex; mode=display">\frac{\partial f_p(N)}{\partial N} = -r\beta N^{-r - 1} = (-r/N) \, \beta N^{-r} = (-r/N) \, f_p(N) \enspace ,</script>
<p>which shows that the <em>local learning rate</em>, that is, the relative change in reaction time per additional trial, is $-r/N$; it depends on how many trials have been completed previously. The more one has practiced, the smaller the magnitude of the local learning rate $-r / N$. The exponential function, in contrast, shows no such dependency on practice:</p>
<script type="math/tex; mode=display">\frac{\partial f_e(N)}{\partial N} = -r\beta e^{-rN} = -r \, f_e(N) \enspace .</script>
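<p>The two derivatives above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the helper <code>local_rate</code> and the values of $\beta$, $r$, and $N$ are illustrative assumptions. It approximates $f'(N) / f(N)$ with a central finite difference, which should give roughly $-r/N$ for the power law and $-r$ for the exponential function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Power law and exponential functions with alpha = 0, as in the text
f_p <- function(N, beta, r) beta * N^(-r)
f_e <- function(N, beta, r) beta * exp(-r * N)

# Hypothetical helper: relative change per trial, f'(N) / f(N),
# approximated with a central finite difference of step size h
local_rate <- function(f, N, ..., h = 1e-5) {
  (f(N + h, ...) - f(N - h, ...)) / (2 * h) / f(N, ...)
}

# Illustrative values, not taken from the post
N <- c(1, 5, 25)
beta <- 2
r <- 0.5

rate_p <- local_rate(f_p, N, beta = beta, r = r)  # close to -r / N
rate_e <- local_rate(f_e, N, beta = beta, r = r)  # close to -r for every N</code></pre></figure>
<p>For the power law, the rate shrinks in magnitude as $N$ grows; for the exponential function, it stays the same at every trial.</p>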
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>The figure above visualizes two data sets, generated from either a power law (left) or an exponential model (right), as well as the maximum likelihood fit of both models to these data. It is rather difficult to tell which model performs best just by eyeballing the fit. We thus need to engage in a more formal way of comparing models.</p>
<h1 id="two-perspectives-on-prediction">Two perspectives on prediction</h1>
<p>Let’s agree that the best way to compare models is to look at predictive accuracy. Predictive accuracy with respect to what? The figure below illustrates two different answers one might give. The grey shaded surface represents unobserved data; the white island inside is the observed data; the model is denoted by $\mathcal{M}$.</p>
<p><img src="../assets/img/prediction-perspectives.png" align="center" style="padding: 10px 10px 10px 10px;" /></p>
<p>On the left, the model makes predictions <em>before</em> seeing any data by means of its <em>prior predictive distribution</em>. The predictive accuracy is then evaluated on the actually observed data. In contrast, on the right, the model makes predictions <em>after</em> seeing the data by means of its <em>posterior predictive distribution</em>. In principle, its predictive accuracy is evaluated on data one does not observe (visualized as the grey area). One can estimate this expected <em>out-of-sample</em> predictive accuracy by cross-validation procedures which partition the observed data into a training and test set. The model only sees the training set, and makes predictions for the unseen test set.</p>
<p>One key practical distinction between these two perspectives is how predictions are generated. On the left, predictions are generated from the prior. On the right, the prior first gets updated to the posterior using the observed data, and it is through the posterior that predictions are made. In the next two sections, we make this difference more precise.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<!-- To contrast these two perspectives, let's study an example that, if you have read some of my previous blog posts (for example, [this](http://127.0.0.1:4000/r/Regularization.html) one), you will be thoroughly familiar with: coin flips. Assume we observe $y = 5$ heads out of $n = 10$ coin flips. We wish to compare the following two models -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \mathcal{M}_0&: \theta = 0.50 \\[1em] -->
<!-- \mathcal{M}_1&: \theta \sim \text{Beta}(1, 1) -->
<!-- \end{aligned} -->
<!-- $$ -->
<h2 id="prior-prediction-marginal-likelihoods">Prior prediction: Marginal likelihoods</h2>
<p>From this perspective, we weight the model’s prediction of the observed data given a particular parameter setting by the prior. This is accomplished by integrating the likelihood with respect to the prior, which gives the so-called <em>marginal likelihood</em> of a model $\mathcal{M}$:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .</script>
<p>It is clear that the prior matters a great deal, but that is no surprise — it is part of the model. A bad prior means a bad model. The ratio of two such marginal likelihoods is known as the Bayes factor.</p>
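<p>To make the integral concrete, here is a small sketch for a toy problem that is not part of the reaction time analysis: we assume $y = 5$ heads in $n = 10$ coin flips and compare a point hypothesis $\theta = 0.5$ against a model with a $\text{Beta}(1, 1)$ prior on $\theta$. The marginal likelihood of the second model is obtained by numerically integrating the likelihood with respect to the prior.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy data (an assumption for illustration): 5 heads in 10 flips
y <- 5
n <- 10

# M0 fixes theta = 0.5, so its marginal likelihood is just the likelihood
ml_0 <- dbinom(y, n, prob = 0.5)

# M1 puts a Beta(1, 1) prior on theta; integrate likelihood times prior
integrand <- function(theta) dbinom(y, n, prob = theta) * dbeta(theta, 1, 1)
ml_1 <- integrate(integrand, lower = 0, upper = 1)$value

# Bayes factor in favour of M0 over M1
bf_01 <- ml_0 / ml_1</code></pre></figure>
<p>With a uniform prior the integral has the closed form $1 / (n + 1)$, which provides a check on the numerical result.</p>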
<p>If one is willing to assign priors to models, one can compute posterior model probabilities, i.e.,</p>
<script type="math/tex; mode=display">\begin{equation}
p(\mathcal{M}_k \mid y) = p(\mathcal{M}_k) \times \frac{p(y \mid \mathcal{M}_k)}{\sum_{i=1}^K p(y \mid \mathcal{M}_i) \, p(\mathcal{M}_i)} \enspace ,
\end{equation}</script>
<p>where $K$ is the number of models under consideration (see also a <a href="https://fdabl.github.io/r/Spike-and-Slab.html">previous</a> blogpost). Observe that the marginal likelihood features prominently: it is an updating factor from prior to posterior model probability. With this, one can also compute <em>Bayesian model-averaged</em> predictions:</p>
<script type="math/tex; mode=display">p(\tilde{y} \mid y) = \sum_{k=1}^K p(\tilde{y} \mid y, \mathcal{M}_k) \, \underbrace{p(\mathcal{M}_k \mid y)}_{w_k} \enspace ,</script>
<p>where $\tilde{y}$ is unseen data, and where the prediction of each model gets weighted by its posterior probability. We denote a model weight, which in this case is its posterior probability, as $w_k$.</p>
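<p>As a sketch of how these quantities fit together, consider again a toy coin flip problem (an assumption for illustration, not the reaction time models): $y = 7$ heads in $n = 10$ flips, a point model with $\theta = 0.5$, and a model with a $\text{Beta}(1, 1)$ prior. The helper <code>post_model_prob</code> is hypothetical; it turns log marginal likelihoods and prior model probabilities into posterior model probabilities, which then weight each model's prediction that the next flip lands heads.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper: posterior model probabilities from log marginal
# likelihoods, computed on the log scale for numerical stability
post_model_prob <- function(log_ml, prior = rep(1 / length(log_ml), length(log_ml))) {
  log_joint <- log_ml + log(prior)
  w <- exp(log_joint - max(log_joint))
  w / sum(w)
}

# Toy data (an assumption for illustration): 7 heads in 10 flips
y <- 7
n <- 10

# Log marginal likelihoods: M0 fixes theta = 0.5, M1 uses a Beta(1, 1) prior,
# for which the marginal likelihood is 1 / (n + 1)
log_ml <- c(M0 = dbinom(y, n, prob = 0.5, log = TRUE), M1 = log(1 / (n + 1)))
w <- post_model_prob(log_ml)

# Each model's posterior predictive probability that the next flip is heads
pred <- c(M0 = 0.5, M1 = (1 + y) / (2 + n))

# Bayesian model-averaged prediction
bma_pred <- sum(w * pred)</code></pre></figure>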
<h2 id="posterior-prediction-leave-one-out-cross-validation">Posterior prediction: Leave-one-out cross-validation</h2>
<p>Another perspective aims to estimate the expected out-of-sample prediction error, or expected log predictive density, i.e.,</p>
<script type="math/tex; mode=display">\text{elpd}^{\mathcal{M}} = \mathbb{E}_{\tilde{y}} \left(\text{log} \, p(\tilde{y} \mid y) \right) \enspace ,</script>
<p>where the expectation is taken with respect to unseen data $\tilde{y}$ (which is visualized as a grey surface with a $?$ inside in the figure above).</p>
<p>Clearly, as we do not have access to unseen data, we cannot evaluate this. However, one can approximate this quantity by computing the leave-one-out prediction error in our sample:</p>
<script type="math/tex; mode=display">\widehat{\text{elpd}}^{\mathcal{M}}_{\text{loo}} = \frac{1}{n} \sum_{i=1}^n \, \text{log} \, p(y_i \mid y_{-i}) \approx \mathbb{E}_{\tilde{y}} \left(\text{log} \, p(\tilde{y} \mid y) \right) \enspace,</script>
<p>where $y_i$ is the $i^{\text{th}}$ data point, and $y_{-i}$ are all data points except $y_i$, where we have suppressed conditioning on $\mathcal{M}$ to not clutter notation (even more), and where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y_i \mid y_{-i}) &= \int_{\Theta} p(y_i, \theta \mid y_{-i}) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} p(y_i \mid y_{-i}, \theta) \, p(\theta \mid y_{-i}) \, \mathrm{d}\theta \enspace .
\end{aligned} %]]></script>
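<p>For a conjugate model, the leave-one-out predictive densities are available in closed form, which makes the estimate easy to sketch. The example below is an illustration under assumptions that are not part of the post: hypothetical coin flip data and a $\text{Beta}(1, 1)$ prior, so that the posterior given $y_{-i}$ is again a Beta distribution and $p(y_i = 1 \mid y_{-i})$ is its mean.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical coin flips (1 = heads) and a Beta(a, b) prior on theta
y <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)
a <- 1
b <- 1
n <- length(y)

# Log predictive density of each data point given all the others
loo_lpd <- sapply(seq_len(n), function(i) {
  y_rest <- y[-i]
  # Posterior mean of theta given y_rest is the predictive probability of heads
  p_heads <- (a + sum(y_rest)) / (a + b + n - 1)
  dbinom(y[i], size = 1, prob = p_heads, log = TRUE)
})

# Leave-one-out estimate, averaged over data points as in the formula above
elpd_loo <- mean(loo_lpd)</code></pre></figure>
<p>Note that the <em>loo</em> package reports the sum rather than the mean of the pointwise values. For models without a closed-form posterior, the integral in the last equation has to be approximated.</p>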
<p>Observe that this requires integrating over the <em>posterior distribution</em> of $\theta$ given all but one data point; this is in contrast to the marginal likelihood perspective, which requires integration with respect to the <em>prior distribution</em>. From this perspective, one can similarly compute model weights $w_k$</p>
<script type="math/tex; mode=display">w_k = \frac{\text{exp}\left(\widehat{\text{elpd}}^{\mathcal{M}_k}_{\text{loo}}\right)}{\sum_{i=1}^K \text{exp}\left(\widehat{\text{elpd}}^{\mathcal{M}_i}_{\text{loo}}\right)} \enspace ,</script>
<p>where $\widehat{\text{elpd}}_{\text{loo}}^{\mathcal{M}_k}$ is the loo estimate for the expected log predictive density for model $\mathcal{M}_k$. For prediction, one averages across models using these <em>Pseudo Bayesian model-averaging</em> weights (Yao et al., 2018, p. 92).</p>
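<p>The weight formula is a softmax over the loo estimates. A minimal sketch follows, in which the function name <code>pseudo_bma_weights</code> and the elpd values are made up for illustration; subtracting the maximum before exponentiating avoids numerical underflow and does not change the result.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper: Pseudo-BMA weights from loo estimates of the elpd
pseudo_bma_weights <- function(elpd) {
  w <- exp(elpd - max(elpd))
  w / sum(w)
}

# Made-up loo estimates for two models, for illustration only
elpd <- c(power = -12.3, exponential = -14.1)
w <- pseudo_bma_weights(elpd)</code></pre></figure>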
<p>However, Yao et al. (2018) and Vehtari et al. (2019) recommend against using these Pseudo BMA weights, as they do not take the uncertainty of the loo estimates into account. Instead, they suggest using Pseudo-BMA+ weights or stacking. For details, see Yao et al. (2018).</p>
<p>For an illuminating discussion about model selection based on marginal likelihoods or leave-one-out cross-validation, see Gronau & Wagenmakers (<a href="https://link.springer.com/article/10.1007/s42113-018-0011-7">2019a</a>), Vehtari et al. (<a href="https://link.springer.com/article/10.1007/s42113-018-0020-6">2019</a>), and Gronau & Wagenmakers (<a href="https://link.springer.com/article/10.1007/s42113-018-0022-4">2019b</a>).</p>
<p>Now that we have taken a look at these two perspectives on prediction, in the next section, we will implement the power law and the exponential model in Stan.</p>
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{PSBF}_{10} &= \frac{\prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_1)}{\prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_0)} \\[1em] -->
<!-- &= \text{exp}\left(\text{log} \, \prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_1) - \text{log} \, \prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_0)\right) \\[.5em] -->
<!-- &= \text{exp}\left(\sum_{i=1}^n \text{log} \, p(y_i \mid y_{-i}, \mathcal{M}_1) - \sum_{i=1}^n \text{log} \, p(y_i \mid y_{-i}, \mathcal{M}_0)\right) \\[.5em] -->
<!-- &= \text{exp}\left(\hat{\text{elpd}}^{\mathcal{M}_1}_{\text{loo}} - \hat{\text{elpd}}^{\mathcal{M}_0}_{\text{loo}} \right) \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<h1 id="implementation-in-stan">Implementation in Stan</h1>
<p>As is common in psychology, people do not deterministically follow a power law or an exponential law. Instead, the law is probabilistic: given the same task, the person will respond faster or slower, never exactly as before. To allow for this, we assume that there is Gaussian noise around the function value. In particular, we assume that</p>
<script type="math/tex; mode=display">\text{RT}_N \sim \mathcal{N}\left(\alpha + \beta N^{-r}, \sigma_e^2\right) \enspace .</script>
<p>Note that reaction times are not normally distributed; we address this later. We make the same assumption for the exponential model. The following code implements the power law model in Stan.</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">// In the data block, we specify everything that is relevant for
// specifying the data. Note that PRIOR_ONLY is a dummy variable used later.
data {
int<lower=1> n;
real y[n];
int<lower=0, upper=1> PRIOR_ONLY;
}
// In the parameters block, we specify all parameters we need.
// Although Stan implicitly adds flat priors on the (positive) real line
// we will specify informative priors below.
parameters {
real<lower=0> r;
real<lower=0> alpha;
real<lower=0> beta;
real<lower=0> sigma_e;
}
// In the model block, we specify our informative priors and
// the likelihood of the model, unless we want to sample only from
// the prior (i.e., if PRIOR_ONLY == 1)
model {
target += lognormal_lpdf(alpha | 0, .5);
target += lognormal_lpdf(beta | 1, .5);
target += gamma_lpdf(r | 1, 3);
target += gamma_lpdf(sigma_e | 0.5, 5);
if (PRIOR_ONLY == 0) {
for (trial in 1:n) {
target += normal_lpdf(y[trial] | alpha + beta * trial^(-r), sigma_e);
}
}
}
// In this block, we make posterior predictions (ypred) and compute
// the log likelihood of each data point (log_lik)
// which is needed for the computation of loo later
generated quantities {
real ypred[n];
real log_lik[n];
for (trial in 1:n) {
ypred[trial] = normal_rng(alpha + beta * trial^(-r), sigma_e);
log_lik[trial] = normal_lpdf(y[trial] | alpha + beta * trial^(-r), sigma_e);
}
}</code></pre></figure>
<p>From a marginal likelihood perspective, the prior is an integral part of the model; this means we have to think very carefully about it. There are several principles that can guide us (see also Lee & Vanpaemel, 2018), but one that is particularly helpful here is to look at the prior predictive distribution. Do draws from the prior predictive distribution look like what we had in mind? Below, I have visualized the mean, the standard deviation around the mean, and several draws from it for (a) flat priors on the positive real line, and (b) informed priors that I chose based on reading Evans et al. (2018). In the Stan code, you can specify flat priors by commenting out the priors we have specified in the model block.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>The figure on the left shows that flat priors make terrible predictions. The mean of the prior predictive distribution is a constant function at zero, which is not at all what we had in mind when writing down the power law model. Even worse, flat priors allow for negative reaction times, something that is clearly impossible! In contrast, the figure on the right seems reasonable. Below I have visualized the informed priors.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
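<p>Draws from the prior predictive distribution can also be sketched directly in R, without running Stan with <code>PRIOR_ONLY = 1</code>. The code below samples from the informed priors in the model block and pushes each draw through the power law with Gaussian noise. The number of draws and the number of trials are assumptions for illustration; note that both Stan's <code>gamma</code> and R's <code>rgamma</code> are used here in the shape and rate parameterization.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n_draws <- 1000  # assumed number of prior draws
trials <- 1:30   # assumed number of practice trials

# Draws from the informed priors specified in the Stan model block
alpha <- rlnorm(n_draws, meanlog = 0, sdlog = 0.5)
beta <- rlnorm(n_draws, meanlog = 1, sdlog = 0.5)
r <- rgamma(n_draws, shape = 1, rate = 3)
sigma_e <- rgamma(n_draws, shape = 0.5, rate = 5)

# One mean learning curve per prior draw (rows: draws, columns: trials)
mu <- alpha + beta * outer(r, trials, function(r, N) N^(-r))

# Add Gaussian noise with the draw-specific standard deviation
noise <- matrix(rnorm(n_draws * length(trials)), nrow = n_draws)
ypred <- mu + noise * sigma_e

# Mean and standard deviation of the prior predictive at each trial
prior_mean <- colMeans(ypred)
prior_sd <- apply(ypred, 2, sd)</code></pre></figure>
<p>Plotting a handful of rows of <code>ypred</code> against <code>trials</code> gives a quick impression of which learning curves the priors consider plausible.</p>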
<p>From a cross-validation perspective, priors do not matter <em>that</em> much; prediction is conditional on the observed data, and so the prior is transformed to a posterior before the model makes predictions. If the prior is not too misspecified, or if we have a sufficient amount of data so that the prior has only a weak influence on the posterior, a model’s posterior predictions will not markedly depend on it.</p>
<h1 id="practical-model-comparison">Practical model comparison</h1>
<p>Let’s check whether we select the correct model for the power law and the exponential data, respectively. I have generated the data above using the following code, which will make sense later in the blog post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sim_power</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sdlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="o">^</span><span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="p">)),</span><span class="w"> </span><span class="n">sdlog</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">sim_exp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sdlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="o">*</span><span class="n">N</span><span class="p">)),</span><span class="w"> </span><span class="n">sdlog</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">30</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sim_power</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="n">xe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sim_exp</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">.2</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span></code></pre></figure>
<p>We use the <em>bridgesampling</em> package to estimate Bayes factors and the <em>loo</em> package to estimate loo scores and stacking weights.</p>
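<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rstan'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'loo'</span><span class="p">)</span><span class="w">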
</span><span class="n">library</span><span class="p">(</span><span class="s1">'bridgesampling'</span><span class="p">)</span><span class="w">
</span><span class="n">comp_power</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s1">'stan-compiled/compiled-power-model.RDS'</span><span class="p">)</span><span class="w">
</span><span class="n">comp_exp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s1">'stan-compiled/compiled-exponential-model.RDS'</span><span class="p">)</span><span class="w">
</span><span class="c1"># power model data</span><span class="w">
</span><span class="n">fit_pp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_power</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xp</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xp</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># exponential data</span><span class="w">
</span><span class="n">fit_pe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_power</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xe</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ee</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xe</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>But before we do so, let’s visualize the posterior predictions of both models for each simulated data set, respectively.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>In the figure on the left, we see that the posterior predictions for the exponential model have a larger variance compared to the predictions of the power law model. Conversely, on the right it seems that the exponential model gives the better predictions.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
<p>We first compare the two models on the power law data, using the Bayes factor and loo. With the former, we find overwhelming evidence in favour of the power law model.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ep</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 54950.14619</code></pre></figure>
<p>Note that this estimate can vary with different runs (since we were stingy in sampling from the posterior). The comparison using loo yields the following output:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_pp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo_pp</span><span class="p">,</span><span class="w"> </span><span class="n">loo_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -14.2 5.4</code></pre></figure>
<p>Note that the best model is always on top, and the comparison is already on the difference score. Following a two standard error heuristic (but see <a href="https://discourse.mc-stan.org/t/interpreting-output-from-compare-of-loo/3380/4">here</a>), since the difference in the elpd scores is more than twice its standard error, we would choose the power law model as the better model. <strong>But wait</strong> – what do these warnings mean? Let’s look at the output of the loo function for the exponential law (suppressing the warning):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_ep</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Computed from 16000 by 30 log-likelihood matrix
##
## Estimate SE
## elpd_loo 22.9 5.7
## p_loo 7.4 3.9
## looic -45.8 11.5
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 28 93.3% 6801
## (0.5, 0.7] (ok) 1 3.3% 369
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 1 3.3% 9
## See help('pareto-k-diagnostic') for details.</code></pre></figure>
<p>We find that there is one very bad Pareto $k$ value. What does this mean? There are many details about loo that you can read up on in Vehtari et al. (2018). Put briefly, to efficiently compute the loo score of a model, Vehtari et al. (2018) use <em>importance sampling</em> to compute the predictive density, which requires finding importance weights to better approximate this density. These importance weights are known to be unstable, and the authors introduce a particular stabilizing transformation which they call “Pareto smoothed importance sampling” (PSIS). The parameter $k$ is the shape parameter of this (generalized) Pareto distribution. If it is high, such as $k > 0.7$ as we find for one data point here, then this implies unstable estimates; we should probably not trust the loo estimate for this model.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup> Similarly pathological behaviour can be diagnosed by p_loo, which gives an estimate of the effective number of parameters. In this case, this is about double the number of actual parameters ($\alpha, \beta, r, \sigma_e^2$).</p>
<p>The two figures below visualize the $k$ values for each data point. We see that loo has trouble predicting the first data point for both the exponential and the power law model.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
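<p>To make the role of the importance weights more concrete, here is a minimal sketch of plain (unsmoothed) importance-sampling loo in base R. It does not use the models from this post. Instead, it assumes a toy model, $y_i \sim \mathcal{N}(\mu, 1)$ with a flat prior on $\mu$, because its leave-one-out predictive density is available in closed form. All names (<code>log_lik</code>, <code>elpd_is</code>, <code>elpd_exact</code>) are made up for illustration, and the <em>loo</em> package additionally applies Pareto smoothing to the weights, which this sketch omits.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 30
S <- 10000
y <- rnorm(n)

# Posterior of mu under a flat prior and known sd = 1: N(mean(y), 1 / n)
mu <- rnorm(S, mean(y), sqrt(1 / n))

# S x n matrix of pointwise log-likelihoods, similar to what log_lik holds in Stan
log_lik <- sapply(y, function(yi) dnorm(yi, mu, 1, log = TRUE))

# Raw importance ratios for leaving out point i are 1 / p(y_i | mu_s), so the
# loo predictive density is the harmonic mean of p(y_i | mu_s) across draws
elpd_is <- -log(colMeans(exp(-log_lik)))

# Exact leave-one-out predictive density for this toy model
elpd_exact <- sapply(seq_len(n), function(i) {
  dnorm(y[i], mean(y[-i]), sqrt(1 + 1 / (n - 1)), log = TRUE)
})

# Largest pointwise discrepancy between the two estimates
max(abs(elpd_is - elpd_exact))</code></pre></figure>
<p>In this well-behaved example the raw weights are stable and the two sets of values agree closely. When a single observation is very influential, the weights become heavy-tailed and the estimate unreliable, which is what the $k$ diagnostic is meant to flag.</p>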
<p>We can also compute the stacking weights; but again, one should probably not trust these estimates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fit1</span><span class="p">,</span><span class="w"> </span><span class="n">fit2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_lik_list</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="s1">'1'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit1</span><span class="p">),</span><span class="w">
</span><span class="s1">'2'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">r_eff_list</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="s1">'1'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">relative_eff</span><span class="p">(</span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit1</span><span class="p">,</span><span class="w"> </span><span class="n">merge_chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)),</span><span class="w">
</span><span class="s1">'2'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">relative_eff</span><span class="p">(</span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit2</span><span class="p">,</span><span class="w"> </span><span class="n">merge_chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">loo_model_weights</span><span class="p">(</span><span class="n">log_lik_list</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'stacking'</span><span class="p">,</span><span class="w"> </span><span class="n">r_eff_list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_eff_list</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.995
## 2 0.005</code></pre></figure>
<p>We could compute a “Stacked Pseudo Bayes factor” by taking the ratio of these two weights to see how much more weight one model is given compared to the other.<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> This yields a factor of about 120 in favour of the power law model.</p>
<p>We can make the same comparisons using the exponential data. The Bayes factor again yields overwhelming support for the model that is closer to the true model, i.e., in this case the exponential model.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup> But note that the evidence is an order of magnitude smaller than in the above comparison.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_pe</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 1024.51382</code></pre></figure>
<p>This marked decrease in evidence is also tracked by loo, which now tells us that we cannot reliably distinguish between the two models<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_pe</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -8.3 5.6</code></pre></figure>
<p>Note that while we still get warnings, this time we only have one data point with $k \in [0.7, 1]$, which is bad, but not very bad. The stacking weights show that there is not a clear winner, with a factor of only about 6 in favour of the exponential model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">fit_pe</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.847
## 2 0.153</code></pre></figure>
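<p>As a quick arithmetic check, the factor quoted above is simply the ratio of the two printed stacking weights. The values below are copied by hand from the output, so they are rounded to three decimals, and the element names are only labels added for readability:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Stacking weights as printed above (model 1 = exponential, model 2 = power law)
w <- c(exponential = 0.847, power = 0.153)

# "Stacked Pseudo Bayes factor": ratio of the two weights, roughly 5.5
unname(w['exponential'] / w['power'])</code></pre></figure>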
<p>Now that we have seen how one might compare these models in practice, in the next two sections, we will see how we can extend the model to be more realistic. To that end, I will focus only on the exponential model. In fact, the exponential model is what many researchers now prefer as the “law of practice”. In a very influential article, Newell & Rosenbloom (1981) found that a power law fit best for data from a wide variety of tasks. While they relied on averaged data, Heathcote et al. (2000) looked at participant-specific data. They found that the decrease in reaction time follows an exponential function, implying that previous results were biased due to averaging; in fact, one can show that the (arithmetic) averaging of many exponential (i.e., non-linear) functions can lead to a group-level power law when individual differences exist (Myung, Kim, & Pitt, 2000).</p>
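<p>The averaging argument can be illustrated with a small simulation. The sketch below is not taken from Myung, Kim, & Pitt (2000). It assumes, purely for convenience, that learning rates vary across participants according to a Gamma distribution with shape $k$ and rate $\theta$, in which case the average of the individual curves $e^{-rN}$ is exactly the (shifted) power function $(1 + N / \theta)^{-k}$. The names <code>n_participants</code>, <code>shape</code>, and <code>rate</code> as well as their values are hypothetical.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n_participants <- 100000
shape <- 2
rate <- 5
trials <- 1:30

# Each participant learns exponentially, but with their own rate r
r <- rgamma(n_participants, shape = shape, rate = rate)

# Arithmetic average of the individual exponential curves at each trial
avg_curve <- sapply(trials, function(N) mean(exp(-r * N)))

# Power function implied by Gamma-distributed rates
power_curve <- (1 + trials / rate)^(-shape)

# The averaged curve and the power function agree up to Monte Carlo error
max(abs(avg_curve - power_curve))</code></pre></figure>
<p>Every simulated participant follows an exponential law, yet the group-level average is a power function, which is the kind of averaging artefact described above.</p>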
<h1 id="extension-i-modeling-plateaus">Extension I: Modeling plateaus</h1>
<p>The above two models assume that participants <em>get it</em> from the first trial onwards, and become better immediately. However, real data often exhibits a <em>plateau</em>. We simulate such data using the following lines of code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="n">rlnorm</span><span class="p">(</span><span class="m">15</span><span class="p">,</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">),</span><span class="w">
</span><span class="n">sim_exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-17-1.png" title="plot of chunk unnamed-chunk-17" alt="plot of chunk unnamed-chunk-17" style="display: block; margin: auto;" /></p>
<p>How can we model this? Evans et al. (2018) suggest introducing a single parameter, $\tau$, and adjusting the model as follows:</p>
<script type="math/tex; mode=display">f_e(N) = \alpha + \beta \, \frac{\tau + 1}{\tau + e^{rN}} \enspace .</script>
<p>Observe that for $\tau = 0$, we recover the original exponential model. For $\tau \rightarrow \infty$, the function becomes a constant function $\alpha + \beta$. Thus, large values for $\tau$ (and large values for $r$, so as to model the steep drop in reaction time) allow us to model the initial plateau we find in real data.</p>
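<p>These limiting cases are easy to check numerically. The helper functions below are written for illustration only (they are not the <code>sim_exp</code> function used earlier), the parameter values are arbitrary, and the standard exponential model is assumed to have the form $\alpha + \beta e^{-rN}$, which is what setting $\tau = 0$ in the equation above gives.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">f_exp <- function(N, alpha, beta, r) alpha + beta * exp(-r * N)
f_delayed <- function(N, alpha, beta, r, tau) {
  alpha + beta * (tau + 1) / (tau + exp(r * N))
}

N <- 1:50

# tau = 0 recovers the standard exponential model
all.equal(f_delayed(N, 1, 2, 0.3, tau = 0), f_exp(N, 1, 2, 0.3))

# A very large tau gives an essentially constant curve at alpha + beta = 3
range(f_delayed(N, 1, 2, 0.3, tau = 1e12))

# An intermediate tau gives a plateau followed by a drop; the drop is about
# half complete at N = log(tau) / r, where the curve is close to alpha + beta / 2
tau <- 1000
f_delayed(log(tau) / 0.3, 1, 2, 0.3, tau)</code></pre></figure>
<p>The last line shows how $\tau$ and $r$ jointly determine the length of the plateau: the larger $\tau$ is relative to $e^{rN}$, the longer the curve stays near $\alpha + \beta$.</p>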
<p>We can adjust the model in Stan easily:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">data {
int<lower=1> n;
real y[n];
int<lower=0, upper=1> PRIOR_ONLY;
}
parameters {
real<lower=0> r;
real<lower=0> alpha;
real<lower=0> beta;
real<lower=0> sigma_e;
real<lower=0> tau;
}
model {
real mu;
target += cauchy_lpdf(tau | 0, 1);
target += lognormal_lpdf(alpha | 0, .5);
target += lognormal_lpdf(beta | 1, .5);
target += gamma_lpdf(r | 1, 3);
target += gamma_lpdf(sigma_e | 0.5, 5);
if (PRIOR_ONLY == 0) {
for (trial in 1:n) {
mu = alpha + beta * (tau + 1) / (tau + exp(r * trial));
target += normal_lpdf(y[trial] | mu, sigma_e);
}
}
}
generated quantities {
real mu;
real ypred[n];
real log_lik[n];
for (trial in 1:n) {
mu = alpha + beta * (tau + 1) / (tau + exp(r * trial));
ypred[trial] = normal_rng(mu, sigma_e);
log_lik[trial] = normal_lpdf(y[trial] | mu, sigma_e);
}
}</code></pre></figure>
<p>We have put a half-Cauchy prior on $\tau$. This is because the Cauchy distribution has very fat tails compared to, for example, the Normal or the Laplace distribution; see the figure below. This is desirable, because we need large $\tau$ values to accommodate plateaus.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-19-1.png" title="plot of chunk unnamed-chunk-19" alt="plot of chunk unnamed-chunk-19" style="display: block; margin: auto;" /></p>
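<p>The difference in tail weight can also be quantified directly. As a rough sketch, the code below computes the prior probability that $\tau$ exceeds 10 under half-Cauchy, half-Laplace, and half-Normal priors, all with location 0 and scale 1. The threshold of 10 is an arbitrary choice, and the Laplace tail is computed by hand because base R has no Laplace distribution functions.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">threshold <- 10

# For a distribution symmetric around 0, the folded ("half") version has
# P(X > t) = 2 * P(Y > t), where Y follows the unfolded distribution
tail_prob <- c(
  cauchy = 2 * pcauchy(threshold, 0, 1, lower.tail = FALSE),
  laplace = exp(-threshold),  # 2 * (0.5 * exp(-t)) for a standard Laplace
  normal = 2 * pnorm(threshold, 0, 1, lower.tail = FALSE)
)
signif(tail_prob, 2)</code></pre></figure>
<p>Only the half-Cauchy leaves appreciable prior mass (about 6%) on values this large; under the other two priors such values are practically ruled out.</p>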
<p>The figure below compares the prior predictive distributions of the exponential to the <em>delayed exponential</em> model. As we can see, the additional $\tau$ parameter creates larger uncertainty in the predictions, with some individual draws looking completely different from each other. <a href="https://betanalpha.github.io/assets/case_studies/fitting_the_cauchy.html">Drawing samples from a Cauchy</a>, whose mean and variance are <a href="https://en.wikipedia.org/wiki/Cauchy_distribution#Explanation_of_undefined_moments">undefined</a>, is tricky.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto;" /></p>
<p>However, if we compare the two models to each other on the plateau data set, we see that the extended model predicts the data better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit_ee</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ee_plateau</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp_delayed</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 152.06430</code></pre></figure>
<p>This is a large Bayes factor. However, loo seems to favour the delayed exponential model much more, by about three standard errors:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -10.4 3.3</code></pre></figure>
<p>This is also reflected in an extreme difference in stacking weights, which completely discounts the standard exponential model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ee</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 1.000
## 2 0.000</code></pre></figure>
<p>Earlier, when comparing the exponential and power model on data generated from an exponential model, we found a Bayes factor in favour of the exponential model of about 1000, while the difference in loo was only about 1.5 standard errors. Here, we now find a Bayes factor in favour of the delayed exponential model of only about 150, while loo finds a difference of about three standard errors. This contrast is illuminated by visualizing the posterior predictive distributions; see below.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-24-1.png" title="plot of chunk unnamed-chunk-24" alt="plot of chunk unnamed-chunk-24" style="display: block; margin: auto;" /></p>
<p>We see that the delayed exponential model seems to “fit” the data much better.<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup> Since loo basically uses this posterior predictive distribution (except that it removes one data point) to predict individual data points, it now seems clearer why it should favour the delayed exponential model so much more strongly. In contrast, the prior predictive distributions of the exponential and the delayed exponential (visualized above) are rather similar. Since the Bayes factor evaluates the probability of the data under these prior predictive distributions, the evidence in favour of the delayed exponential model is comparatively modest.</p>
<h1 id="extension-ii-modeling-reaction-times">Extension II: Modeling reaction times</h1>
<h2 id="the-lognormal-distribution">The Lognormal distribution</h2>
<p>Reaction times are well known to be non-normally distributed. One popular distribution for them is the <em>Lognormal distribution</em>. Let $X \sim \mathcal{N}(\mu, \sigma^2)$. Then $Z = e^X$ is lognormally distributed. To arrive at its density function, we do a change of variables. Observe that</p>
<script type="math/tex; mode=display">P_z(Z \leq z) = P_z\left(e^X \leq z\right) = P_x(X \leq \text{log}\,z) = F_x(\text{log}\,z) \enspace ,</script>
<p>where $F_x$ is the cumulative distribution function of $X$. Differentiating with respect to $z$ yields the probability density function for $Z$:</p>
<script type="math/tex; mode=display">p_z(z) = \frac{\mathrm{d}}{\mathrm{d} z} P_z(Z \leq z) = \frac{\mathrm{d}}{\mathrm{d} z} F_x(\text{log}\,z) = p_x(\text{log}\,z) \left|\frac{\mathrm{d}\, \text{log}\,z}{\mathrm{d}z}\right| = p_x(\text{log}\,z) \frac{1}{z} \enspace ,</script>
<p>which spelled out is</p>
<script type="math/tex; mode=display">p_z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2} (\text{log}\,z - \mu)^2\right) \frac{1}{z} \enspace .</script>
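<p>As a quick sanity check (a sketch, not part of the original analysis), the density derived above can be coded by hand and compared against R’s built-in <code>dlnorm</code>. The function name <code>dlognormal_manual</code> and the parameter values are made up for illustration:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hand-coded Lognormal density, following the change-of-variables result:
# p_z(z) = p_x(log z) * (1 / z), where p_x is the Normal(mu, sigma^2) density
dlognormal_manual <- function(z, mu, sigma) {
  dnorm(log(z), mean = mu, sd = sigma) / z
}

z <- seq(0.1, 10, by = 0.1)
mu <- 1       # illustrative value
sigma <- 0.5  # illustrative value

# Compare the manual density with R's built-in Lognormal density
all.equal(
  dlognormal_manual(z, mu, sigma),
  dlnorm(z, meanlog = mu, sdlog = sigma)
)</code></pre></figure>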
<p>The figure below visualizes various Lognormal distributions with different parameters $\mu$ and $\sigma^2$. The figure on the left shows how a change in $\sigma^2$ affects the distribution, while keeping $\mu = 1$. You can see that the “peak” of the distribution shifts even though $\mu$ is held fixed, indicating that the location of the distribution depends on $\sigma^2$ as well as on $\mu$. In fact, while the mean and variance of a normal distribution are given by $\mu$ and $\sigma^2$, respectively, this is not so for a Lognormal distribution. It seems difficult to compute the first two moments of the Lognormal distribution directly (you can try if you want!). However, there is a neat trick to compute <em>all</em> its moments basically instantaneously.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-25-1.png" title="plot of chunk unnamed-chunk-25" alt="plot of chunk unnamed-chunk-25" style="display: block; margin: auto;" /></p>
<p>To do this, observe that the <a href="https://en.wikipedia.org/wiki/Moment-generating_function"><em>moment generating function</em></a> (MGF) of a random variable $X$ is given by</p>
<script type="math/tex; mode=display">M_X(t) := \mathbb{E}\left[e^{tX}\right] \enspace .</script>
<p>Now, we’re not going to use the MGF of the Lognormal; in fact, because the defining integral diverges for every $t > 0$, it does not exist. Instead, we’ll use the MGF of a Normal distribution. Since $Z = e^X$, we can write the $t^{\text{th}}$ moment of $Z$ as</p>
<script type="math/tex; mode=display">\mathbb{E}\left[Z^t\right] = \mathbb{E}\left[e^{tX}\right] = M_X(t) = \text{exp}\left(t\mu + \frac{1}{2}t^2 \sigma^2\right) \enspace ,</script>
<p>where the last expression is the MGF of a normal distribution (see also Blitzstein & Hwang, 2014, p. 260-261). Thus, the mean and variance of a Lognormal distribution are given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[Z] &= \text{exp}\left(\mu + \frac{1}{2} \sigma^2\right) \\[.5em]
\text{Var}[Z] &= \mathbb{E}[Z^2] - \mathbb{E}[Z]^2 \\[.5em]
&= \text{exp}\left(2\mu + 2\sigma^2\right) - \text{exp}\left(2\mu + \sigma^2\right) \\[.5em]
&= \text{exp}\left(2\mu + \sigma^2\right) \left(\text{exp}\left(\sigma^2\right) - 1 \right) \enspace .
\end{aligned} %]]></script>
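<p>The closed-form moments can be checked numerically. The following is a minimal sketch that integrates $z^k \, p_z(z)$ with R’s <code>integrate</code>; the helper <code>raw_moment</code> and the parameter values are illustrative assumptions, not part of the original analysis:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">mu <- 1       # illustrative value
sigma <- 0.5  # illustrative value

# Closed-form mean and variance obtained from the MGF of the Normal distribution
mean_formula <- exp(mu + sigma^2 / 2)
var_formula <- exp(2 * mu + sigma^2) * (exp(sigma^2) - 1)

# k-th raw moment E[Z^k], computed by numerically integrating z^k * p_z(z)
raw_moment <- function(k) {
  integrate(function(z) z^k * dlnorm(z, meanlog = mu, sdlog = sigma),
            lower = 0, upper = Inf)$value
}

mean_numeric <- raw_moment(1)
var_numeric <- raw_moment(2) - raw_moment(1)^2

# Each pair should agree up to numerical integration error
c(formula = mean_formula, numeric = mean_numeric)
c(formula = var_formula, numeric = var_numeric)</code></pre></figure>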
<p>This dependency between mean and variance is desired. In particular, it is well established that changes in mean reaction times are accompanied by proportional changes in the standard deviation (Wagenmakers & Brown, 2007).</p>
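<p>To make the proportionality explicit (a short step that follows directly from the two moments above), take the square root of the variance:</p>
<script type="math/tex; mode=display">\sqrt{\text{Var}[Z]} = \text{exp}\left(\mu + \frac{1}{2}\sigma^2\right) \sqrt{\text{exp}\left(\sigma^2\right) - 1} = \mathbb{E}[Z] \, \sqrt{\text{exp}\left(\sigma^2\right) - 1} \enspace .</script>
<p>For fixed $\sigma$, the standard deviation is thus a constant multiple of the mean, so any change in $\mu$ scales both by the same factor.</p>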
<p><a href="https://fdabl.github.io/statistics/Two-Properties.html">In contrast</a> to the Normal distribution, the Lognormal distribution is not closed under addition. This means that if $Z$ has a Lognormal distribution, $Z + \delta$ does not necessarily have a Lognormal distribution anymore. However, we are interested in modeling <em>shifts</em> in reaction times. For example, there is a reaction time $\alpha$ faster than which participants cannot meaningfully respond. To allow for such shifts, we extend the Lognormal distribution by a shift parameter $\delta$ such that $Z = \delta + e^{X}$.<sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup> This leads to the following density function</p>
<script type="math/tex; mode=display">p_z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2} (\text{log}\,(z - \delta) - \mu)^2\right) \frac{1}{z - \delta} \enspace .</script>
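<p>In R, one way to sketch this density is to evaluate the standard Lognormal density at $z - \delta$. The names <code>dshiftlnorm</code> and <code>dshiftlnorm_explicit</code> and the parameter values below are hypothetical; the check simply confirms that this shortcut agrees with the formula written out above and that the density integrates to one:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Shifted-Lognormal density via the standard Lognormal density at z - delta
dshiftlnorm <- function(z, delta, mu, sigma) {
  dlnorm(z - delta, meanlog = mu, sdlog = sigma)
}

# The same density, written out term by term as in the equation above
dshiftlnorm_explicit <- function(z, delta, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) *
    exp(-(log(z - delta) - mu)^2 / (2 * sigma^2)) / (z - delta)
}

delta <- 0.3  # illustrative values
mu <- -0.5
sigma <- 0.4

z <- seq(0.35, 3, by = 0.05)  # all values lie above the shift delta
all.equal(dshiftlnorm(z, delta, mu, sigma), dshiftlnorm_explicit(z, delta, mu, sigma))

# The total probability mass on (delta, Inf) should be (numerically) one
total_mass <- integrate(dshiftlnorm, lower = delta, upper = Inf,
                        delta = delta, mu = mu, sigma = sigma)$value
total_mass</code></pre></figure>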
<h2 id="extending-the-model">Extending the model</h2>
<p>We extend the delayed exponential model such that</p>
<script type="math/tex; mode=display">\text{RT} \sim \text{Shifted-Lognormal}\left(\delta, \, \text{log}\left(\alpha' + \beta \frac{\tau + 1}{\tau + e^{rN}}\right), \, \sigma \right) \enspace.</script>
<p>The median of a Shifted-Lognormal distribution is given by $\delta + e^\mu$, which is why we take the logarithm of the main part of the model above. Note that the previous asymptote $\alpha$ is now $\delta + \alpha'$. To be on the same scale as before, we assign $\delta$ and $\alpha'$ Lognormal priors, each with a median of $0.50$.</p>
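<p>The following sketch spells out what this implies for the median reaction time across trials. The function <code>median_rt</code> and all parameter values are illustrative assumptions, not estimates from the analysis:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Median reaction time at trial N under the Shifted-Lognormal model:
# delta + exp(mu_N), with exp(mu_N) = alpha' + beta * (tau + 1) / (tau + exp(r * N))
median_rt <- function(N, delta, alpha_prime, beta, tau, r) {
  delta + alpha_prime + beta * (tau + 1) / (tau + exp(r * N))
}

# Illustrative parameter values
delta <- 0.5
alpha_prime <- 0.5
beta <- 2
tau <- 5
r <- 0.2

# Without practice (N = 0) the median is delta + alpha' + beta
start <- median_rt(0, delta, alpha_prime, beta, tau, r)

# With a lot of practice the median approaches the asymptote delta + alpha'
asymptote <- median_rt(1000, delta, alpha_prime, beta, tau, r)

c(start = start, asymptote = asymptote)

# The median of a Lognormal(log(0.50), 0.5) prior is 0.50
prior_median <- qlnorm(0.5, meanlog = log(0.50), sdlog = 0.5)
prior_median</code></pre></figure>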
<p>We can implement the model in Stan by writing a Shifted-Lognormal probability density function:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">// In the function block, we define our new probability density function
functions {
  real shiftlognormal_lpdf(real z, real delta, real mu, real sigma) {
    real lprob;
    lprob = (
      -log((z - delta)*sigma*sqrt(2*pi())) -
      (log(z - delta) - mu)^2 / (2*sigma^2)
    );
    return lprob;
  }
  real shiftlognormal_rng(real delta, real mu, real sigma) {
    return delta + lognormal_rng(mu, sigma);
  }
}

// In the data block, we specify everything that is relevant for
// specifying the data. Note that PRIOR_ONLY is a dummy variable used later.
data {
  int<lower=1> n;
  real y[n];
  int<lower=0, upper=1> PRIOR_ONLY;
}

// In the parameters block, we specify all parameters we need.
// Although Stan implicitly adds flat priors on the (positive) real line,
// we will specify informative priors below.
parameters {
  real<lower=0> r;
  real<lower=0> tau;
  real<lower=0> delta;
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> sigma_e;
}

// In the model block, we specify our informative priors and
// the likelihood of the model, unless we want to sample only from
// the prior (i.e., if PRIOR_ONLY == 1)
model {
  real mu;
  target += cauchy_lpdf(tau | 0, 1);
  target += lognormal_lpdf(delta | log(0.50), .5);
  target += lognormal_lpdf(alpha | log(0.50), .5);
  target += lognormal_lpdf(beta | 1, .5);
  target += gamma_lpdf(r | 1, 3);
  target += gamma_lpdf(sigma_e | 0.5, 5);
  if (PRIOR_ONLY == 0) {
    for (trial in 1:n) {
      mu = log(alpha + beta * (tau + 1) / (tau + exp(r*trial)));
      target += shiftlognormal_lpdf(y[trial] | delta, mu, sigma_e);
    }
  }
}

// In this block, we make posterior predictions (ypred) and compute
// the pointwise log likelihood of each data point (log_lik),
// which is needed for the computation of loo later
generated quantities {
  real mu;
  real ypred[n];
  real log_lik[n];
  for (trial in 1:n) {
    mu = log(alpha + beta * (tau + 1) / (tau + exp(r*trial)));
    ypred[trial] = shiftlognormal_rng(delta, mu, sigma_e);
    log_lik[trial] = shiftlognormal_lpdf(y[trial] | delta, mu, sigma_e);
  }
}
}</code></pre></figure>
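<p>To get a feel for what the priors in the model block imply, one can mimic prior predictive sampling in plain R. This is only a rough sketch of what running the Stan program with <code>PRIOR_ONLY = 1</code> would produce, not the code used for this post; it assumes that R’s <code>rgamma(shape, rate)</code> matches Stan’s shape/rate parameterisation of the Gamma distribution, and the number of draws is arbitrary:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)  # only to make this illustration reproducible

n_trials <- 50
n_draws <- 1000  # arbitrary number of prior draws
trials <- seq_len(n_trials)

# Draw parameters from the priors specified in the model block
tau     <- abs(rcauchy(n_draws, location = 0, scale = 1))  # half-Cauchy, since tau > 0
delta   <- rlnorm(n_draws, meanlog = log(0.50), sdlog = 0.5)
alpha   <- rlnorm(n_draws, meanlog = log(0.50), sdlog = 0.5)
beta    <- rlnorm(n_draws, meanlog = 1, sdlog = 0.5)
r       <- rgamma(n_draws, shape = 1, rate = 3)
sigma_e <- rgamma(n_draws, shape = 0.5, rate = 5)

# Simulate one data set per prior draw; rows are draws, columns are trials
y_prior <- t(sapply(seq_len(n_draws), function(i) {
  mu <- log(alpha[i] + beta[i] * (tau[i] + 1) / (tau[i] + exp(r[i] * trials)))
  delta[i] + rlnorm(n_trials, meanlog = mu, sdlog = sigma_e[i])
}))

dim(y_prior)

# Prior predictive median reaction time for the first and the last trial
apply(y_prior[, c(1, n_trials)], 2, median)</code></pre></figure>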
<p>Visualizing the prior predictive distributions offers only limited insight into how the delayed exponential model and its lognormal extension differ:</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-27-1.png" title="plot of chunk unnamed-chunk-27" alt="plot of chunk unnamed-chunk-27" style="display: block; margin: auto;" /></p>
<p>The Bayes factor in favour of the lognormal model is quite large:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit_ee_plateau_lognormal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp_delayed_log</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">adapt_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.9</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 200.86885</code></pre></figure>
<p>In contrast, loo shows barely any evidence: the difference is only about 1.5 standard errors, so we would <em>not</em> confidently choose the lognormal model and remain undecided. (The warning is because we have one $k \in [0.5, 0.7]$.)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -4.9 3.3</code></pre></figure>
<p>Using stacking, the weights result in a factor of about 24 in favour of the lognormal model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ee_plateau</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.961
## 2 0.039</code></pre></figure>
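<p>The two summary numbers quoted in this section are simple ratios of the printed output. As a small worked example (values copied by hand from the output above; the variable names are made up):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Stacking weights for the lognormal (model 1) and normal (model 2) variants
weights <- c(lognormal = 0.961, normal = 0.039)
weight_ratio <- unname(weights["lognormal"] / weights["normal"])
weight_ratio  # roughly 24.6

# Difference in expected log predictive density and its standard error
elpd_diff <- -4.9
se_diff <- 3.3
se_ratio <- abs(elpd_diff) / se_diff
se_ratio  # roughly 1.5 standard errors</code></pre></figure>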
<p>The contrast between the Bayes factor and loo can again be illuminated by looking at the posterior predictive distribution for the respective models; see below. The lognormal model can account for the decrease in variance with increased practice. Still, the predictions of both models are rather similar, so that there is little to distinguish them using loo.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-31-1.png" title="plot of chunk unnamed-chunk-31" alt="plot of chunk unnamed-chunk-31" style="display: block; margin: auto;" /></p>
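<p>Why the lognormal model implies a shrinking spread can be sketched in a few lines of R. The shift $\delta$ does not affect the standard deviation, so the standard deviation at trial $N$ is that of a Lognormal distribution with location $\mu_N$. The function <code>sd_rt</code> and the parameter values are illustrative assumptions:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Standard deviation of a (Shifted-)Lognormal reaction time at trial N:
# exp(mu_N + sigma^2 / 2) * sqrt(exp(sigma^2) - 1)
sd_rt <- function(N, alpha_prime, beta, tau, r, sigma) {
  exp_mu <- alpha_prime + beta * (tau + 1) / (tau + exp(r * N))  # exp(mu_N)
  exp_mu * exp(sigma^2 / 2) * sqrt(exp(sigma^2) - 1)
}

trials <- c(1, 10, 25, 50)
sd_by_trial <- sd_rt(trials, alpha_prime = 0.5, beta = 2, tau = 5, r = 0.2, sigma = 0.3)

# The standard deviation decreases with practice, although sigma is constant
round(sd_by_trial, 3)</code></pre></figure>
<p>In the Normal models, in contrast, the standard deviation is $\sigma$ on every trial.</p>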
<h1 id="modeling-recap">Modeling recap</h1>
<p>Focusing on the exponential model, we have successively made our modeling more sophisticated (see also Evans et al., 2018). Barring priors, we have encountered the following models:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(1) \hspace{1em} &\text{RT} \sim \mathcal{N}\left(\alpha + \beta e^{-rN}, \sigma^2 \right) \\[1em]
(2) \hspace{1em} &\text{RT} \sim \mathcal{N}\left(\alpha + \beta \, \frac{\tau + 1}{\tau + e^{rN}}, \sigma^2 \right) \\[.5em]
(3) \hspace{1em} &\text{RT} \sim \text{Shifted-Lognormal}\left(\delta, \, \text{log}\left(\alpha' + \beta \frac{\tau + 1}{\tau + e^{rN}}\right), \, \sigma \right) \enspace .
\end{aligned} %]]></script>
<p>We went from (1) to (2) to account for learning plateaus; sometimes, participants take a while for learning to really kick in. We went from (2) to (3) to account for the fact that reaction times are decidedly nonnormal, and that there is a linear relationship between the mean and the standard deviation of reaction times.</p>
<p>We have also compared the power law to the exponential function, but so far we have looked only at simulated data. Since this blog post is already quite lengthy, we defer the treatment of real data to a future blog post.</p>
<!-- # Using real data -->
<!-- ```{r, echo = FALSE, fig.width = 10, fig.height = 5, fig.align = 'center', message = FALSE, warning = FALSE} -->
<!-- library('dplyr') -->
<!-- library('ggpubr') -->
<!-- library('ggplot2') -->
<!-- set.seed(1) -->
<!-- dat <- readRDS('stan-compiled/Evans-dat.RDS') -->
<!-- datf <- filter(dat, task == 2, id %in% sample(1:36, replace = TRUE, size = 4)) -->
<!-- ggplot(datf, aes(x = trial, y = RT)) + -->
<!-- geom_point(size = 1, alpha = .4) + -->
<!-- facet_wrap(~ id, scales = 'free', nrow = 2) + theme_pubclean() + -->
<!-- xlab('Trial') + -->
<!-- ylab('Reaction Time (sec)') -->
<!-- ``` -->
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we took a closer look at what has been dubbed the <em>law of practice</em>: the empirical fact that reaction time decreases as an exponential function (previously: power law) with practice. We compared two perspectives on prediction: one based on marginal likelihoods, and one based on leave-one-out cross-validation. The latter allows the model to gorge on data, update its parameters, and then make predictions based on the <em>posterior predictive distribution</em>, while the former forces the model to make predictions using the <em>prior predictive distribution</em>. We have implemented the power law and exponential model in Stan, and extended the latter to model an initial learning plateau and account for the empirical observation that not only mean reaction time decreases, but also its variance.</p>
<hr />
<p><em>I would like to thank Nathan Evans, Quentin Gronau, and Don van den Bergh for helpful comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Blitzstein, J. K., & Hwang, J. (<a href="https://projects.iq.harvard.edu/stat110/home">2014</a>). <em>Introduction to Probability</em>. London, UK: Chapman and Hall/CRC.</li>
<li>Evans, N. J., Brown, S. D., Mewhort, D. J., & Heathcote, A. (<a href="https://psycnet.apa.org/record/2018-30695-005">2018</a>). Refining the law of practice. <em>Psychological Review, 125</em>(4), 592-605.</li>
<li>Fong, E., & Holmes, C. (<a href="https://arxiv.org/abs/1905.08737">2019</a>). On the marginal likelihood and cross-validation. arXiv preprint arXiv:1905.08737.</li>
<li>Gelman, A., & Shalizi, C. R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8317.2011.02037.x">2013</a>). Philosophy and the practice of Bayesian statistics. <em>British Journal of Mathematical and Statistical Psychology, 66</em>(1), 8-38.</li>
<li>Gronau, Q. F., & Wagenmakers, E. J. (<a href="https://link.springer.com/article/10.1007/s42113-018-0011-7">2019a</a>). Limitations of Bayesian leave-one-out cross-validation for model selection. <em>Computational Brain & Behavior, 2</em>(1), 1-11.</li>
<li>Gronau, Q. F., & Wagenmakers, E. J. (<a href="https://link.springer.com/article/10.1007/s42113-018-0022-4">2019b</a>). Rejoinder: More limitations of Bayesian leave-one-out cross-validation. <em>Computational Brain & Behavior, 2</em>(1), 35-47.</li>
<li>Heathcote, A., Brown, S., & Mewhort, D. J. K. (<a href="https://link.springer.com/article/10.3758/BF03212979">2000</a>). The power law repealed: The case for an exponential law of practice. <em>Psychonomic Bulletin & Review, 7</em>(2), 185-207.</li>
<li>Lee, M. D., & Vanpaemel, W. (<a href="https://link.springer.com/article/10.3758/s13423-017-1238-3">2018</a>). Determining informative priors for cognitive models. <em>Psychonomic Bulletin & Review, 25</em>(1), 114-127.</li>
<li>Lee, M. D. (<a href="https://osf.io/zky2v/">2018</a>). Bayesian methods in cognitive modeling. <em>Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience, 5</em>, 1-48.</li>
<li>Myung, I. J., Kim, C., & Pitt, M. A. (<a href="https://link.springer.com/article/10.3758/BF03198418">2000</a>). Toward an explanation of the power law artifact: Insights from response surface analysis. <em>Memory & Cognition, 28</em>(5), 832-840.</li>
<li>Newell, A. (1973). You can’t play 20 questions with nature and win: Projective comments on the papers of this symposium. In W. G. Chase (Ed.), <em>Visual information processing</em> (pp. 283-308). New York, US: Academic Press.</li>
<li>Newell, A., & Rosenbloom, P. S. (1981). Mechanisms of skill acquisition and the law of practice. In J.R. Anderson (Ed.), <em>Cognitive skills and their acquisition</em> (pp. 1-55). Hillsdale, NJ: Erlbaum.</li>
<li>Vehtari, A., Gelman, A., & Gabry, J. (<a href="https://arxiv.org/abs/1507.02646">2015</a>). Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646.</li>
<li>Vehtari, A., Simpson, D. P., Yao, Y., & Gelman, A. (<a href="https://link.springer.com/article/10.1007/s42113-018-0020-6">2019</a>). Limitations of “Limitations of Bayesian leave-one-out cross-validation for model selection”. <em>Computational Brain & Behavior, 2</em>(1), 22-27.</li>
<li>Vehtari, A., Gelman, A., & Gabry, J. (<a href="https://link.springer.com/article/10.1007/s11222-016-9696-4">2017</a>). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. <em>Statistics and Computing, 27</em>(5), 1413-1432.</li>
<li>Wagenmakers, E. J., & Brown, S. (<a href="http://www.ejwagenmakers.com/2007/WagenmakersBrown2007.pdf">2007</a>). On the linear relation between the mean and the standard deviation of a response time distribution. <em>Psychological Review, 114</em>(3), 830-841.</li>
<li>Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (<a href="https://projecteuclid.org/euclid.ba/1516093227">2018</a>). Using stacking to average Bayesian predictive distributions (with discussion). <em>Bayesian Analysis, 13</em>(3), 917-1003.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<!-- There are at least two ways to see this. For simplicity, let $\alpha = 0$. Taking logarithms on both sides, the two equations become: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{log} \, f_e(N) &= \text{log} \, \beta - r N \hspace{3em} (1) \\[.5em] -->
<!-- \text{log} \, f_p(N) &= \text{log} \, \beta - r \, \text{log} \, N \hspace{1.4em} (2) -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- We see that the power law model (2) is a linear function in $r$ (on the log scale). The exponential model (1), in contrast, is a non-linear function in $r$ (on the log scale), converging to the asymptote much quicker than the power law model. Moreover, the exponential model allows only for a constant difference between the reaction times on different trials (on the log scale); the difference in log reaction time on one trial and the trial right after is $r$. In contrast, the power model (2) allows that the difference scales in $N$. In the beginning of the practice trials, the difference between trials is comparatively large; for example, $\text{log}(1) - \text{log}(2) = -0.69$. With increasing $N$, the differences between the log reaction times gets smaller and smaller; for example, $\text{log}(10) - \text{log}(11) = -0.10$. Therefore, learning slows down with increased practice. -->
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This blog post is heavily based on the modeling work of Evans et al. (2018). If you want to know more, I encourage you to check out the paper! <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You may also rewrite the power law as an exponential function, i.e., $f_p(N) = \alpha + \beta e^{-r \, \text{log} N}$, to see their algebraic difference more clearly. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>As it turns out, there is a connection between the two: “[…] the marginal likelihood is formally equivalent to exhaustive leave-$p$-out cross-validation averaged over all values of $p$ and all held-out test sets when using the log posterior predictive probability as a scoring rule.” (Fong & Holmes, 2019). <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Note that, if a simpler model is not ruled out by the data, then it might be reasonable to obtain evidence in favour of it, even though a more complex model has generated the data. For $n \rightarrow \infty$, we would probably still want to have consistent model selection; that is, select the model which actually generated the data, assuming that it is in the set of models we are considering. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>For data points for which $k > 0.7$, Vehtari et al. (2017) suggest computing the predictive density directly, instead of relying on Pareto smoothed importance sampling. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>I am not sure whether Vehtari would like this “Stacked Pseudo Bayes factor”. <a href="https://discourse.mc-stan.org/t/interpreting-elpd-diff-loo-package/1628/2">Here</a>, he seems to suggest that one should choose a model when its stacking weight is 1. Otherwise, I suppose his philosophy is aligned with that of Gelman & Shalizi (2013), i.e., expand the model so that it can account for whatever the other model does better. Update: <a href="https://twitter.com/avehtari/status/1134121009282539521">Here’s</a> Vehtari himself. Three cheers for the web! <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Although there is a temporal order in the data, as trial 21 cannot, for example, come before trial 12, we are not particularly interested in predicting the future using past observations. Therefore, using vanilla loo seems adequate. If one is interested in predicting future observations, one could use approximate <em>leave-future-out</em> cross-validation, see Bürkner, Gabry, & Vehtari (<a href="https://arxiv.org/abs/1902.06281">2019</a>). <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Michael Lee argues that this idea of trying to recover the “true” model is misguided. Bayes’ rule gives the optimal means to select among models. Given a particular data set, if the Bayes factor favours a model that did, in fact, not generate the data, it is still correct to select this model. After all, it predicts the data better than the model that generated it. Lee (2018) distinguishes between <em>inference</em> (saying what follows from a model and data) and <em>inversion</em> (recovering truth). <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>As Lee (2018, p. 27) points out, the word “fit” is unhelpful in a Bayesian context. There are no degrees of freedom; once the model is specified, inference follows directly from probability theory. So “updating a model” is better terminology than “fitting a model”. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>One can also do this without writing a custom function by using the standard lognormal function, and then just subtracting the shift parameter $\delta$ in the function call. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian Dablander
Two perspectives on regularization2019-04-15T12:00:00+00:002019-04-15T12:00:00+00:00https://fabiandablander.com/r/Regularization<p>Regularization is the process of adding information to an estimation problem so as to avoid extreme estimates. Put differently, it safeguards against foolishness. Both Bayesian and frequentist methods can incorporate prior information which leads to regularized estimates, but they do so in different ways. In this blog post, I illustrate these two different perspectives on regularization on the simplest example possible — estimating the bias of a coin.</p>
<!-- When I am too lazy to cook porridge, I usually buy bread from the local bakery and have bread with (vegan) butter for breakfast. Assume I am unusually clumsy, and my freshly spread slice of bread slips out of my hand, onto the floor. Did the butter land on the floor? Yes! How can we model this process? -->
<h2 id="modeling-coin-flips">Modeling coin flips</h2>
<p>Let’s say that we are interested in estimating the bias of a coin, which we take to be the probability of the coin showing heads.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In this section, we will derive the Binomial likelihood, the statistical model that we will use for modeling coin flips. Let $X \in \{0, 1\}$ be a discrete random variable with realization $X = x$. Flipping the coin once, let the outcome $x = 0$ correspond to tails and $x = 1$ to heads. We use the Bernoulli likelihood to connect the data to the latent parameter $\theta$, which we take to be the bias of the coin:</p>
<script type="math/tex; mode=display">p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x} \enspace .</script>
<p>There is no point in estimating the bias by flipping the coin only once. We are therefore interested in a model that can account for $n$ coin flips. If we are willing to assume that the individual coin flips are <em>independent and identically</em> distributed conditional on $\theta$, we can obtain the joint probability of all outcomes by multiplying the probability of the individual outcomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(x_1, \ldots, x_n \mid \theta) &= \prod_{i=1}^n p(x_i \mid \theta) \\[.5em]
&= \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \\[.5em]
&= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{\sum_{i=1}^n (1 - x_i)} \enspace .
\end{aligned} %]]></script>
<p>For the purposes of estimating the coin’s bias, we actually do not care about the order in which the coins come up heads or tails; we only care about how frequently the coin shows heads or tails out of $n$ throws. Thus, we do not model the individual outcomes $X_i$, but instead model their sum $Y = \sum_{i=1}^n X_i$. We write:</p>
<script type="math/tex; mode=display">p(y \mid \theta) = \theta^{y} (1 - \theta)^{n - y} \enspace ,</script>
<p>where we suppress conditioning on $n$ to not clutter notation. Note that our model is not complete — we need to account for the fact that there are several ways to get $y$ heads out of $n$ throws. For example, we can get $y = 2$ with $n = 3$ in three different ways: $(1, 1, 0)$, $(0, 1, 1)$, and $(1, 0, 1)$. If we were to use the model above, we would underestimate the probability of observing two heads out of three coin tosses by a factor of three.</p>
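<p>As a quick sanity check of the counting argument above, the following sketch enumerates all sequences of three coin flips in R. The variable names and the value $\theta = 0.5$ are chosen here purely for illustration.</p>

```r
# Enumerate all 2^3 possible sequences of three coin flips (1 = heads, 0 = tails)
n <- 3
sequences <- expand.grid(rep(list(c(0, 1)), n))

# Keep only the sequences that contain exactly y = 2 heads
y <- 2
with_y_heads <- sequences[rowSums(sequences) == y, ]
n_ways <- nrow(with_y_heads)

# Probability of one particular sequence versus any sequence with y heads
theta <- 0.5  # illustrative value, not taken from the text
p_single_sequence <- theta^y * (1 - theta)^(n - y)
p_any_order <- n_ways * p_single_sequence
```

<p>Here <code>n_ways</code> counts the orderings listed above, and <code>p_any_order</code> is larger than <code>p_single_sequence</code> by exactly that factor.</p>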
<p>In general, there are $n!$ possible ways in which we can order the outcomes. To see this, think of $n$ containers. The first outcome can go in any container, the second one in any container but the container which houses the first outcome, and so on, which yields:</p>
<script type="math/tex; mode=display">n \times (n - 1) \times (n - 2) \ldots \times 1 = n! \enspace .</script>
<p>However, we only care about $y$ of them, so we need to remove the remaining $(n - y)!$ possible ways. Moreover, once we have taken $y$ outcomes, we do not care about <em>their</em> order; thus we remove another $y!$ permutations. Therefore, for any particular sequence of coin flips of length $n$, there are</p>
<script type="math/tex; mode=display">\frac{n!}{y!(n - y)!} = {n \choose y}</script>
<p>ways to get $y$ heads out of $n$ throws. The funny looking symbol on the right is the <em>Binomial coefficient</em>. The probability of the data is therefore given by the Binomial likelihood:</p>
<script type="math/tex; mode=display">p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,</script>
<p>which just adds the term ${n \choose y}$ to the equation we had above after introducing $Y$. For the example of observing $y = 2$ heads out of $n = 3$ coin flips, the Binomial coefficient is ${3 \choose 2} = 3$, which accounts for the fact that there are three possible ways to get two heads out of three throws.</p>
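<p>The Binomial likelihood can be written down directly in R and compared against the built-in <code>dbinom</code>. This is a minimal sketch; the function name <code>binomial_likelihood</code> and the value $\theta = 0.3$ are assumptions made for this example.</p>

```r
# Binomial likelihood written out term by term
binomial_likelihood <- function(y, n, theta) {
  choose(n, y) * theta^y * (1 - theta)^(n - y)
}

n <- 3
theta <- 0.3  # illustrative value

# Hand-written version and R's built-in density for y = 2 heads
manual <- binomial_likelihood(2, n, theta)
builtin <- dbinom(2, size = n, prob = theta)

# With the Binomial coefficient included, the probabilities of
# y = 0, 1, ..., n heads add up to one for a fixed theta
total <- sum(binomial_likelihood(0:n, n, theta))
```

<p>Without the term ${n \choose y}$, the probabilities of the possible outcomes $y = 0, \ldots, n$ would not add up to one.</p>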
<h2 id="the-data">The data</h2>
<p>Assume we flip the coin three times, $n = 3$, and observe three heads, $y = 3$. How can we estimate the bias of the coin? In the next sections, we will use the Binomial likelihood derived above and discuss three different ways of estimating the coin’s bias: maximum likelihood estimation, Bayesian estimation, and penalized maximum likelihood estimation.</p>
<h2 id="classical-estimation">Classical estimation</h2>
<p>Within the frequentist paradigm, the method of maximum likelihood is arguably the most popular method for parameter estimation: choose as an estimate for $\theta$ the value which maximizes the likelihood of the data.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> To get a feeling for how the likelihood of the data differs across values of $\theta$, let’s pick two values, $\theta_1 = .5$ and $\theta_2 = 1$, and compute the likelihood of observing three heads out of three coin flips:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y = 3 \mid \theta = .5) &= {3 \choose 3} .5^3 (1 - .5)^{3 - 3} = 0.125 \\[.5em]
p(y = 3 \mid \theta = 1) &= {3 \choose 3} 1^3 (1 - 1)^{3 - 3} = 1 \enspace .
\end{aligned} %]]></script>
<p>We therefore conclude that the data are more likely for a coin that has bias $\theta_2 = 1$ than for a coin that has bias $\theta_1 = 0.5$. But is it the <em>most</em> likely value? To compare all possible values for $\theta$ visually, we plot the likelihood as a function of $\theta$ below. The left figure shows that, indeed, $\theta = 1$ maximizes the likelihood for the data. The right figure shows the likelihood function for $y = 15$ heads out of $n = 20$ coin flips. Note that, in contrast to probabilities, which need to sum to one, likelihoods do not have a natural scale.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
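<p>The two likelihood functions can also be evaluated numerically. The sketch below uses a grid over $\theta$ (the grid resolution and variable names are choices made here) to locate the maximum of each likelihood function, and integrates the first one over $\theta$ to illustrate that a likelihood function need not integrate to one.</p>

```r
# Grid of candidate values for the coin's bias
theta <- seq(0, 1, by = 0.001)

# Likelihood of the two data sets as a function of theta
lik_3_of_3 <- dbinom(3, size = 3, prob = theta)
lik_15_of_20 <- dbinom(15, size = 20, prob = theta)

# Grid value with the highest likelihood for each data set
best_3_of_3 <- theta[which.max(lik_3_of_3)]
best_15_of_20 <- theta[which.max(lik_15_of_20)]

# Area under the likelihood function for y = 3, n = 3, which is
# the integral of theta^3 from 0 to 1, i.e., 1/4 rather than 1
area <- integrate(function(t) dbinom(3, size = 3, prob = t), 0, 1)$value
```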
<p>Do these two examples allow us to derive a general principle for how to estimate the bias of a coin? Let $\hat{\theta}$ denote an estimate of the population parameter $\theta$. The two figures above suggest that $\hat{\theta} = \frac{y}{n}$ is the maximum likelihood estimate for an arbitrary data set $d = (y, n)$ … and it is! To arrive at this mathematically, we can find the maximum of this likelihood function by taking the derivative with respect to $\theta$, and setting it to zero (see also a <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">previous</a> post). In other words, we solve for the value of $\theta$ at which the slope of the likelihood function is zero; and since the Binomial likelihood is unimodal, this maximum will be unique. Note that the value for $\theta$ at which the likelihood function has its maximum does not change when we take logs, but because the mathematics is greatly simplified, we do so:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= \frac{\partial}{\partial \theta}\text{log}\left({n \choose y} \theta^y (1 - \theta)^{n - y}\right) \\[.5em]
0 &= \frac{\partial}{\partial \theta}\left(\text{log}{n \choose y} + y \, \text{log}\theta + (n - y) \, \text{log}(1 - \theta)\right) \\[.5em]
0 &= \frac{y}{\theta} - \frac{n - y}{1 - \theta}\\[.5em]
\frac{n - y}{1 - \theta} &= \frac{y}{\theta} \\[.5em]
\theta (n - y) &= (1 - \theta) y \\[.5em]
\theta n - \theta y &= y - \theta y \\[.5em]
\theta n &= y \\[.5em]
\theta &= \frac{y}{n} \enspace ,
\end{aligned} %]]></script>
<p>which shows that indeed $\frac{y}{n}$ is the maximum likelihood estimate.</p>
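<p>The derivation can be checked numerically. The following sketch, using the data $y = 15$ and $n = 20$ from the figure above, evaluates the derivative of the log likelihood at $\frac{y}{n}$ and compares the analytic estimate with a one-dimensional numerical optimizer. The function names are hypothetical helpers introduced for this example.</p>

```r
# Binomial log likelihood as a function of theta
log_likelihood <- function(theta, y, n) {
  dbinom(y, size = n, prob = theta, log = TRUE)
}

# Derivative of the log likelihood with respect to theta
score <- function(theta, y, n) {
  y / theta - (n - y) / (1 - theta)
}

y <- 15
n <- 20

# Analytic maximum likelihood estimate
theta_analytic <- y / n

# The derivative vanishes at the analytic estimate
score_at_mle <- score(theta_analytic, y, n)

# Numerical maximization over the open interval (0, 1)
fit <- optimize(log_likelihood, interval = c(0, 1), y = y, n = n, maximum = TRUE)
theta_numeric <- fit$maximum
```

<p>The numerical estimate agrees with $\frac{y}{n}$ up to the tolerance of the optimizer.</p>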
<h2 id="bayesian-estimation">Bayesian estimation</h2>
<p>Bayesians assign priors to parameters in addition to the likelihood, which takes a central role in all statistical paradigms. For this Binomial problem, we assign $\theta$ a Beta prior:</p>
<script type="math/tex; mode=display">p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .</script>
<p>As we will see below, this prior allows easy Bayesian updating while being sufficiently flexible in incorporating prior information. The figure below shows different Beta distributions, formalizing our prior belief about values of $\theta$.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>The figure in the top left corner assigns uniform prior plausibility to all values of $\theta$; the figures to its right incorporate a slight bias towards the extreme values $\theta = 1$ and $\theta = 0$. With increasing $a$ and $b,$ the prior becomes more biased towards $\theta = 0.5$; with decreasing $a$ and $b$, the prior becomes biased against $\theta = 0.5$.</p>
<p>As shown in a <a href="https://fdabl.github.io/r/Spike-and-Slab.html">previous</a> blog post, the Beta distribution is <em>conjugate</em> to the Binomial likelihood, which means that the posterior distribution of $\theta$ is again a Beta distribution:</p>
<script type="math/tex; mode=display">p(\theta \mid y) = \frac{1}{\text{B}(a', b')} \theta^{a' - 1} (1 - \theta)^{b' - 1} \enspace ,</script>
<p>where $a' = a + y$ and $b' = b + n - y$. Under this conjugate setup, the parameters of the prior can be understood as prior data; for example, if we choose prior parameters $a = b = 1$, then we assume that we have seen one heads and one tails prior to data collection. The figure below shows two examples of such Bayesian updating processes.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
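<p>The conjugate update adds the observed number of heads to $a$ and the observed number of tails to $b$. As a sketch of how one might verify this, the code below normalizes the product of likelihood and prior on a grid and compares it with the Beta posterior. The grid of bin midpoints and the choice $a = b = 2$ are assumptions made for this example.</p>

```r
# Prior parameters and data
a <- 2
b <- 2
y <- 3
n <- 3

# Conjugate update: heads are added to a, tails are added to b
a_post <- a + y
b_post <- b + n - y

# Brute-force posterior: likelihood times prior, normalized numerically
width <- 0.001
theta <- seq(width / 2, 1 - width / 2, by = width)  # midpoints of 1000 bins
unnormalized <- dbinom(y, size = n, prob = theta) * dbeta(theta, a, b)
grid_posterior <- unnormalized / sum(unnormalized * width)

# Largest discrepancy between the grid posterior and the Beta posterior
max_abs_diff <- max(abs(grid_posterior - dbeta(theta, a_post, b_post)))
```

<p>The discrepancy is due only to the numerical integration and shrinks as the grid becomes finer.</p>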
<p>In both cases, we observe $y = 3$ heads out of $n = 3$ coin flips. On the left, we assign $\theta$ a uniform prior. The resulting posterior distribution is proportional to the likelihood (which we have rescaled to fit nicely in the graph) and thus does not appear as a separate line. After we have seen the data, we can compute the posterior mode as our estimate for the most likely value of $\theta$. Observe that the posterior mode is equivalent to the maximum likelihood estimate:</p>
<script type="math/tex; mode=display">\hat{\theta}_{\text{PM}} = \frac{a' - 1}{a' + b' - 2} = \frac{1 + y - 1}{1 + y + 1 + n - y - 2} = \frac{y}{n} = \hat{\theta}_{\text{MLE}} \enspace .</script>
<p>This is in fact the case for all statistical estimation problems where we assign a uniform prior to the (possibly high-dimensional) parameter vector $\theta$. To prove this, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\hat{\theta}_{\text{PM}} &= \underset{\theta}{\text{argmax}} \, \frac{p(y \mid \theta) \, p(\theta)}{p(y)} \\[.5em]
&= \underset{\theta}{\text{argmax}} \, p(y \mid \theta) \\[.5em]
&= \hat{\theta}_{\text{MLE}} \enspace ,
\end{aligned} %]]></script>
<p>since we can drop the normalizing constant $p(y)$, because it does not depend on $\theta$, and $p(\theta)$, because it is a constant assigning all values of $\theta$ equal probability.</p>
<p>Using a Beta prior with $a = b = 2$, as shown on the right side of the figure above, we see that the posterior is not proportional to the likelihood anymore. This in turn means that the mode of the posterior distribution no longer corresponds to the maximum likelihood estimate. In this case, the posterior mode is:</p>
<script type="math/tex; mode=display">\hat{\theta}_{\text{PM}} = \frac{5 - 1}{5 + 2 - 2} = 0.8 \enspace .</script>
<p>In contrast to earlier, this estimate is <em>shrunk</em> towards $\theta = 0.5$. This came about because we have used prior information that stated that $\theta = 0.5$ is more likely than the other values (see figure with $a = b = 2$ above). Consequently, we were less swayed by the somewhat unlikely situation (under no bias $\theta = 0.5$) of observing three heads out of three throws. It should thus not come as a surprise that Bayesian priors <em>can</em> act as regularizing devices. However, this requires careful application, especially in small sample size settings.</p>
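<p>The two posterior modes discussed above can be reproduced with a few lines of R. This is a minimal sketch; <code>posterior_mode</code> is a hypothetical helper, and the closed-form expression for the mode assumes that both posterior parameters are at least one.</p>

```r
# Posterior mode of theta under a Beta(a, b) prior and Binomial data
# (closed form, assuming a + y >= 1 and b + n - y >= 1)
posterior_mode <- function(y, n, a, b) {
  a_post <- a + y
  b_post <- b + n - y
  (a_post - 1) / (a_post + b_post - 2)
}

# Uniform prior: the posterior mode coincides with the MLE y / n
mode_uniform <- posterior_mode(3, 3, a = 1, b = 1)

# Beta(2, 2) prior: the posterior mode is shrunk towards 0.5
mode_beta22 <- posterior_mode(3, 3, a = 2, b = 2)

# Numerical check: maximize the log density of the Beta(5, 2) posterior
mode_numeric <- optimize(function(t) dbeta(t, 5, 2, log = TRUE),
                         interval = c(0, 1), maximum = TRUE)$maximum
```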
<p>In a <em>Post Scriptum</em> to this blog post, I similarly show how the posterior mean, which is arguably a more natural point estimate as it takes the uncertainty about $\theta$ better into account than the posterior mode, can be viewed as a regularized estimate, too.</p>
<h2 id="penalized-estimation">Penalized estimation</h2>
<p>Bayesians are not the only ones who can add prior information to an estimation problem. Within the frequentist framework, penalized estimation methods add a penalty term to the log likelihood function, and then find the parameter value which maximizes this <em>penalized log likelihood</em>. We can implement such a method by optimizing an extended log likelihood:</p>
<script type="math/tex; mode=display">y \, \text{log}\,\theta + (n - y) \, \text{log} \, (1 - \theta) - \underbrace{\lambda (\theta - 0.5)^2}_{\text{Penalty Term}} \enspace ,</script>
<p>where we penalize values that are far from the parameter value which indicates no bias, $\theta = 0.5$. The larger $\lambda$, the stronger values of $\theta \neq 0.5$ get penalized. In addition to picking $\lambda$, the particular form of the penalty term is also important. Similar to assigning $\theta$ a prior distribution, although possibly less straightforward and less flexible, choosing the penalty term means incorporating information about the problem in addition to specifying a likelihood function. Above, we have used the <em>squared distance</em> from $\theta = 0.5$ as a penalty. We call this the $\mathcal{L}_2$-norm penalty<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>, but the $\mathcal{L}_1$-norm, which takes the <em>absolute distance</em>, is an equally interesting choice:</p>
<script type="math/tex; mode=display">y \, \text{log}\,\theta + (n - y) \, \text{log} \, (1 - \theta) - \lambda |\theta - 0.5| \enspace .</script>
<p>As we will see below, these penalties have very different effects.</p>
<p>The penalized likelihood does not only depend on $\theta$, but also on $\lambda$. The code below evaluates the penalized log likelihood function given values for these two parameters. Note that we drop the normalizing constant ${n \choose y}$ as it depends neither on $\theta$ nor on $\lambda$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0.001</span><span class="p">,</span><span class="w"> </span><span class="m">.999</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">theta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">reg</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">get_penalized_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Vectorize</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span></code></pre></figure>
<p>With only three data points it is futile to try to estimate $\lambda$ using, for example, cross-validation; however, this is also not the goal of this blog post. Instead, to get further intuition, we simply try out a number of values for $\lambda$ using the code below and see how it influences our estimate of $\theta$. Because the parameter space has only one dimension, we can easily find the value for $\theta$ which maximizes the penalized likelihood even without wearing our calculus hat. Specifically, given a particular value for $\lambda$, we evaluate the penalized likelihood function for a range of values of $\theta$ between zero and one and pick the value that maximizes it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">estimate_path</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lambda_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">.01</span><span class="p">)</span><span class="w">
</span><span class="n">theta_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">theta_best</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="nf">seq_along</span><span class="p">(</span><span class="n">lambda_seq</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">penalized_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_penalized_likelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta_seq</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_seq</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">reg</span><span class="p">)</span><span class="w">
</span><span class="n">theta_seq</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="n">penalized_likelihood</span><span class="p">)]</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">lambda_seq</span><span class="p">,</span><span class="w"> </span><span class="n">theta_best</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Sticking with the observations of three heads ($y = 3$) out of three throws ($n = 3$), the figure below plots best fitting values for $\theta$ given a range of values for $\lambda$. Observe that the $\mathcal{L}_1$-norm penalty shrinks the estimate more quickly and abruptly, reaching $\theta = 0.5$ at $\lambda = 6$, while the $\mathcal{L}_2$-norm penalty gradually (and rather slowly) shrinks the parameter towards $\theta = 0.5$ with increasing $\lambda$. Why is this so?</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>First, note that because $\theta \in [0, 1]$ the squared distance from $\theta = 0.5$ will never be larger than the absolute distance, which explains the slower shrinkage. Second, the fact that the $\mathcal{L}_1$-norm penalty can shrink <em>exactly</em> to $\theta = 0.5$ is due to the kink of the absolute value function, which is continuous but not differentiable at zero. The figures below provide some intuition. In particular, the figure on the left shows the $\mathcal{L}_1$-norm penalized likelihood function for a select number of $\lambda$’s. We see that for $\lambda < 3$, the value $\theta = 1$ performs best. With $\lambda \in [3, 6]$, values of $\theta \in [0.5, 1]$ become more likely than the extreme estimate $\theta = 1$. For $\lambda \geq 6$, the ‘no bias’ value $\theta = 0.5$ maximizes the penalized likelihood. Due to the kink in the penalty, the shrinkage is exact. The $\mathcal{L}_2$-norm penalty, on the other hand, shrinks less strongly, and never exactly to $\theta = 0.5$, except of course for $\lambda \rightarrow \infty$. We can see this in the right figure below, where the penalized likelihood function is merely shifted to the left with increasing $\lambda$; this is in contrast to the $\mathcal{L}_1$-norm penalized likelihood on the left, for which the value $\theta = 0.5$ at the kink takes a special place.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
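<p>For the data $y = n = 3$, the thresholds $\lambda = 3$ and $\lambda = 6$ can be recovered by hand. The closed-form expressions in the sketch below are not part of the analysis above; they follow from setting the derivative of the penalized log likelihood $3 \, \text{log}\,\theta - \lambda |\theta - 0.5|^{r}$ to zero on $\theta \in (0.5, 1]$, which gives $\theta = 3 / \lambda$ for the $\mathcal{L}_1$-norm penalty and the positive root of $2 \lambda \theta^2 - \lambda \theta - 3 = 0$ for the $\mathcal{L}_2$-norm penalty, each clipped to the admissible range. The code compares them with a grid search; all names are hypothetical.</p>

```r
# Grid search for the maximizer of the penalized log likelihood,
# specialized to y = n = 3 so that the log(1 - theta) term drops out
grid_argmax <- function(lambda, reg, theta = seq(0.001, 1, by = 0.001)) {
  penalized <- 3 * log(theta) - lambda * abs(theta - 0.5)^reg
  theta[which.max(penalized)]
}

lambdas <- c(0, 1, 3, 4, 6, 8, 10)
l1_grid <- sapply(lambdas, grid_argmax, reg = 1)
l2_grid <- sapply(lambdas, grid_argmax, reg = 2)

# Closed forms derived for this example (an assumption of this sketch):
# L1: theta = 3 / lambda, clipped to [0.5, 1]
l1_exact <- pmin(1, pmax(0.5, 3 / lambdas))
# L2: positive root of 2 * lambda * theta^2 - lambda * theta - 3 = 0, capped at 1
l2_exact <- pmin(1, 0.25 + 0.25 * sqrt(1 + 24 / lambdas))

# Discrepancies are bounded by the grid resolution
l1_error <- max(abs(l1_grid - l1_exact))
l2_error <- max(abs(l2_grid - l2_exact))
```

<p>Under these assumptions, the $\mathcal{L}_1$ solution reaches $\theta = 0.5$ at $\lambda = 6$ and stays there, whereas the $\mathcal{L}_2$ solution only approaches $\theta = 0.5$ as $\lambda$ grows.</p>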
<p>You can play around with the code below to get an intuition for how different values of $\lambda$ influence the penalized likelihood function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'latex2exp'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Penalized Likelihood'</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">nl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">lambdas</span><span class="p">)</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">.999</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nl</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span><span class="w">
</span><span class="n">normalize</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nl</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_penalized_likelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">reg</span><span class="p">)</span><span class="w">
</span><span class="n">likelihood</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">normalize</span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">log_likelihood</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">likelihood</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylab</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$\\theta$'</span><span class="p">),</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">
</span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nl</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">likelihood</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylab</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.2</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s1">'$\\lambda = %.2f$'</span><span class="p">,</span><span class="w"> </span><span class="n">l</span><span class="p">)))</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s1">'topleft'</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">info</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nl</span><span class="p">),</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">box.lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">lambdas</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">)</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$L_1$ Penalized Likelihood'</span><span class="p">))</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$L_2$ Penalized Likelihood'</span><span class="p">))</span></code></pre></figure>
<p>In practice, one would reparameterize this model as a logistic regression, and use cross-validation to estimate the best value for $\lambda$; see the <em>Post Scriptum</em> for a sketch of this approach.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen two perspectives on regularization illustrated on a very simple example: estimating the bias of a coin. We first derived the Binomial likelihood, connecting the data to a parameter $\theta$ which we took to be the bias of the coin, as well as the maximum likelihood estimate. Observing three heads out of three coin flips, we became slightly uncomfortable with the (extreme) estimate $\hat{\theta} = 1$. We have seen how, from a Bayesian perspective, one can add prior information to this estimation problem, and how this led to an estimate that was <em>shrunk</em> towards $\theta = 0.5$. Within the frequentist framework, one can add information by augmenting the likelihood function with a penalty term. The type of information we want to incorporate determines the particular penalty term. In this blog post, we have focused on the most commonly used penalty terms: the $\mathcal{L}_1$-norm penalty, which shrinks parameters exactly to a particular value; and the $\mathcal{L}_2$-norm penalty, which provides continuous shrinkage. A future blog post might look into linear regression models where regularization methods abound and study how, for example, the popular Lasso can be recast in Bayesian terms.</p>
<hr />
<p><em>I would like to thank Jonas Haslbeck, Don van den Bergh, and Sophia Crüwell for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="posterior-mean">Posterior mean</h3>
<p>You may argue that one should use the mean instead of the mode as a posterior summary measure. If one does this, then there is already some shrinkage for the case of uniform priors. The mean of the posterior distribution is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[\theta] &= \frac{a'}{a' + b'} \\[.5em]
&= \frac{a + y}{a + y + b + n - y} \\[.5em]
&= \frac{a + y}{a + b + n} \enspace .
\end{aligned} %]]></script>
<p>As so often in mathematics, we can rewrite this in a more complicated manner to gain insight into how Bayesian priors shrink estimates:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[\theta] &= \frac{a}{a + b + n} + \frac{y}{a + b + n} \\[.5em]
&= \frac{a}{a + b + n} \left(\frac{a + b}{a + b}\right) + \frac{y}{a + b + n} \left( \frac{n}{n} \right) \\[.5em]
&= \frac{a + b}{a + b + n} \underbrace{\left(\frac{a}{a + b}\right)}_{\text{Prior mean}} + \frac{n}{a + b + n} \underbrace{\left( \frac{y}{n} \right)}_{\text{MLE}} \enspace .
\end{aligned} %]]></script>
<p>This decomposition shows that the posterior mean is a weighted combination of the prior mean and the maximum likelihood estimate. Since we can think of $a + b$ as the number of prior data points, $a + b + n$ can be thought of as the <em>total</em> number of data points. The prior mean is thus weighted by the proportion of prior to total data, while the maximum likelihood estimate is weighted by the proportion of sample data to total data. This provides another perspective on how Bayesian priors regularize estimates.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
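<p>To make the decomposition concrete, here is a minimal R sketch that computes the posterior mean both directly and as the weighted average derived above. The function names <code>posterior_mean</code> and <code>weighted_mean</code> are my own choices for illustration; the numbers are the $y = 3$, $n = 3$ example from this post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Posterior mean of theta under a Beta(a, b) prior and Binomial data (y, n)
posterior_mean <- function(a, b, y, n) {
  (a + y) / (a + b + n)
}

# The same quantity written as a weighted average of prior mean and MLE
weighted_mean <- function(a, b, y, n) {
  w_prior <- (a + b) / (a + b + n)  # share of prior data in the total
  w_data  <- n / (a + b + n)        # share of observed data in the total
  w_prior * (a / (a + b)) + w_data * (y / n)
}

posterior_mean(2, 2, 3, 3)  # Beta(2, 2) prior: 5 / 7
weighted_mean(2, 2, 3, 3)   # same value: 4/7 * 0.5 + 3/7 * 1
posterior_mean(1, 1, 3, 3)  # uniform prior: 4 / 5, already shrunk away from 1</code></pre></figure>
<p>The last line illustrates the point made at the start of this section: even under a uniform prior, the posterior mean is pulled away from the maximum likelihood estimate of one.</p>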
<h3 id="penalized-logistic-regression">Penalized logistic regression</h3>
<p>Cross-validation might be a bit awkward when we represent the data using only $y$ and $n$. We can go back to the product of Bernoulli representation, which uses all individual data points $x_i$. This results in a logistic regression problem with likelihood:</p>
<script type="math/tex; mode=display">p(x_1, \ldots, x_n \mid \beta) = \prod_{i=1}^n \left(\frac{1}{1 + e^{-\beta}}\right)^{x_i} \left(1 - \frac{1}{1 + e^{-\beta}}\right)^{1 - x_i}\enspace ,</script>
<p>where we use a sigmoid function as the link function, and $\beta$ is on the log odds scale. The penalized log likelihood function can be written as</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \left[ x_i \, \text{log} \left(\frac{1}{1 + e^{-\beta}}\right) + (1 - x_i) \, \text{log} \left(1 - \frac{1}{1 + e^{-\beta}}\right) \right] - \lambda |\beta| \enspace ,</script>
<p>where, because $\beta = 0$ corresponds to $\theta = 0.5$, we do not need to subtract a constant inside the penalty term. This parameterization also makes it easier to study which types of priors on $\beta$ result in an $\mathcal{L}_1$ or $\mathcal{L}_2$-norm penalty (spoiler: a Laplace and a Gaussian prior, respectively). Such models can be estimated using the R package <em>glmnet</em>, although it does not work for the exceedingly small sample we have played with in this blog post. This suggests that regularization is more natural in the Bayesian framework, which additionally allows more flexible specification of prior knowledge.</p>
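<p>For a sample this small, we can still maximize the penalized log likelihood directly. The following is a rough sketch using base R only, not the <em>glmnet</em> machinery; the function names <code>pen_loglik</code> and <code>estimate_beta</code>, the <code>reg</code> argument, and the search interval are assumptions I make for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Penalized Bernoulli log likelihood on the log odds scale.
# x: vector of 0/1 outcomes; reg = 1 gives the L1 penalty, reg = 2 the L2 penalty.
pen_loglik <- function(beta, x, lambda, reg = 1) {
  p <- plogis(beta)  # sigmoid, 1 / (1 + exp(-beta))
  sum(x * log(p) + (1 - x) * log(1 - p)) - lambda * abs(beta)^reg
}

# Maximize over beta with a one-dimensional optimizer; the search interval
# is an arbitrary choice for this sketch.
estimate_beta <- function(x, lambda, reg = 1) {
  optimize(pen_loglik, interval = c(-10, 10), x = x, lambda = lambda,
           reg = reg, maximum = TRUE)$maximum
}

x <- c(1, 1, 1)  # three heads out of three flips
beta_l1 <- estimate_beta(x, lambda = 1, reg = 1)
beta_l2 <- estimate_beta(x, lambda = 1, reg = 2)
plogis(c(beta_l1, beta_l2))  # estimates mapped back to the theta scale</code></pre></figure>
<p>Mapping the estimates back through <code>plogis</code> puts them on the familiar $\theta$ scale, where both are shrunk from the maximum likelihood estimate of one towards $0.5$. Choosing $\lambda$ by cross-validation would require more data than three coin flips.</p>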
<hr />
<h2 id="references">References</h2>
<ul>
<li>Gelman, A., & Nolan, D. (<a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">2002</a>). You can load a die, but you can’t bias a coin. <em>The American Statistician, 56</em>(4), 308-311.</li>
<li>Stigler, S. M. (<a href="https://projecteuclid.org/euclid.ss/1207580174">2007</a>). The Epic Story of Maximum Likelihood. <em>Statistical Science, 22</em>(4), 598-620.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I don’t think anybody actually ever is interested in estimating the bias of a coin. In fact, one <em>cannot bias a coin</em> if we are only allowed to flip it in the usual manner (see Gelman & Nolan, <a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">2002</a>). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In a wonderful paper humbly titled <em>The Epic Story of Maximum Likelihood</em>, Stigler (<a href="https://projecteuclid.org/euclid.ss/1207580174">2007</a>) says that maximum likelihood estimation must have been familiar even to hunters and gatherers, although they would not have used such fancy words, as the idea is exceedingly simple. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Strictly speaking, this is incorrect: the only norm that exists for the one-dimensional vector space is the absolute value norm. Thus, in our example with only one parameter $\theta$ there is no notion of an $\mathcal{L}_2$-norm. However, because of the analogy to the regression and more generally multidimensional setting, I hope that this inaccuracy is excused. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>It also shows that in the limit of infinite data, the posterior mean converges to the maximum likelihood estimate. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian Dablander
Variable selection using Gibbs sampling2019-03-31T13:00:00+00:002019-03-31T13:00:00+00:00https://fabiandablander.com/r/Spike-and-Slab<p>“Which variables are important?” is a key question in science and statistics. 
In this blog post, I focus on linear models and discuss a Bayesian solution to this problem using <em>spike-and-slab priors</em> and the <em>Gibbs sampler</em>, a computational method to sample from a joint distribution using only conditional distributions.</p>
<p>Variable selection is a beast. To slay it, we must draw on ideas from different fields. We have to discuss the basics of Bayesian inference, which motivate our principal weapon, the Gibbs sampler. As an instruction manual, we apply it to a simple example: drawing samples from a bivariate Gaussian distribution (for pre-combat exercises, see <a href="https://fdabl.github.io/statistics/Two-Properties.html">here</a>). The Gibbs sampler feeds on conditional distributions. To be able to derive those easily, we need to equip ourselves with $d$-separation and directed acyclic graphs (DAGs). Having trained and become stronger, we attack variable selection in the linear regression case using Gibbs sampling with spike-and-slab priors. These priors are special in that they are a discrete mixture of a Dirac delta function, which can shrink regression coefficients exactly to zero, and a Gaussian distribution. We tackle the single predictor case first, and then generalize it to $p > 1$ predictors. For $p$ predictors, the Gibbs sampler with spike-and-slab priors yields a posterior distribution over all $2^p$ possible regression models, an enormous feat. From this, posterior inclusion probabilities and model-averaged parameter estimates follow straightforwardly. To wield this weapon in practice, we implement the method in R and engage in variable selection on simulated and real data. Seems like we have a lot to cover, so let’s get started!</p>
<h1 id="quantifying-uncertainty">Quantifying uncertainty</h1>
<p>Bayesian inference is an excellent tool for uncertainty quantification. Assume you have assigned a prior distribution to some parameter $\beta$ of a model $\mathcal{M}$, call it $p(\beta \mid \mathcal{M})$. After you have observed data $\mathbf{y}$, how should you update your belief to arrive at the posterior, $p(\beta \mid \mathbf{y}, \mathcal{M})$? The rules of probability dictate:</p>
<script type="math/tex; mode=display">\underbrace{p(\beta \mid \mathbf{y}, \mathcal{M})}_{\text{Posterior}} = \underbrace{p(\beta \mid \mathcal{M})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{y} \mid \beta, \mathcal{M})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{y} \mid \beta, \mathcal{M}) \, p(\beta \mid \mathcal{M}) \, \mathrm{d} \beta}_{\text{Marginal Likelihood}}} \enspace .</script>
<p>The computationally easy parts of the right-hand side are the specification of the prior and, unless you do <a href="https://en.wikipedia.org/wiki/Approximate_Bayesian_computation">crazy things</a>, also the likelihood. The tough bit is the marginal likelihood or <em>normalizing constant</em> which, as the name implies, makes the posterior distribution integrate to one, as all proper probability distributions must. In contrast to differentiation, which is a local operation, integration is a global operation and is thus <a href="https://xkcd.com/2117/">much harder</a>. It becomes even harder with many parameters.</p>
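<p>With a single parameter, the integral is still easy to approximate numerically. As a small illustration, consider a toy model that is not part of this post's regression setup: one observation $y \sim \mathcal{N}(\beta, 1)$ with prior $\beta \sim \mathcal{N}(0, 1)$. The observation and the names <code>integrand</code> and <code>marg_lik</code> are made up for this sketch. For this conjugate model the marginal likelihood is known to be $\mathcal{N}(0, 2)$ evaluated at $y$, so we can check the numerical answer.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Toy model (an assumption for illustration): y ~ Normal(beta, 1), beta ~ Normal(0, 1)
y <- 1.3  # a single made-up observation

# Integrand: likelihood times prior, as a function of beta
integrand <- function(beta) {
  dnorm(y, mean = beta, sd = 1) * dnorm(beta, mean = 0, sd = 1)
}

# Marginal likelihood by one-dimensional numerical integration
marg_lik <- integrate(integrand, lower = -Inf, upper = Inf)$value

# For this conjugate toy model the answer is known: y ~ Normal(0, sd = sqrt(2))
c(numerical = marg_lik, analytical = dnorm(y, mean = 0, sd = sqrt(2)))

# Dividing by the marginal likelihood yields a proper posterior density
posterior <- function(beta) integrand(beta) / marg_lik
integrate(posterior, lower = -Inf, upper = Inf)$value  # one, up to numerical error</code></pre></figure>
<p>Grid-like integration of this kind scales poorly: with many parameters the number of evaluation points grows exponentially, which is why we will look for methods that avoid the marginal likelihood altogether.</p>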
<p>Usually, Bayes’ rule is given without conditioning on the model, $\mathcal{M}$. However, this assumes that we know one model to be true with certainty, thus ignoring the uncertainty we have about the models. We can apply Bayes’ rule not only to parameters, but also to models:</p>
<script type="math/tex; mode=display">p(\mathcal{M} \mid \mathbf{y}) = p(\mathcal{M}) \times \frac{p(\mathbf{y} \mid \mathcal{M})}{\sum_{i = 1}^m p(\mathbf{y} \mid \mathcal{M}_i) \, p(\mathcal{M}_i)} \enspace ,</script>
<p>where $m$ is the number of all models and</p>
<script type="math/tex; mode=display">p(\mathbf{y} \mid \mathcal{M}) = \int p(\mathbf{y} \mid \mathcal{M}, \beta) \, p(\beta \mid \mathcal{M}) \, \mathrm{d} \beta \enspace ,</script>
<p>is in fact the marginal likelihood of our first equation. To illustrate how one could do variable selection, assume we have two models, $\mathcal{M}_2$ and $\mathcal{M}_4$, which differ in their number of predictors:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{M}_2&: \mathbf{y} = \beta_0 + \beta_1 \mathbf{x}_1 \\[0.5em]
\mathcal{M}_4&: \mathbf{y} = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 \enspace .
\end{aligned} %]]></script>
<p>If these two are the only models we consider, then we can quantify their respective merits using posterior odds:</p>
<script type="math/tex; mode=display">\underbrace{\frac{p(\mathcal{M}_4 \mid \mathbf{y})}{p(\mathcal{M}_2 \mid \mathbf{y})}}_{\text{Posterior Odds}} = \underbrace{\frac{p(\mathcal{M}_4)}{p(\mathcal{M}_2)}}_{\text{Prior Odds}} \times \underbrace{\frac{p(\mathbf{y} \mid \mathcal{M}_4)}{p(\mathbf{y} \mid \mathcal{M}_2)}}_{\text{Bayes factor}} \enspace ,</script>
<p>where we can interpret the Bayes factor as an indicator for how much more likely the data are under $\mathcal{M}_4$, which includes $\beta_2$, compared to $\mathcal{M}_2$, which does not include $\beta_2$. However, two additional regression models are possible:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{M}_1&: \mathbf{y} = \beta_0\\[0.5em]
\mathcal{M}_3&: \mathbf{y} = \beta_0 + \beta_2 \mathbf{x}_2 \enspace .
\end{aligned} %]]></script>
<p>In general, if $p$ is the number of predictors, then there are $2^p$ possible regression models in total. If we ignore some of those a priori, we will have violated <em>Cromwell’s rule</em>, which states that we should never assign prior probabilities of zero to things that could possibly happen. Otherwise, regardless of the evidence, we would never change our mind. As Dennis Lindley put it, we should</p>
<blockquote>
<p>“[…] leave a little probability for the moon being made of green cheese; it can be as small as 1 in a million, but have it there since otherwise an army of astronauts returning with samples of the said cheese will leave you unmoved.” (Lindley, 1991, p. 101)</p>
</blockquote>
<p>One elegant aspect about the Bayes factor is that we do not need to compute the normalizing constant of all models (it cancels in the ratio), which would require us to enumerate and assign priors to all possible models. If we are willing to do this, however, then we can model-average to get a posterior distribution of $\beta_j$ that takes into account the uncertainty about all $m$ models:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta_j \mid \mathbf{y}) &= \sum_{i=1}^m \, p(\beta_j, \mathcal{M}_i \mid \mathbf{y}) \\[.5em]
&= \sum_{i=1}^m \, p(\beta_j \mid \mathbf{y}, \mathcal{M}_i) \, p(\mathcal{M}_i \mid \mathbf{y}) \enspace ,
\end{aligned} %]]></script>
<p>which requires computing the posterior distribution over the parameter of interest $\beta_j$ in each model $\mathcal{M}_i$, as well as the posterior distribution over all such models. Needless to say, this is a difficult problem; the bulk of this blog post is devoted to finding an efficient way to do this in the context of linear regression models. For variable selection, we might be interested in another quantity: the posterior probability that $\beta_j \neq 0$, averaged over all models. We can arrive at this by similar means:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta_j \neq 0 \mid \mathbf{y}) & = \sum_{i=1}^m \, p(\beta_j \neq 0, \mathcal{M}_i \mid \mathbf{y}) \\[.5em]
&= \sum_{i=1}^m \, p(\beta_j \neq 0 \mid \mathbf{y}, \mathcal{M}_i) \, p(\mathcal{M}_i \mid \mathbf{y}) \enspace .
\end{aligned} %]]></script>
<p>Note that conditional on a model $\mathcal{M}_i$, $\beta_j$ is either zero or not zero. Therefore, all the terms in which $\beta_j$ is zero drop out of the sum, and we are left with summing the posterior model probabilities for the models in which $\beta_j \neq 0$. This model-averaging perspective strikes me as a very elegant approach to variable selection.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In the remainder of this blog post, we will solve this variable selection problem for linear regression using the Gibbs sampler with spike-and-slab priors.</p>
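<p>As a minimal sketch of this sum for $p = 2$ predictors, consider the four models $\mathcal{M}_1, \ldots, \mathcal{M}_4$ from above. The posterior model probabilities below are made-up numbers chosen purely for illustration, not results of any analysis, and the object names are hypothetical.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Enumerate all 2^p models for p = 2 predictors; each row is one model,
# with 1 meaning the predictor is included and 0 meaning it is excluded.
# Row order: M1 (none), M2 (beta_1 only), M3 (beta_2 only), M4 (both).
models <- as.matrix(expand.grid(beta_1 = c(0, 1), beta_2 = c(0, 1)))

# Hypothetical posterior model probabilities p(M_i | y); they sum to one.
post_model_prob <- c(0.10, 0.25, 0.05, 0.60)

# Posterior inclusion probability of beta_j: sum p(M_i | y) over all
# models in which beta_j is included.
incl_prob <- colSums(models * post_model_prob)
incl_prob</code></pre></figure>
<p>With these illustrative numbers, the inclusion probability is $0.25 + 0.60 = 0.85$ for $\beta_1$ and $0.05 + 0.60 = 0.65$ for $\beta_2$.</p>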
<h1 id="gibbs-sampling">Gibbs sampling</h1>
<p>Much of the progress in Bayesian inference in the last few decades is due to methods that arrive at the posterior distribution without calculating the marginal likelihood. One such method is the Gibbs sampler, which breaks down a high-dimensional problem into a number of smaller low-dimensional problems. It’s really one of the coolest things in statistics: it samples from the joint posterior distribution and its marginals by sampling from the conditional posterior distributions. To prove that it works mathematically is not trivial, and beyond the scope of this already lengthy introductory blog post.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> Thus, instead of getting bogged down in the technical details, let’s take a look at a motivating example.</p>
<h2 id="sampling-from-a-bivariate-gaussian">Sampling from a bivariate Gaussian</h2>
<p>To illustrate, let $X_1$ and $X_2$ be bivariate normally distributed random variables with population mean zero ($\mu_1 = \mu_2 = 0$), unit variance ($\sigma_1^2 = \sigma_2^2 = 1$), and correlation $\rho$. As you may recall from a <a href="https://fdabl.github.io/statistics/Two-Properties.html">previous</a> blog post, the conditional Gaussian distributions of $X_1$ given $X_2 = x_2$, and $X_2$ given $X_1 = x_1$, respectively, are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
X_1 \mid X_2 = x_2 &\sim \mathcal{N}\left(\rho x_2, \, (1 - \rho^2)\right) \\[0.5em]
X_2 \mid X_1 = x_1 &\sim \mathcal{N}\left(\rho x_1, \, (1 - \rho^2)\right) \enspace .
\end{aligned} %]]></script>
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- x_2^1 &\sim \mathcal{N}(0, 1) \\[.5em] -->
<!-- x_1^2 &\sim \mathcal{N}\left(\rho x_2^1, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- x_2^2 &\sim \mathcal{N}\left(\rho x_1^2, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- x_1^3 &\sim \mathcal{N}\left(\rho x_2^2, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- x_2^3 &\sim \mathcal{N}\left(\rho x_1^3, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- \vdots &\sim \vdots \\[.5em] -->
<!-- \end{aligned} -->
<!-- $$ -->
<p>The key property of the Gibbs sampler is that if we sample repeatedly from these two conditional distributions, each time conditioning on the most recent draw of the other variable:</p>
<script type="math/tex; mode=display">(x_1^1, x_2^1), (x_1^2, x_2^2), \ldots, (x_1^{n - 1}, x_2^{n - 1}), (x_1^n, x_2^n) \enspace ,</script>
<p>then these will be samples from the joint distribution $p(X_1, X_2)$ and its marginals.</p>
<!-- The astounding thing with Gibbs sampling is that, if we sample $x_1^t$ from the conditional distribution $p(X_1^t \mid X_2 = x_2^{t-1})$ and $x_2^t$ from the conditional distribution $p(X_2^t \mid X_1 = x_1^{t-1})$, then under some regularity conditions, the joint samples will be from the bivariate Gaussian distribution! -->
<p>To illustrate, we implement this Gibbs sampler in R.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sample_bivariate_normal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">rho</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">rho</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">rho</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="c1"># sample from p(X1 | X2 = x2)</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">rho</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">rho</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="c1"># sample from p(X2 | X1 = x1)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">x</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Let’s see it in action:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample_bivariate_normal</span><span class="p">(</span><span class="n">rho</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w">
</span><span class="n">cov</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1] [,2]
## [1,] 1.0178545 0.5091747
## [2,] 0.5091747 0.9949518</code></pre></figure>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
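<p>Successive Gibbs draws are not independent. Substituting one conditional into the other shows that, for this sampler, each coordinate follows an AR(1) process with coefficient $\rho^2$, so the chain moves more slowly when $\rho$ is large. The sketch below checks this numerically; it restates the sampler compactly so that it runs on its own, and <code>gibbs_bvn</code> and <code>lag1</code> are hypothetical helper names.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Compact restatement of the Gibbs sampler above so this snippet is standalone.
gibbs_bvn <- function(rho, nr_samples) {
  x <- matrix(0, nrow = nr_samples, ncol = 2)
  s <- sqrt(1 - rho^2)  # conditional standard deviation
  for (i in seq(2, nr_samples)) {
    x[i, 1] <- rnorm(1, rho * x[i - 1, 2], s)  # draw from p(X1 | X2 = x2)
    x[i, 2] <- rnorm(1, rho * x[i, 1], s)      # draw from p(X2 | X1 = x1)
  }
  x
}

# Lag-1 autocorrelation of a chain.
lag1 <- function(chain) cor(chain[-1], chain[-length(chain)])

set.seed(1)
rhos <- c(0.5, 0.9)
autocor <- sapply(rhos, function(rho) lag1(gibbs_bvn(rho, 20000)[, 1]))

# Compare the empirical lag-1 autocorrelation of the X1 chain with rho^2.
round(cbind(rho = rhos, empirical = autocor, rho_squared = rhos^2), 2)</code></pre></figure>
<p>The empirical autocorrelations should lie close to $\rho^2$, which is one reason why strongly correlated targets need longer chains.</p>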
<p>Wait a minute, you might say. In this toy example, what was the prior distribution, and which posterior did we compute? The answer is: there were none! We have used the Gibbs sampler not to learn about a parameter, but rather to illustrate that sampling from conditional distributions in this way results in samples from the joint distribution. In the next section, we look at how graphs can help us in finding conditional independencies.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<!-- Although you could get philosophical and ask: what exactly is the [difference](https://www.tandfonline.com/doi/abs/10.1080/15366360802035497) between 'data' (here $x_1$ and $x_2$) and parameters (here $\rho$)? -->
<!-- ## Example II: Another thing -->
<!-- The example above is a little fishy: in the Gaussian case, if we know both conditional distributions, then we also [know the joint distribution](https://fdabl.github.io/statistics/Two-Properties.html)! -->
<h1 id="conditional-independence-and-graphs">Conditional independence and graphs</h1>
<p>Before we look into variable selection using spike-and-slab priors in the linear regression case, we need to get some preliminaries about conditional independence out of the way. We write:</p>
<script type="math/tex; mode=display">X \perp Y \hspace{.4em} \vert\, Z \enspace ,</script>
<p>to denote that $X$ and $Y$ are <em>conditionally independent</em> given $Z$.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> We can visualize conditional independencies between random variables using directed acyclic graphs (DAGs). The figure below distinguishes between three different DAG structures.</p>
<p><img src="../assets/img/DAGs-SS.png" /></p>
<p>DAG (a) above is a <em>common cause</em> structure. A good example is the positive correlation between the number of storks and the number of human babies delivered; these two variables become independent once one conditions on the common cause <em>economic development</em> (Matthews, 2001). DAG (b) is an example where the effect of $X$ on $Y$ is <em>fully mediated</em> by $Z$: conditional on $Z$, $X$ does not have an effect on $Y$. Thus, both in DAGs (a) and (b), conditioning on $Z$ renders $X$ and $Y$ independent.</p>
<p>Two variables can also be <em>marginally independent</em>, for which we write:</p>
<script type="math/tex; mode=display">X \perp Y \enspace ,</script>
<p>which holds in DAG (c). Note, however, that if we were to condition on $Z$ in DAG (c), then $X$ and $Y$ would become <em>dependent</em>. $Z$ is a <em>collider</em>, and conditioning on it induces a dependency between $X$ and $Y$. Although not visible in the DAG, a dependency would also have been induced between $X$ and $Y$ if we had conditioned on any children of $Z$.</p>
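<p>A small simulation can make the collider case concrete. The sketch below assumes a linear Gaussian version of DAG (c) in which $Z = X + Y + \varepsilon$; this functional form and the variable names are assumptions made for illustration only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)            # generated independently of x
z <- x + y + rnorm(n)    # collider: both x and y point into z

# Marginally, x and y are (close to) uncorrelated.
marginal_cor <- cor(x, y)

# Conditioning on z: correlate what is left of x and y after regressing out z.
partial_cor <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

c(marginal = marginal_cor, given_z = partial_cor)</code></pre></figure>
<p>Under this setup the population partial correlation of $X$ and $Y$ given $Z$ is $-0.5$, so the second number should be clearly negative while the first stays near zero.</p>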
<!-- There are various good examples of so-called *collider bias*; for example: ... -->
<p>Note that although we visualize the conditional independencies in a DAG, we do not interpret it causally. We are merely interested in <em>seeing</em>, not <em>doing</em>, and view the arrows as “incidental construction features supporting the
$d$-separation semantics” (Dawid, 2010 p. 90).</p>
<!-- From $d$-separation, we can distill the following factorization for the conditional probability of a node: -->
<!-- $$ -->
<!-- p(X \mid \mathcal{G} \setminus \{X\}) \propto \,p(X \mid \text{Pa}(X)) \, \prod_{Y \in \text{Ch(X)}} p(Y \mid \text{Pa(Y)}) \enspace . -->
<!-- $$ -->
<p>As we will see in the next section, being able to read conditional independencies from a graph greatly aids in finding conditional distributions feeding the Gibbs sampler.</p>
<h1 id="spike-and-slab-regression">Spike-and-Slab Regression</h1>
<h2 id="model-specification">Model specification</h2>
<p>In a previous blog post, we discussed the (history of the) methods of least squares and <a href="https://fdabl.github.io/statistics/Curve-Fitting-Gaussian.html">linear regression</a>. However, we did not assess whether a particular variable $X$ is actually associated with an outcome $Y$. We can think of this problem as hypothesis testing, variable selection, or structure learning. In particular, we may write the regression model with a single predictor variable as:</p>
<script type="math/tex; mode=display">y_i \sim \mathcal{N}(\beta \, x_i , \sigma_e^2) \enspace .</script>
<p>We put the following prior on $\beta$:</p>
<script type="math/tex; mode=display">\beta \sim (1 - \pi) \, \delta_0 + \pi \, \mathcal{N}(0, \sigma_y^2 \tau^2) \enspace ,</script>
<p>where $\pi \in [0, 1]$ is a mixture weight, $\sigma_y^2$ is the variance of $\mathbf{y}$, $\delta_0$ is the <a href="https://en.wikipedia.org/wiki/Dirac_delta_function">Dirac delta function</a> (the <em>spike</em>), and $\tau^2$ is the variance of the <em>slab</em>. We multiply $\tau^2$ by $\sigma_y^2$ so that the prior naturally scales with the scale of the outcome. If we did not do this, then our results would depend on the measurement units of $\mathbf{y}$. Instead of fixing $\tau^2$ to a constant, we learn it by specifying</p>
<script type="math/tex; mode=display">\tau^2 \sim \text{Inverse-Gamma}(1/2, s^2/2) \enspace ,</script>
<p>which results in a scale-mixture of Gaussians, that is, a Cauchy distribution with scale $s$. The figure below visualizes the marginal prior on $\beta$ as a discrete mixture ($\pi = 0.5$) of a Dirac delta and a Cauchy with scale $s = 1/2$, where $\sigma_y^2 = 1$.</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
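<p>The claim that the slab is marginally Cauchy can be checked by simulation. The sketch below assumes $\sigma_y^2 = 1$ and $s = 1/2$, draws $\tau^2$ as the reciprocal of a Gamma variate (which yields an inverse Gamma draw in the shape and scale parameterization used here), and compares quantiles of the resulting $\beta$ draws with those of a Cauchy with scale $s$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 1e5
s <- 1/2

# tau^2 ~ Inverse-Gamma(1/2, s^2/2), drawn as the reciprocal of a Gamma variate.
tau2 <- 1 / rgamma(n, shape = 1/2, rate = s^2 / 2)

# Slab draws: beta | tau^2 ~ N(0, sigma_y^2 * tau^2) with sigma_y^2 = 1.
beta <- rnorm(n, mean = 0, sd = sqrt(tau2))

# Compare empirical quantiles with those of a Cauchy with scale s.
probs <- c(0.25, 0.5, 0.75)
rbind(simulated = quantile(beta, probs), cauchy = qcauchy(probs, scale = s))</code></pre></figure>
<p>Quantiles are used instead of moments because the Cauchy distribution has no finite mean or variance.</p>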
<p>The idea behind this specification is to allow the regression weight $\beta$ to be <em>exactly</em> zero. Using Gibbs sampling, we will arrive at $p(\pi = 1 \mid \mathbf{y})$, which is the posterior probability that the parameter $\beta$ is nonzero; its complement is the posterior probability of $\beta$ being exactly zero. We continue the prior specification with</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\pi &\sim \text{Bern}(\theta) \\[0.5em]
\theta &\sim \text{Beta}(a, b) \\[0.5em]
\sigma_e^2 &\sim \text{Inverse-Gamma}(\alpha_1, \alpha_2) \enspace ,
\end{aligned} %]]></script>
<p>where we set $a = b = 1$ and $\alpha_1 = \alpha_2 = 0.01$. We can visualize the relations between all random variables in a DAG; see the figure below. Nodes with a grey shadow are observed or set by us; white nodes denote random variables.</p>
<p><img src="../assets/img/SS-GM.png" /></p>
<p>Using $d$-separation as introduced in the previous section, we note that this larger graph is basically a collection of DAGs (b) and (c). This helps us see that the joint probability distribution factors:</p>
<script type="math/tex; mode=display">p(\mathbf{y}, \beta, \pi, \theta, \tau^2, \sigma_e^2) = p(\mathbf{y} \mid \beta, \sigma_e^2) \, p(\sigma_e^2) \, p(\beta \mid \pi, \tau^2) \, p(\pi \mid \theta) \, p(\theta) \, p(\tau^2) \enspace ,</script>
<p>where we have suppressed conditioning on the hyperparameters $a = b = 1$, $\alpha_1 = \alpha_2 = 0.01$, $s = 1/2,$ the predictor variables $X$, and the variance of the outcome $\sigma_y^2$.</p>
<p>For the Gibbs sampler, we need the conditional posterior distribution of each parameter given the data and all other parameters. Using the conditional independence structure of the graph, this results in the following conditional distributions:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&p(\theta \mid \mathbf{y}, \beta, \pi, \tau^2, \sigma_e^2 ) = p(\theta \mid \pi) \\[0.5em]
&p(\tau^2 \mid \mathbf{y}, \beta, \pi, \theta, \sigma_e^2) = p(\tau^2 \mid \beta, \pi) \\[0.5em]
&p(\sigma_e^2 \mid \mathbf{y}, \beta, \pi, \theta, \tau^2) = p(\sigma_e^2 \mid \mathbf{y}, \beta) \\[0.5em]
&p(\pi \mid \mathbf{y}, \beta, \theta, \tau^2, \sigma_e^2) = p(\pi \mid \beta, \theta, \tau^2) \\[0.5em]
&p(\beta \mid \mathbf{y}, \pi, \theta, \tau^2, \sigma_e^2) = p(\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2) \enspace .
\end{aligned} %]]></script>
<!-- These conditional independencies result in *local computation*: certain parts are shielded from other parts of the graph. The shield is called the *Markov blanket*. Information trickles through the graph from node to node. -->
<p>In the next sections, we derive these conditional posterior distributions in turn. Since the single predictor case is slightly simpler to follow, we focus on it. However, the generalization to the multiple predictor setting is relatively straightforward, and I will sketch it afterwards.</p>
<h2 id="conditional-posterior-ptheta-mid-pi">Conditional posterior $p(\theta \mid \pi)$</h2>
<p>We expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid \pi) &= \frac{p(\pi \mid \theta) \, p(\theta)}{\int p(\pi \mid \theta) \, p(\theta) \, \mathrm{d}\theta} \\[.5em]
&=\frac{\theta^\pi (1 - \theta)^{1 - \pi} \frac{1}{B(a,b)} \theta^{a - 1} (1 - \theta)^{b - 1}}{\int \theta^\pi (1 - \theta)^{1 - \pi} \frac{1}{B(a,b)} \theta^{a - 1} (1 - \theta)^{b - 1} \, \mathrm{d}\theta} \\[.5em]
&= \frac{\theta^{(a + \pi) - 1} (1 - \theta)^{(b + 1 - \pi) - 1}}{\int \theta^{(a + \pi) - 1} (1 - \theta)^{(b + 1 - \pi) - 1} \, \mathrm{d}\theta} \enspace ,
\end{aligned} %]]></script>
<p>where $B$ is the <a href="https://en.wikipedia.org/wiki/Beta_function">beta function</a>, and where we realize the numerator is the <em>kernel</em> of a Beta distribution, and the denominator is the normalizing constant. Thus, the posterior is again a Beta distribution:</p>
<script type="math/tex; mode=display">\theta \mid \pi \sim \text{Beta}(a + \pi, b + 1 - \pi) \enspace .</script>
<p>As we can see, the conditional posterior of $\theta$ only depends on $\pi$. That means, however, that we can never get much information about this parameter, as $\pi$ can only be 0 or 1, and so the Beta distribution can only become $\text{Beta}(2, 1)$ or $\text{Beta}(1, 2)$ with a uniform prior $a = b = 1$. The posterior mean of $\theta$ can thus never become larger than $2/3$ or smaller than $1/3$.</p>
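<p>In code, this Gibbs step is a single Beta draw. The sketch below uses the hypothetical names <code>draw_theta</code> and <code>theta_mean</code>, and writes the indicator as <code>pi_ind</code> to avoid clashing with R's built-in constant <code>pi</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Draw theta from its conditional posterior Beta(a + pi, b + 1 - pi).
draw_theta <- function(pi_ind, a = 1, b = 1) {
  rbeta(1, a + pi_ind, b + 1 - pi_ind)
}

# Mean of that Beta distribution: (a + pi) / (a + b + 1).
theta_mean <- function(pi_ind, a = 1, b = 1) {
  (a + pi_ind) / (a + b + 1)
}

theta_mean(c(0, 1))  # conditional means for pi = 0 and pi = 1</code></pre></figure>
<p>With $a = b = 1$ the two conditional means are $1/3$ and $2/3$, which are the bounds mentioned above.</p>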
<h2 id="conditional-posterior-ptau2-mid-beta-pi">Conditional posterior $p(\tau^2 \mid \beta, \pi)$</h2>
<p>The conditional posterior on $\tau^2$ also depends on $\pi$ because conditioning on $\beta$ means conditioning on a collider, inducing the dependency. We expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\tau^2 \mid \beta, \pi) &= \frac{p(\beta \mid \tau^2, \pi) \, p(\pi) \, p(\tau^2)}{\int p(\beta \mid \tau^2, \pi) \, p(\pi) \, p(\tau^2) \, \mathrm{d}\tau^2} \\[.5em]
&= \frac{p(\beta \mid \tau^2, \pi) \, p(\tau^2)}{\int p(\beta \mid \tau^2, \pi) \, p(\tau^2) \, \mathrm{d}\tau^2} \enspace .
\end{aligned} %]]></script>
<p>To make the notation less cluttered, we will call the normalizing constant in this and all following derivations $Z$. Note that terms that do not depend on the parameter of interest in the numerator cancel, as the same terms appear in the normalizing constant. Further note that $\pi$ can be either 0 or 1. We first tackle the $\pi = 1$ case and write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\tau^2 \mid \beta, \pi = 1) &= \frac{1}{Z} \, p(\beta \mid \tau^2, \pi = 1) \, p(\tau^2) \\[0.5em]
&= \frac{1}{Z} \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right) \frac{\left(\frac{s^2}{2}\right)^{\frac{1}{2}}}{\Gamma\left(\frac{1}{2}\right)} \left(\tau^2\right)^{- \frac{1}{2} - 1} \text{exp}\left(-\frac{\frac{s^2}{2}}{\tau^2}\right) \enspace ,
\end{aligned} %]]></script>
<p>where $\Gamma$ is the <a href="https://en.wikipedia.org/wiki/Gamma_function">gamma function</a>. Absorbing everything that does not depend on $\tau^2$ into the normalizing constant, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\tau^2 \mid \beta, \pi = 1) &= \frac{1}{Z} \, \left(\tau^2\right)^{-\frac{1}{2} - 1 -\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 - \frac{\frac{s^2}{2}}{\tau^2} \right) \\[0.5em]
&= \frac{1}{Z} \, \left(\tau^2\right)^{-\left(\frac{1}{2} + \frac{1}{2}\right) - 1} \text{exp}\left(-\frac{\left(\frac{s^2}{2} + \frac{\beta^2}{2\sigma_y^2}\right)}{\tau^2}\right) \enspace ,
\end{aligned} %]]></script>
<p>which is a new inverse Gamma distribution:</p>
<script type="math/tex; mode=display">\tau^2 \mid \beta , \pi = 1 \sim \text{Inverse-Gamma}\left(\frac{1}{2} + \frac{1}{2}, \frac{s^2}{2} + \frac{\beta^2}{2\sigma_y^2}\right) \enspace .</script>
<p>On the other hand, if $\pi = 0$, then $\beta = 0$ and we simply sample from the prior:</p>
<script type="math/tex; mode=display">\tau^2 \mid \beta , \pi = 0 \sim \text{Inverse-Gamma}\left(\frac{1}{2}, \frac{s^2}{2}\right) \enspace .</script>
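<p>The two cases translate into a short sampling function. This is a sketch under the stated model, with the hypothetical name <code>draw_tau2</code>; the defaults $s = 1/2$ and $\sigma_y^2 = 1$ are assumptions for illustration, and the inverse Gamma draw is obtained as the reciprocal of a Gamma variate.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Draw tau^2 from its conditional posterior given beta and the indicator.
draw_tau2 <- function(beta, pi_ind, s = 1/2, sigma2_y = 1) {
  if (pi_ind == 1) {
    # Slab is active: beta carries information about tau^2.
    shape <- 1/2 + 1/2
    rate  <- s^2 / 2 + beta^2 / (2 * sigma2_y)
  } else {
    # Spike is active: beta = 0 is uninformative, so we draw from the prior.
    shape <- 1/2
    rate  <- s^2 / 2
  }
  1 / rgamma(1, shape = shape, rate = rate)
}

set.seed(1)
draw_tau2(beta = 0.8, pi_ind = 1)
draw_tau2(beta = 0, pi_ind = 0)</code></pre></figure>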
<p>Because the derivation is very similar, we look at the conditional posterior $p(\sigma_e^2 \mid y, \beta)$ next.</p>
<h2 id="conditional-posterior-psigma_e2-mid-y-beta">Conditional posterior $p(\sigma_e^2 \mid y, \beta)$</h2>
<p>Again writing the normalizing constant as $Z$, we expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\sigma_e^2 \mid \mathbf{y}, \beta) &= \frac{1}{Z} \, p(\mathbf{y} \mid \beta, \sigma_e^2)\, p(\beta) \, p(\sigma_e^2) \\[1em]
&= \frac{1}{Z} \, \left(2\pi\sigma_e^2\right)^{-n/2} \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \frac{\alpha_2^{\alpha_1}}{\Gamma(\alpha_1)} \left(\sigma_e^2\right)^{- \alpha_1 - 1} \text{exp} \left(-\frac{\alpha_2}{\sigma_e^2}\right) \enspace ,
\end{aligned} %]]></script>
<p>which looks very similar to the conditional posterior on $\tau^2$. In fact, using the same tricks as above — absorbing terms that do not depend on $\sigma_e^2$ into $Z$, and putting terms together — we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\sigma_e^2 \mid \mathbf{y}, \beta) &= \frac{1}{Z} \, \left(\sigma_e^2\right)^{-\frac{n}{2}} \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \left(\sigma_e^2\right)^{- \alpha_1 - 1} \text{exp} \left(-\frac{\alpha_2}{\sigma_e^2}\right) \\[1em]
&= \frac{1}{Z} \, \left(\sigma_e^2\right)^{-\left(\alpha_1 + \frac{n}{2}\right) - 1} \text{exp} \left(-\frac{1}{\sigma_e^2} \left[\alpha_2 + \frac{\sum_{i=1}^n(y_i - \beta x_i)^2}{2}\right]\right) \enspace ,
\end{aligned} %]]></script>
<p>which is again an inverse Gamma distribution:</p>
<script type="math/tex; mode=display">\sigma_e^2 \mid \mathbf{y}, \beta \sim \text{Inverse-Gamma}\left(\alpha_1 + \frac{n}{2}, \alpha_2 + \frac{\sum_{i=1}^n(y_i - \beta x_i)^2}{2}\right) \enspace .</script>
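<p>A sketch of this Gibbs step, checked on simulated data, might look as follows. The function name <code>draw_sigma2_e</code>, the simulated data, and the default hyperparameter values are assumptions for illustration; the draw is again the reciprocal of a Gamma variate.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Draw sigma_e^2 from its conditional posterior given y and beta.
draw_sigma2_e <- function(y, x, beta, alpha1 = 0.01, alpha2 = 0.01) {
  n <- length(y)
  ss <- sum((y - beta * x)^2)  # residual sum of squares at the current beta
  1 / rgamma(1, shape = alpha1 + n / 2, rate = alpha2 + ss / 2)
}

set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200, sd = 2)  # simulated data with true error variance 4

# Repeated draws at the true beta should concentrate near the error variance.
draws <- replicate(5000, draw_sigma2_e(y, x, beta = 0.5))
mean(draws)</code></pre></figure>
<p>Because the shape parameter grows with $n/2$, the conditional posterior tightens around the residual variance as more data come in.</p>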
<p>Contrasting this derivation with the one above, we note something interesting. Our belief about the variance $\sigma_e^2$ gets updated using the $n$ data points $\mathbf{y}$, whereas our belief about $\tau^2$ gets updated using only $\beta$. “In the Bayesian framework, the difference between data and parameters is fuzzy”, McElreath points out (2016, p. 34); or, put even more strongly, Dawid (1979, p.1): “[…] the distinction between data and parameters is largely irrelevant”.</p>
<p>Because the conditional posterior of $\pi$ is quite tricky, we continue with the conditional posterior of $\beta$.</p>
<h2 id="conditional-posterior-pbeta-mid-y-pi-tau2-sigma_e2">Conditional posterior $p(\beta \mid y, \pi, \tau^2, \sigma_e^2)$</h2>
<p>The conditional posterior of $\beta$ given $\pi = 0$ is easy: it is the Dirac delta function $\delta_0$, from which samples will always have value 0. The conditional posterior for $\pi = 1$ is a little more complicated to derive, but not by much. We start by writing:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1, \tau^2) \, p(\pi = 1) \, p(\tau^2)}{\int p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1, \tau^2) \, p(\pi = 1) \, p(\tau^2) \, \mathrm{d} \beta} \\[1em]
&= \frac{p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1, \tau^2)}{\int p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1, \tau^2) \, \mathrm{d} \beta} \enspace ,
\end{aligned} %]]></script>
<p>Writing the normalizing constant again as $Z$ and expanding, we get:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \left(2\pi\sigma_e^2\right)^{-n/2} \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-1/2} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \enspace .
\end{aligned} %]]></script>
<p>We can again absorb terms that do not depend on $\beta$ into $Z$. We proceed:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 -\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n \left(y_i - \beta x_i\right)^2 +\frac{2\sigma_e^2}{2\sigma_y^2\tau^2} \beta^2 \right]\right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n y_i^2 - 2\beta\sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2} \beta^2 \right]\right) \enspace .
\end{aligned} %]]></script>
<p>We can further absorb the $\sum_{i=1}^n y_i^2$ term into $Z$ and put the $\beta^2$ terms together. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \left[\beta^2\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right) - 2\beta\sum_{i=1}^n y_i x_i\right]\right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left[\beta^2 - \frac{2\beta\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right]\right) \enspace .
\end{aligned} %]]></script>
<p>If you have followed my previous blog post (see <a href="https://fdabl.github.io/statistics/Two-Properties.html">here</a>), then you might guess what comes next: completing the square! We expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \text{exp} \left(-\frac{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left[\left(\beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 - \left(\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2\right]\right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left(\beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2\right)
\enspace ,
\end{aligned} %]]></script>
<p>where we have absorbed the last term into the normalizing constant $Z$ because it does not depend on $\beta$. Note that this is the <em>kernel</em> of a Gaussian distribution, which completes our ordeal — which we both enjoy, admit it! — resulting in:</p>
<script type="math/tex; mode=display">% <![CDATA[
\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2 \sim \begin{cases}
\delta_0 & \hspace{1em} \text{if} \hspace{1em} \pi = 0 \\
\mathcal{N}\left(\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2} \right)}, \frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2} \right)}\right) & \hspace{1em} \text{if} \hspace{1em} \pi = 1
\end{cases} %]]></script>
<p>Note again that we take samples from $\delta_0$ to always be zero.</p>
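<p>The resulting Gibbs step for $\beta$ can be sketched as follows. The function name <code>draw_beta</code> and the simulated data are hypothetical, and $\sigma_y^2$ is taken to be the sample variance of $\mathbf{y}$. As a sanity check, with a very diffuse slab (large $\tau^2$) the shrinkage term vanishes and the conditional mean should approach the least-squares estimate for a regression through the origin.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Draw beta from its conditional posterior given the indicator.
draw_beta <- function(y, x, pi_ind, tau2, sigma2_e, sigma2_y = var(y)) {
  if (pi_ind == 0) return(0)  # a draw from the spike is always zero
  denom <- sum(x^2) + sigma2_e / (sigma2_y * tau2)
  rnorm(1, mean = sum(y * x) / denom, sd = sqrt(sigma2_e / denom))
}

set.seed(1)
x <- rnorm(100)
y <- 0.7 * x + rnorm(100)

# Conditional mean for sigma_e^2 = 1 and a very large tau^2 ...
cond_mean <- sum(y * x) / (sum(x^2) + 1 / (var(y) * 1e6))
# ... compared with least squares without an intercept.
ols <- unname(coef(lm(y ~ 0 + x)))
c(conditional_mean = cond_mean, ols = ols)</code></pre></figure>
<p>For small $\tau^2$ the term $\sigma_e^2 / (\sigma_y^2 \tau^2)$ dominates the denominator and pulls the conditional mean toward zero.</p>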
<h2 id="conditional-posterior-ppi-mid-beta-theta-tau2-first-attempt">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$: First attempt</h2>
<p>Applying $d$-separation, the graph tells us that $\pi$ is independent of $\mathbf{y}$ given $\beta$:</p>
<script type="math/tex; mode=display">\pi \perp \mathbf{y} \hspace{.4em} \vert\, \beta \enspace .</script>
<p>This means we can expand in the following way:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi \mid \beta, \tau^2, \theta) &=\frac{p(\beta \mid \pi, \tau^2, \theta) \, \, p(\pi \mid \theta) \, p(\theta) \, p(\tau^2)}{p(\beta \mid \pi = 1, \tau^2, \theta) \, \, p(\pi = 1 \mid \theta) \, p(\theta) \, p(\tau^2) + p(\beta \mid \pi = 0, \tau^2, \theta) \, \, p(\pi = 0 \mid \theta) \, p(\theta) \, p(\tau^2)} \\[1em]
&=\frac{p(\beta \mid \pi, \tau^2, \theta) \, \, p(\pi \mid \theta)}{p(\beta \mid \pi = 1, \tau^2, \theta) \, \, p(\pi = 1 \mid \theta) + p(\beta \mid \pi = 0, \tau^2, \theta) \, \, p(\pi = 0 \mid \theta)}
\enspace ,
\end{aligned} %]]></script>
<p>where we could again cancel terms that were common to both the numerator and denominator. From this, it may come as a surprise that this conditional posterior should be harder than the other ones. Let’s tackle the cases where $\pi = 0$ and $\pi = 1$ in turn; the normalizing constant $Z$ is simply their sum.</p>
<p>We start with $\pi = 1$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \beta, \tau^2, \theta) &= \frac{1}{Z} \, p(\beta \mid \pi = 1, \tau^2, \theta) \, \, p(\pi = 1 \mid \theta) \\[1em]
&= \frac{1}{Z} \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta \enspace ,
\end{aligned} %]]></script>
<p>which looks perfectly reasonable. If $\pi = 0$, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 0 \mid \beta, \tau^2, \theta) &= \frac{1}{Z} \, p(\beta \mid \pi = 0, \tau^2, \theta) \, \, p(\pi = 0 \mid \theta) \\[1em]
&= \frac{1}{Z} \, \delta_0 \, (1 - \theta) \enspace ,
\end{aligned} %]]></script>
<p>which looks peculiar. To see how this bites us, we note that:</p>
<script type="math/tex; mode=display">\pi \mid \beta, \tau^2, \theta \sim \text{Bern}\left(\frac{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta}{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta + \delta_0 \, (1 - \theta)}\right) \enspace .</script>
<p>The issue with this is as follows. Remember that, in the Gibbs sampler, we sample from this conditional posterior using previous samples of $\beta$, $\tau^2$, and $\theta$; call them $\beta^{\small{\star}}$, $\tau^{2\small{\star}}$, and $\theta^{\small{\star}}$, respectively. The previous value $\beta^{\small{\star}}$ depends on the previous sample for $\pi$, denoted $\pi^{\small{\star}}$, such that if $\pi^{\small{\star}} = 0$ then $\beta^{\small{\star}} = 0$. If this happens in the sampling process (and it will), then we have to evaluate $\delta_0\left(\beta^{\small{\star}}\right)$, which is infinite at $\beta^{\small{\star}} = 0$. The denominator of the ratio above is then infinite, so the ratio becomes zero, resulting in a new draw for $\pi$ that is $\pi^{\small{\star}} = 0$. However, this in turn means that the new value for $\beta$ will be $\beta^{\small{\star}} = 0$, and the whole spiel repeats. The Gibbs sampler thus gets forever stuck in the region $\beta = 0$, which means that the Markov chain will not converge to the joint posterior distribution.</p>
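<p>The absorbing behaviour can be illustrated with a small sketch. The helper names <code>dirac0</code> and <code>xi_dirac</code> are hypothetical and not part of the implementation further below, and $\sigma_y^2$ is set to 1 purely for illustration. The sketch simply reads $\delta_0$ literally, as <em>Inf</em> at zero and 0 elsewhere:</p>

```r
# Hypothetical helper: a literal reading of the Dirac delta at zero
dirac0 <- function(beta) ifelse(beta == 0, Inf, 0)

# Chance parameter of the Bernoulli for pi under the first attempt
# (sigma2_y = 1 is an assumption made only for this illustration)
xi_dirac <- function(beta, tau2, theta, sigma2_y = 1) {
  slab <- dnorm(beta, mean = 0, sd = sqrt(sigma2_y * tau2)) * theta
  spike <- dirac0(beta) * (1 - theta)
  slab / (slab + spike)
}

# Once beta is exactly zero, the spike term is Inf and the ratio is zero,
# so the next draw of pi is 0 and beta stays at zero
xi_dirac(beta = 0, tau2 = 1, theta = 0.5)

# For any non-zero beta the spike term vanishes and the ratio is one
xi_dirac(beta = 0.3, tau2 = 1, theta = 0.5)
```

<p>Under this literal reading the chance parameter only ever takes the values 0 and 1, which is another way of seeing that the chain cannot move freely between the two models.</p>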
<p>Before we go back to the drawing board, one might suggest that we could simply set $\delta_0 = 1$, and then carry out the computation needed to draw from the conditional posterior of $\pi$. It runs into the following issue, however. Let $\xi$ be the chance parameter which governs the Bernoulli from which we draw $\pi$. With $\delta_0 = 1$, we have:</p>
<script type="math/tex; mode=display">\xi = \frac{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta}{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta + (1 - \theta)} \enspace .</script>
<p>Now let us assume the previous draw of $\beta$ was $\beta^{\small{\star}} = 0$. For simplicity, let $\theta = \frac{1}{2}$ and $\sigma_y^2 = 1$. This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\xi &= \frac{\left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\tau^2} 0^2\right)\frac{1}{2}}{\left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\tau^2} 0^2\right)\frac{1}{2} + \frac{1}{2}} \\
&= \frac{\left(2\pi\tau^2\right)^{-\frac{1}{2}}}{\left(2\pi\tau^2\right)^{-\frac{1}{2}} + 1} \enspace ,
\end{aligned} %]]></script>
<p>which can never become zero, regardless of the data! If $\tau^2 = 1$, for example, then $\xi = 0.285$. Recall that $\tau^2$ is the variance of the prior assigned to $\beta$. The only way for $\xi$ to approach zero, i.e., to overwhelmingly support the model in which $\beta = 0$, is for $\tau^2$ to become very, very large. This is known as the Jeffreys-Bartlett-Lindley paradox<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>, and it makes sense: a model which assigns all possible values for $\beta$ similar plausibility will make poor predictions. If we had set $\tau^2$ by hand, then we could (artificially) achieve strong support for the null model (not that this is desirable!). However, we have assigned $\tau^2$ a prior, learning its value from data, and so this will practically never happen. Thus, even though $\xi$ approaches zero more closely the larger $\tau^2$ is, we will effectively never find strong support for the model in which $\beta = 0$.</p>
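<p>The bound on $\xi$ is easy to check numerically. The sketch below evaluates the simplified expression for $\beta^{\small{\star}} = 0$, $\theta = \frac{1}{2}$, $\sigma_y^2 = 1$, and $\delta_0 = 1$; the function name <code>xi_at_zero</code> is made up for this illustration:</p>

```r
# xi for beta* = 0, theta = 1/2, sigma_y^2 = 1, with delta_0 set to 1
xi_at_zero <- function(tau2) {
  slab <- dnorm(0, mean = 0, sd = sqrt(tau2))  # equals (2 * pi * tau2)^(-1/2)
  slab / (slab + 1)
}

xi_at_zero(1)  # approximately 0.285, as stated in the text

# xi shrinks as tau2 grows, but it stays strictly positive
xi_at_zero(c(1, 10, 100, 1e4, 1e6))
```

<p>Only by pushing $\tau^2$ to very large values does $\xi$ get close to zero, which is the behaviour described above.</p>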
<p>In sum, we have tried two things to work with the Dirac delta function: (a) take it at face value, and (b) have it return 1 instead of <em>Inf</em>. The first approach led to our Gibbs sampler getting stuck, only sampling values $\beta = 0$. The second approach led to a situation in which we will only ever find bounded support for the model in which $\beta = 0$, regardless of the data. From this, we can easily draw the conclusion that working with the Dirac delta function is a pain! One might therefore be tempted to suggest to <em>stop being so discrete</em>: instead of $\delta_0$, use another Gaussian with a very small variance. This in fact solves the issue, because then instead of evaluating $\delta_0\left(\beta^{\small{\star}}\right)$, which is infinite at $\beta^{\small{\star}} = 0$, we compute the density of $\beta^{\star}$ under a Gaussian distribution; even though it has small variance, it certainly will not return <em>Inf</em>. This is actually the approach by George & McCulloch (1993), who proposed the spike-and-slab prior setup under the name of <em>Stochastic Search Variable Selection</em>. Two issues remain: it may be difficult to choose this small variance in practice, and if it is very small, the Gibbs sampler will still be inefficient. Thus, we have to find another way to get rid of $\delta_0$.</p>
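<p>For comparison, here is a rough sketch of what the continuous-spike alternative would compute. It is not the approach pursued in this post, and the function name <code>xi_cont</code>, the spike variance <code>spike_var = 1e-4</code>, and $\sigma_y^2 = 1$ are all assumptions chosen for illustration:</p>

```r
# Chance parameter for pi with a narrow Gaussian spike in place of delta_0.
# spike_var is a hypothetical tuning constant, not a value from this post.
xi_cont <- function(beta, tau2, theta, spike_var = 1e-4, sigma2_y = 1) {
  log_slab <- dnorm(beta, 0, sqrt(sigma2_y * tau2), log = TRUE) + log(theta)
  log_spike <- dnorm(beta, 0, sqrt(spike_var), log = TRUE) + log(1 - theta)
  # slab / (slab + spike), evaluated on the log scale for stability
  1 / (1 + exp(log_spike - log_slab))
}

xi_cont(beta = 0, tau2 = 1, theta = 0.5)    # small, but finite and positive
xi_cont(beta = 0.5, tau2 = 1, theta = 0.5)  # very close to 1
```

<p>Because both densities are finite everywhere, the sampler can leave $\beta \approx 0$ again, although how easily it does so depends heavily on the chosen spike variance.</p>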
<h2 id="conditional-posterior-ppi-mid-beta-theta-tau2-second-attempt">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$: Second attempt</h2>
<p>In mathematics, it sometimes helps to write things down in a more complicated manner. In our case, we can do so by conditioning on $\mathbf{y}$ and $\sigma_e^2$, even though $\pi$ is independent of them given $\beta$. This might help because we get another likelihood term with which we can play. We again start with $\pi = 1$, yielding:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \mathbf{y}, \sigma_e^2, \beta, \tau^2, \theta)
&= \frac{1}{Z} \, p(\mathbf{y} \mid \sigma_e^2, \pi = 1, \tau^2, \theta, \beta) \, \, p(\beta \mid \tau^2, \pi = 1, \theta) \, p(\pi = 1 \mid \theta) \\[1em]
&= \frac{1}{Z} \,\left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta \enspace .
\end{aligned} %]]></script>
<p>The case where $\pi = 0$ yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 0 \mid \mathbf{y}, \sigma_e^2, \beta, \tau^2, \theta)
&= \frac{1}{Z} \, p(\mathbf{y} \mid \sigma_e^2, \pi = 0, \tau^2, \theta, \beta) \, \, p(\beta \mid \tau^2, \pi = 0, \theta) \, p(\pi = 0 \mid \theta) \\[1em]
&= \frac{1}{Z} \,\left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \delta_0 (1 - \theta) \enspace .
\end{aligned} %]]></script>
<p>Argh! It did not work. Observe that again $\pi$ would be drawn from a Bernoulli, but with a more complicated chance parameter $\xi$ than above:</p>
<script type="math/tex; mode=display">\pi \mid \mathbf{y}, \sigma_e^2, \beta, \tau^2, \theta \sim \text{Bern}\left(\frac{\text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta}{\text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta + \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \delta_0 (1 - \theta)}\right) \enspace ,</script>
<p>where the $\left(2\pi\sigma_e^2\right)^{-\frac{n}{2}}$ term cancels. Still, the denominator features the unholy Dirac delta function $\delta_0$ — the bane of our existence — and we run into the same issue as above.</p>
<p>Exhausted, we ask: should we not try to use a continuous spike instead of the discontinuous Dirac delta? No — let us not give up just yet! I was a bit surprised, however, by how difficult it was to find literature that talked about how to handle the Dirac spike. For example, in a review of Bayesian variable selection methods, O’Hara & Sillanpää (2009) mention the continuous but not the discontinuous spike-and-slab setting. I eventually did find a useful reference (Geweke, 1996) through the paper by George & McCulloch (1997). Motivated by the fact that this problem is indeed <em>not impossible to solve</em>, let’s get back to the drawing board!</p>
<h2 id="conditional-posterior-ppi-mid-beta-theta-tau2-third-attempt">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$: Third attempt</h2>
<p>You may be surprised to hear that the thing that impedes Bayesian inference most is actually of great help here: <em>integration</em>! Upon reflection, this makes sense. How do we get rid of $\beta$, which itself depends on the unholy Dirac delta function? We integrate it out! This time tackling the case $\pi = 0$ first, where $\beta = 0$ with certainty and the integral is trivial, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 0 \mid \mathbf{y}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, p(\mathbf{y} \mid \pi = 0, \sigma_e^2, \tau^2, \theta) \, p(\pi = 0 \mid \theta) \, p(\theta) \, p(\sigma_e^2) \, p(\tau^2) \\[1em]
&= \frac{1}{Z} \, p(\mathbf{y} \mid \pi = 0, \sigma_e^2, \tau^2, \theta) \, p(\pi = 0 \mid \theta) \\[1em]
&= \frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) (1 - \theta)\enspace ,
\end{aligned} %]]></script>
<p>where because $p(\theta)$, $p(\sigma_e^2)$, and $p(\tau^2)$ feature both in the case where $\pi = 0$ and $\pi = 1$, they can be absorbed into $Z$. For $\pi = 1$, the integration bit is a tick more involved. Using the <em>sum</em> and <em>product</em> rules of probability, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \mathbf{y}, \sigma_e^2, \tau^2, \theta) &=
\frac{1}{Z} \, \int p(\mathbf{y}, \beta \mid \pi = 1, \sigma_e^2, \tau^2, \theta) \, p(\pi = 1 \mid \theta) \, \mathrm{d}\beta \\
&= \frac{1}{Z} \, \int p(\mathbf{y} \mid \beta, \pi = 1, \sigma_e^2, \tau^2, \theta) \, p(\beta \mid \pi = 1, \sigma_e^2, \tau^2, \theta) \, p(\pi = 1 \mid \theta) \, \mathrm{d}\beta \\
&= \frac{1}{Z} \, \int \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta \, \mathrm{d}\beta \enspace .
\end{aligned} %]]></script>
<p>This integrand very much looks like the expression we had for the conditional posterior of $\beta$, but unnormalized. So we already know that we will get out the normalizing constant of the conditional posterior of $\beta$, in addition to some other stuff. We put everything that does not depend on $\beta$ outside of the integral:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \mathbf{y}, \sigma_e^2, \tau^2, \theta) &=
\frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \theta \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n y_i^2 - 2 \beta \sum_{i=1}^n x_i y_i + \beta^2 \sum_{i=1}^n x_i^2 \right] \right) \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \, \mathrm{d}\beta \\[1em]
&= \frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \theta \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[-2 \beta \sum_{i=1}^n x_i y_i + \beta^2 \sum_{i=1}^n x_i^2 \right] \right) \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \, \mathrm{d}\beta \enspace ,
\end{aligned} %]]></script>
<p>where we now focus only on the integral, call it $A$, because the margins of these pages are too small.<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> For the integrand, we do the exact same computation as in the derivation of the conditional posterior on $\beta$, except that when “completing the square”, we cannot cancel the leftover constant term. Instead, we put it in front of the integral. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A &= \int \text{exp}\left(-\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left[\left(\beta - \frac{\sum_{i=1}^n x_i y_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 - \frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)^2} \right] \right) \, \mathrm{d}\beta \\[1em]
&= \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)^2} \right) \int \text{exp}\left(-\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left(\beta - \frac{\sum_{i=1}^n x_i y_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 \right) \, \mathrm{d}\beta \\[1em]
&= \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \int \text{exp}\left(-\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left(\beta - \frac{\sum_{i=1}^n x_i y_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 \right) \, \mathrm{d}\beta \\[1em]
&= \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(2\pi\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \enspace ,
\end{aligned} %]]></script>
<p>where the second term of the last line is the normalizing constant of the conditional posterior on $\beta$. Let $\xi$ again be the chance parameter of the Bernoulli distribution from which we draw $\pi$. Then:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
1 - \xi &= \frac{\frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) (1 - \theta)}{\frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \theta \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(2\pi\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} + \frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) (1 - \theta)} \\[1em]
&= \frac{(1 - \theta)}{\left(\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \theta + (1 - \theta)} \enspace .
\end{aligned} %]]></script>
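<p>As a sanity check on the derivation, the closed form for $A$ can be compared against numerical integration. The sketch below uses simulated data and arbitrary parameter values (<code>sigma2_e</code>, <code>tau2</code>, and <code>theta</code> are assumptions picked only for this check), and it also evaluates $\xi$ on the log scale:</p>

```r
set.seed(1)

# Simulated data and parameter values, chosen only for illustration
n <- 20
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
sigma2_e <- 1
sigma2_y <- var(y)
tau2 <- 1
theta <- 0.5

sum_xy <- sum(x * y)
sum_x2 <- sum(x^2)
c_comb <- sum_x2 + sigma2_e / (sigma2_y * tau2)

# Integrand of A, as it appears before completing the square
integrand <- function(beta) {
  exp(-(-2 * beta * sum_xy + beta^2 * sum_x2) / (2 * sigma2_e) -
        beta^2 / (2 * sigma2_y * tau2))
}

A_numeric <- integrate(integrand, -Inf, Inf)$value
A_closed <- exp(sum_xy^2 / (2 * sigma2_e * c_comb)) *
  sqrt(2 * pi * sigma2_e / c_comb)

# Chance parameter xi = p(pi = 1 | ...), computed on the log scale
l0 <- log(1 - theta)
l1 <- log(theta) - 0.5 * log(sigma2_y * tau2) +
  sum_xy^2 / (2 * sigma2_e * c_comb) + 0.5 * log(sigma2_e / c_comb)
xi <- 1 / (1 + exp(l0 - l1))

c(A_numeric = A_numeric, A_closed = A_closed, xi = xi)
```

<p>The two values of $A$ should agree up to the tolerance of <code>integrate</code>, and $\xi$ lies strictly between 0 and 1 without any reference to $\delta_0$.</p>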
<p>Note that this arduous adventure got rid of our nemesis, $\delta_0$. After this third and final attempt, we may take a short rest. Here is a visual break:</p>
<p><img src="../assets/img/Amsterdam-visual-break.jpg" /></p>
<p>In the remainder of the blog post, we will (a) implement this in R, (b) generalize it to $p > 1$ variables, and (c) apply it to some real data.</p>
<h2 id="implementation-in-r">Implementation in R</h2>
<p>The code below implements the spike-and-slab regression for a single predictor ($p = 1$):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Spike-and-Slab Regression using Gibbs Sampling for p = 1 predictors</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @param y: vector of responses</span><span class="w">
</span><span class="cd">#' @param x: vector of predictor values</span><span class="w">
</span><span class="cd">#' @param nr_samples: indicates number of samples drawn</span><span class="w">
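</span><span class="cd">#' @param a1: parameter a1 of Inverse-Gamma prior on variance sigma2e</span><span class="w">
</span><span class="cd">#' @param a2: parameter a2 of Inverse-Gamma prior on variance sigma2e</span><span class="w">
</span><span class="cd">#' @param theta: parameter of prior over mixture weight</span><span class="w">
</span><span class="cd">#' @param s: parameter s of Inverse-Gamma prior on variance tau2</span><span class="w">
</span><span class="cd">#' @param a: parameter a of Beta prior on theta</span><span class="w">
</span><span class="cd">#' @param b: parameter b of Beta prior on theta</span><span class="w">
</span><span class="cd">#' @param nr_burnin: number of samples we discard ('burnin samples')</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @returns matrix of posterior samples from parameters pi, beta, sigma2e, tau2, theta</span><span class="w">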
</span><span class="n">ss_regress_univ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="p">,</span><span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">nr_samples</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># res is where we store the posterior samples</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'pi'</span><span class="p">,</span><span class="w"> </span><span class="s1">'beta'</span><span class="p">,</span><span class="w"> </span><span class="s1">'sigma2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'theta'</span><span class="p">)</span><span class="w">
</span><span class="c1"># take the MLE estimate as the values for the first sample</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute these quantities only once</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">var_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># we start running the Gibbs sampler</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># first, get all the values of the previous time point</span><span class="w">
</span><span class="n">pi_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">beta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">sigma2_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">tau2_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">theta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">]</span><span class="w">
</span><span class="c1">## Start sampling from the conditional posterior distributions</span><span class="w">
</span><span class="c1">##############################################################</span><span class="w">
</span><span class="c1"># sample theta from a Beta</span><span class="w">
</span><span class="n">theta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbeta</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample sigma2e from an Inverse Gamma</span><span class="w">
</span><span class="n">sigma2_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">((</span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="o">*</span><span class="n">beta_prev</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample tau2 from an Inverse Gamma</span><span class="w">
</span><span class="n">tau2_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="o">^</span><span class="m">2</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_prev</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">var_y</span><span class="p">))</span><span class="w">
</span><span class="c1"># store this as a variable since it gets computed very often</span><span class="w">
</span><span class="n">var_comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">sigma2_new</span><span class="o">/</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample beta from a Gaussian</span><span class="w">
</span><span class="n">beta_mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_xy</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">var_comb</span><span class="w">
</span><span class="n">beta_var</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sigma2_new</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">var_comb</span><span class="w">
</span><span class="n">beta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">beta_var</span><span class="p">))</span><span class="w">
</span><span class="c1"># compute chance parameter of the conditional posterior of pi (Bernoulli)</span><span class="w">
</span><span class="n">l0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="n">l1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">theta_new</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">sum_xy</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">sigma2_new</span><span class="o">*</span><span class="n">var_comb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">beta_var</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample pi from a Bernoulli</span><span class="w">
</span><span class="n">pi_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l0</span><span class="p">)))</span><span class="w">
</span><span class="c1"># add new samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">beta_new</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2_new</span><span class="p">,</span><span class="w"> </span><span class="n">tau2_new</span><span class="p">,</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># remove the first nr_burnin number of samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="o">-</span><span class="n">seq</span><span class="p">(</span><span class="n">nr_burnin</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
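<p>A note on the Bernoulli step in the sampler above: the chance parameter is the logistic function of the log odds $l_1 - l_0$. When the exponent in $l_1$ is large, exponentiating it directly can overflow. The following minimal sketch is not part of the sampler; the values of l0 and l1 are made up, and it only illustrates that the two forms agree and that the log-odds form stays finite:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># hypothetical log-weights, chosen only for illustration
l0 <- log(1 - 0.5)
l1 <- log(0.5) + 3.2

# direct form, as used in the sampler above
prob_direct <- exp(l1) / (exp(l1) + exp(l0))
# equivalent form on the log-odds scale
prob_stable <- plogis(l1 - l0)
all.equal(prob_direct, prob_stable)

# with an extreme log-weight the direct form breaks down
l1_big <- 800
exp(l1_big) / (exp(l1_big) + exp(l0))  # NaN, because exp(800) overflows to Inf
plogis(l1_big - l0)                    # 1</code></pre></figure>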
<h2 id="example-application-i">Example application I</h2>
<p>Here, we simply simulate some data to see whether we can recover the coefficient.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gen_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2e</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">sigma2e</span><span class="p">))</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="s1">'X'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gen_dat</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2e</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ss_regress_univ</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## pi beta sigma2 tau2 theta
## [1,] 1 0.2906086 0.6597533 0.90971999 0.7347514
## [2,] 1 0.1211445 0.8227258 0.19379877 0.8147812
## [3,] 1 0.2482826 0.8256208 0.21308479 0.9398529
## [4,] 1 0.2698416 0.8924097 1.27511931 0.2272394
## [5,] 1 0.2569462 0.8575250 9.26546148 0.3319193
## [6,] 1 0.3302473 0.7589350 0.05923922 0.8465538</code></pre></figure>
<p>The samples for $\beta$ are from its marginal distribution, that is, from the distribution weighted by the uncertainty about each model. We can plot this model-averaged posterior:</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto;" /></p>
<p>In this case, we have two models:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{M}_0&: \mathbf{y} = \mathbf{0} \\[0.5em]
\mathcal{M}_1&: \mathbf{y} = \mathbf{0} + \beta \mathbf{x} \enspace ,
\end{aligned} %]]></script>
<p>where we, for simplicity, set the intercepts to 0. The dashed grey line indicates the posterior mean for $\beta$ conditional on the model $\mathcal{M}_1$. The dashed black line, on the other hand, indicates the posterior mean for $\beta$ where we have taken the uncertainty across models into account.</p>
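<p>The relationship between the two lines can be made concrete. Because the stored $\beta$ is zero whenever $\pi = 0$, the model-averaged mean equals the posterior inclusion probability times the mean conditional on $\mathcal{M}_1$. The sketch below illustrates this with a small made-up matrix, called toy here, that merely mimics the first two columns of the sampler output; all numbers in it are hypothetical:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># toy stand-in for the 'pi' and 'beta' columns of the sampler output (hypothetical values)
set.seed(1)
pi_draws <- rbinom(1000, 1, 0.8)
beta_draws <- pi_draws * rnorm(1000, 0.3, 0.1)  # beta is stored as zero when pi = 0
toy <- cbind(pi = pi_draws, beta = beta_draws)

# posterior mean of beta conditional on M1 (dashed grey line)
mean_cond <- mean(toy[toy[, 'pi'] == 1, 'beta'])
# model-averaged posterior mean of beta (dashed black line)
mean_avg <- mean(toy[, 'beta'])

# the two are linked through the posterior inclusion probability
all.equal(mean_avg, mean(toy[, 'pi']) * mean_cond)</code></pre></figure>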
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">apply</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## pi beta sigma2 tau2 theta
## 0.8420000 0.2292838 0.9456540 18.3779151 0.6181061</code></pre></figure>
<p>From this, we can also compute the posterior inclusion odds, which are $\frac{0.842}{1 - 0.842} \approx 5.33$. This means that $\mathcal{M}_1$ is about 5 times more likely than $\mathcal{M}_0$. In the short primer on Bayesian inference above, we noted that computing posterior inclusion probabilities requires assigning a prior distribution to models. This brings with it some subtleties, and we will sketch the issue of assigning priors to models at the end of this blog post. In the next section, we generalize our spike-and-slab Gibbs sampling procedure to $p > 1$ variables.</p>
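<p>The odds computation can be reproduced in a couple of lines. The variable names are illustrative, and 0.842 is the posterior mean of $\pi$ taken from the output above:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># posterior inclusion probability, taken from the output above
incl_prob <- 0.842
# posterior inclusion odds in favour of M1
incl_odds <- incl_prob / (1 - incl_prob)
round(incl_odds, 2)</code></pre></figure>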
<!-- One predictor is hardly the common setting in today's high-dimensional world. Luckily, the Gibbs sampling procedure outlined above translates straightforwardly into the multivariable case. In the next section, we discuss how we have to update our conditional posterior distributions in the $p > 1$ setting. We also update the R implementation, and apply the method to a data set with $p = 15$ predictors. -->
<h2 id="allowing-p--1-predictors">Allowing $p > 1$ predictors</h2>
<p>In the case of multiple predictors, the Gibbs sampling procedure changes slightly. We use independent priors over each predictor:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\beta_i &\sim (1 - \pi_i) \, \delta_0 + \pi_i \, \mathcal{N}(0, \sigma_y^2 \tau^2) \\[0.5em]
\pi_i &\sim \text{Bern}(\theta) \\[0.5em]
\theta &\sim \text{Beta}(a, b) \\[0.5em]
\tau^2 &\sim \text{Inverse-Gamma}\left(\frac{1}{2}, \frac{s^2}{2}\right) \\[0.5em]
\sigma_e^2 &\sim \text{Inverse-Gamma}(\alpha_1, \alpha_2) \enspace ,
\end{aligned} %]]></script>
<p>for all $i \in \{1, \ldots, p\}$. We again set $a = b = 1$ and $\alpha_1 = \alpha_2 = 0.01$. Note that $\tau^2$ and $\theta$ are common to all predictors. Let $\mathbf{y} \in \mathbb{R}^{n \times 1}$ be an $n$-dimensional column vector; $\mathbf{X} \in \mathbb{R}^{n \times p}$ be an $n \times p$ matrix; and $\beta \in \mathbb{R}^{p \times 1}$ be a $p$-dimensional column vector. With this notation, the residual sum of squares, which was $\sum_{i=1}^n (y_i - \beta x_i)^2$ previously, becomes $(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$. Similarly, where we previously had $\beta^2$ we now have $\beta^T\beta$.</p>
<!-- The only thing that changes in the conditional posterior distribution is that: **(a)** the conditional posterior of $\theta$ uses all $\pi_i$ as updates, not just one; **(b)** $\beta^2$ and $\sum_{i=1}^n (y_i - \beta x_i)^2$ get replaced with their vector analogues, $\beta^T\beta$ and $(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$; **(d)** the conditional posterior of $\beta$ becomes a $p$-dimensional Gaussian distribution with diagonal covariance matrix; **(e)** the conditional posterior of $\pi_i$ requires -->
<p>In the next sections, I provide the updated conditional posterior distributions and update the R code to handle $p > 1$ predictors. Compared to the univariable case, we simply have to replace scalar quantities with vector quantities, except for the conditional posteriors on $\pi_i$; these again require an integration trick. We tackle the conditional posteriors in turn.</p>
<h3 id="conditional-posterior-ptheta-mid-pi-1">Conditional posterior $p(\theta \mid \pi)$</h3>
<p>The conditional posterior of $\theta$ with $p$ predictors is:</p>
<script type="math/tex; mode=display">\theta \mid \pi \sim \text{Beta}\left(a + \sum_{i=1}^p \pi_i, b + \sum_{i=1}^p (1 - \pi_i) \right) \enspace .</script>
<p>Note that while before the posterior mean of $\theta$ was bounded between $1/3$ and $2/3$, the posterior mean is now bounded between $\frac{1}{2 + p}$ and $\frac{1 + p}{2 + p}$.</p>
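<p>In code, this update amounts to a single draw from a Beta distribution. The following is a minimal sketch with a made-up indicator vector; the variable names are illustrative and not taken from the implementation further below:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
a <- 1; b <- 1
pi_cur <- c(1, 0, 0, 1, 1)  # hypothetical current indicators for p = 5 predictors
p <- length(pi_cur)

# draw theta from its conditional posterior
theta_new <- rbeta(1, a + sum(pi_cur), b + sum(1 - pi_cur))

# the posterior mean is smallest if no predictor is included, largest if all are
mean_lower <- a / (a + b + p)        # 1 / (2 + p) for a = b = 1
mean_upper <- (a + p) / (a + b + p)  # (1 + p) / (2 + p) for a = b = 1
c(mean_lower, mean_upper)</code></pre></figure>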
<h3 id="conditional-posterior-ptau2-mid-beta-pi-1">Conditional posterior $p(\tau^2 \mid \beta, \pi)$</h3>
<p>We again have two cases for $\tau^2$, but they differ slightly from the univariable case: we sample from the prior only if <em>all</em> $\pi_i$’s are zero. Let $\pi = (\pi_1, \ldots, \pi_p)$ be the vector of indicator variables; then:</p>
<script type="math/tex; mode=display">\tau^2 \mid \beta , \pi \sim \text{Inverse-Gamma}\left(\frac{1}{2} + \frac{\sum_{i=1}^p \pi_i}{2}, \frac{s^2}{2} + \frac{\beta^T\beta}{2\sigma_y^2}\right) \enspace .</script>
<p>Note that $\beta_i = 0$ if $\pi_i = 0$, and that we thus sample from the prior if all $\pi_i$’s are zero.</p>
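<p>Base R has no Inverse-Gamma sampler, but if a variable follows a Gamma distribution with a given shape and rate, its reciprocal follows an Inverse-Gamma distribution with the same shape and with scale equal to that rate. A minimal sketch of the update under this parameterization, with made-up current values and illustrative variable names:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
s <- 1/2
var_y <- 2.5                 # hypothetical variance of y
pi_cur <- c(1, 0, 1)         # hypothetical current indicators
beta_cur <- c(0.4, 0, -0.2)  # beta_i is zero wherever pi_i is zero

# Inverse-Gamma draw as the reciprocal of a Gamma(shape, rate) draw
shape <- 1/2 + sum(pi_cur) / 2
rate <- s^2 / 2 + sum(beta_cur^2) / (2 * var_y)
tau2_new <- 1 / rgamma(1, shape, rate)</code></pre></figure>
<p>If all indicators and coefficients are zero, shape and rate reduce to $\frac{1}{2}$ and $\frac{s^2}{2}$, so the draw comes from the prior.</p>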
<h3 id="conditional-posterior-psigma_e2-mid-y-beta-1">Conditional posterior $p(\sigma_e^2 \mid y, \beta)$</h3>
<p>The conditional posterior on $\sigma_e^2$ changes only slightly:</p>
<script type="math/tex; mode=display">\sigma_e^2 \mid \mathbf{y}, \beta \sim \text{Inverse-Gamma}\left(\alpha_1 + \frac{n}{2}, \alpha_2 + \frac{(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)}{2}\right) \enspace .</script>
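<p>The residual sum of squares in matrix form is a one-liner in R. The sketch below uses simulated data and made-up current values, and it treats the conditional posterior as Inverse-Gamma, the conjugate form for a Gaussian variance; the draw is taken as the reciprocal of a Gamma draw with the same shape and with rate equal to the Inverse-Gamma scale:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 50
X <- cbind(rnorm(n), rnorm(n))  # hypothetical design matrix
beta_cur <- c(0.5, 0)           # hypothetical current coefficients
y <- X %*% beta_cur + rnorm(n)
a1 <- 0.01; a2 <- 0.01

# residual sum of squares (y - X beta)^T (y - X beta)
rss <- as.numeric(crossprod(y - X %*% beta_cur))
# draw sigma2e as the reciprocal of a Gamma(shape, rate) draw
sigma2e_new <- 1 / rgamma(1, a1 + n / 2, a2 + rss / 2)</code></pre></figure>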
<h3 id="conditional-posterior-pbeta-mid-y-pi-tau2-sigma_e2-1">Conditional posterior $p(\beta \mid y, \pi, \tau^2, \sigma_e^2)$</h3>
<p>We could write the prior over all $\beta_i$’s as a multivariate Gaussian with a diagonal covariance matrix. With a Gaussian likelihood, this prior is conjugate, such that the conditional posterior on the regression weights $\beta$ is a multivariate Gaussian distribution. We sketch the derivation as it may be interesting in itself. The idea is to write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2) &= \frac{1}{Z} \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) \right) \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^T\beta\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[\mathbf{y}^T\mathbf{y} - 2\beta^T\mathbf{X}^T\mathbf{y} + \beta^T\mathbf{X}^T\mathbf{X}\beta\right] -\frac{1}{2\sigma_y^2\tau^2} \beta^T\beta\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2} \left[- 2\beta^T\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2} + \beta^T\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2}\beta + \frac{1}{\sigma_y^2\tau^2} \beta^T\beta\right]\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2} \left[\beta^T\left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right) \beta - 2\beta^T\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}\right]\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2} \left(\beta - \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}\right)^T \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)\left(\beta - \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}\right)\right) \enspace .
\end{aligned} %]]></script>
<p>Thus, we draw all $\beta_i$’s from:</p>
<script type="math/tex; mode=display">\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2 \sim
\mathcal{N}\left(\left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}, \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\right) \enspace ,</script>
<p>where we then set to zero all $\beta_i$’s for which $\pi_i = 0$.</p>
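<p>A draw from this multivariate Gaussian can be sketched in base R with a Cholesky factor. The data and current parameter values below are made up for illustration, and the variable names are not taken from the implementation that follows:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100; p <- 3
X <- replicate(p, rnorm(n))            # hypothetical data
y <- X %*% c(0.5, 0, -0.3) + rnorm(n)
sigma2e <- 1; tau2 <- 1                # hypothetical current values
var_y <- as.numeric(var(y))
pi_cur <- c(1, 0, 1)                   # hypothetical current indicators

# posterior precision, covariance, and mean
prec <- crossprod(X) / sigma2e + diag(p) / (var_y * tau2)
Sigma <- solve(prec)
mu <- Sigma %*% crossprod(X, y) / sigma2e

# draw from N(mu, Sigma): chol() returns R with t(R) %*% R = Sigma
beta_new <- as.numeric(mu + t(chol(Sigma)) %*% rnorm(p))
# set the coefficients of excluded predictors to zero
beta_new <- beta_new * pi_cur</code></pre></figure>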
<h3 id="conditional-posterior-ppi-mid-beta-theta-tau2">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$</h3>
<p>Because the individual $\pi_i$’s are conditionally independent given $\theta$, the update step is very similar to the univariable case. We compare the case where the $j^{\text{th}}$ element of $\beta$ is zero ($\pi_j = 0$) against the case where it is not zero ($\pi_j = 1$). The other indicator variables, call them $\pi_{-j}$, are whatever their current sample is. Therefore, we need to compute the probability with which $\pi_j = 1$ compared to $\pi_j = 0$, given the same values for $\pi_{-j}$. Let $\xi_j$ denote the probability that we sample $\pi_j = 1$, and let $\beta_{-j}$ denote the vector of regression weights without $\beta_j$, and for which $\beta_i = 0$ if $\pi_i = 0$. We cycle through each $\pi_j$ and compute:</p>
<script type="math/tex; mode=display">\xi_j = \frac{p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta)}{p(\pi_j = 0 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) + p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta)} \enspace .</script>
<p>We then draw $\pi_j$ from a Bernoulli with chance parameter $\xi_j$; we repeat this procedure for all predictors $j \in \{1, \ldots, p\}$. We start with the $\pi_j = 0$ case for which $\beta_j = 0$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 0 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, p(\mathbf{y} \mid \pi_j = 0, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\beta_{-j} \mid \pi_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \, p(\theta) \, p(\tau^2) \, p(\sigma_e^2) \\[.5em]
&= \frac{1}{Z} \, p(\mathbf{y} \mid \pi_j = 0, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \\[.5em]
&= \frac{1}{Z} \, \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)
\right) \prod_{i=1}^p \theta^{\pi_i}(1 - \theta)^{1 - \pi_i} \\[.5em]
&= \frac{1}{Z} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)
\right) (1 - \theta) \enspace ,
\end{aligned} %]]></script>
<p>where we have absorbed the terms that appear both in the posterior for $\pi_j = 0$ and $\pi_j = 1$ into $Z$. Note that in the expression above the prediction is done with only $p - 1$ predictor terms, some of which may be zero and others not, depending on the current sample. We could have written this equivalently with $\mathbf{X}\beta$ with the constraint that $\beta_j = 0$.</p>
<p>The expression for $\pi_j = 1$ requires integrating over $\beta_j$. We start with the expression that already has most of the terms in $Z$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \,p(\mathbf{y} \mid \pi_j = 1, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \\[.5em]
&= \frac{1}{Z} \, \int p(\mathbf{y}, \beta_j \mid \pi_j = 1, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, p(\pi \mid \theta) \, \int p(\mathbf{y} \mid \pi_j = 1, \pi_{-j}, \beta_{-j}, \beta_j, \sigma_e^2, \tau^2, \theta) \, p(\beta_j \mid \pi_j, \tau^2) \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, p(\pi \mid \theta) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) \right) \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \prod_{i=1}^p \theta^{\pi_i}(1 - \theta)^{1 - \pi_i} \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}}\, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}}\, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \enspace .
\end{aligned} %]]></script>
<!-- To single out $\beta_j$ from $\beta$, define $\mathbf{z} = \mathbf{y} - \mathbf{X}_{-j} \beta_{-j}$ -->
<p>To single out $\beta_j$ from $\beta$, define</p>
<script type="math/tex; mode=display">\mathbf{z} = \mathbf{y} - \mathbf{X}_{-j} \beta_{-j} \enspace ,</script>
<p>as the residuals of the regression of $\mathbf{y}$ on $\mathbf{X}_{-j}$.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup> Due to linearity, and writing $x_i$ for the $i^{\text{th}}$ entry of the $j^{\text{th}}$ column of $\mathbf{X}$, we can write</p>
<script type="math/tex; mode=display">\left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) = \sum_{i=1}^n \left(z_i - \beta_j x_i\right)^2 \enspace ,</script>
<p>such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(z_i - \beta_j x_i\right)^2 -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n z_i^2 - 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n z_i^2\right)\int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[- 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \mathbf{z}^T\mathbf{z} \right) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[- 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right) \right) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[- 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \enspace ,
\end{aligned} %]]></script>
<p>which is an integration problem very similar to the one in the univariable case. The same trick applies here: we remove all terms that do not depend on $\beta_j$ from the integral, complete the square, and find the normalizing constant of a Gaussian. In fact, the steps are exactly the same as above, except that we have $z_i$ instead of $y_i$, and so we just give the solution:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right) \right) \, \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i z_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(2\pi\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \enspace .
\end{aligned} %]]></script>
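<p>For reference, the last two factors come from the standard Gaussian integral. The shorthand $A$ and $B$ is introduced here only for illustration: with $A = \frac{1}{\sigma_e^2}\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)$ and $B = \frac{1}{\sigma_e^2}\sum_{i=1}^n x_i z_i$, the exponent inside the integral equals $-\frac{1}{2}\left[A\beta_j^2 - 2B\beta_j\right]$, and completing the square gives</p>
<script type="math/tex; mode=display">\int \text{exp}\left(-\frac{1}{2}\left[A\beta_j^2 - 2B\beta_j\right]\right) \, \mathrm{d}\beta_j = \text{exp}\left(\frac{B^2}{2A}\right) \left(\frac{2\pi}{A}\right)^{\frac{1}{2}} \enspace .</script>
<p>Substituting $A$ and $B$ back yields $\frac{B^2}{2A} = \frac{\left(\sum_{i=1}^n x_i z_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}$ and $\frac{2\pi}{A} = 2\pi\frac{\sigma_e^2}{\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}}$, which are the two factors in the solution.</p>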
<p>The conditional posterior of $\pi_j$ is therefore a Bernoulli distribution with chance parameter $\xi_j$, where the probability of $\pi_j = 0$ is:</p>
<script type="math/tex; mode=display">1 - \xi_j = \frac{(1 - \theta)}{\left(\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i z_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \theta + (1 - \theta)} \enspace ,</script>
<p>where $\mathbf{z}$ (and the predictor values $x_i$) change depending on which $\beta_j$ we currently sample.</p>
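<p>The update for a single indicator can be sketched on the log scale as follows. The data, the current parameter values, and the variable names are made up for illustration; the sketch mirrors the expressions above rather than reproducing the implementation below:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100; p <- 3
X <- replicate(p, rnorm(n))          # hypothetical data
y <- X %*% c(1, 0, -0.3) + rnorm(n)
beta_cur <- c(0.9, 0, -0.25)         # hypothetical current coefficients
sigma2e <- 1; tau2 <- 1; theta <- 0.5
var_y <- as.numeric(var(y))

j <- 1
# residuals after removing the contribution of all predictors except j
z <- y - X[, -j, drop = FALSE] %*% beta_cur[-j]
x <- X[, j]
sum_xz <- sum(x * z)
var_comb <- sum(x^2) + sigma2e / (tau2 * var_y)

# log-weights for pi_j = 0 and pi_j = 1 (common factors dropped)
l0 <- log(1 - theta)
l1 <- log(theta) - .5 * log(tau2 * var_y) +
  sum_xz^2 / (2 * sigma2e * var_comb) + .5 * log(sigma2e / var_comb)

# chance parameter xi_j; plogis(l1 - l0) equals exp(l1) / (exp(l1) + exp(l0))
xi_j <- plogis(l1 - l0)
pi_j_new <- rbinom(1, 1, xi_j)</code></pre></figure>
<p>In a full sweep, the same steps would be repeated for each $j$, using the most recent values of the other coefficients.</p>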
<h2 id="implementation-in-r-1">Implementation in R</h2>
<p>The implementation changes only slightly:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Spike-and-Slab Regression using Gibbs Sampling for p > 1 predictors</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @param y: vector of responses</span><span class="w">
</span><span class="cd">#' @param X: matrix of predictor values</span><span class="w">
</span><span class="cd">#' @param nr_samples: indicates number of samples drawn</span><span class="w">
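</span><span class="cd">#' @param a1: parameter a1 of Inverse-Gamma prior on variance sigma2e</span><span class="w">
</span><span class="cd">#' @param a2: parameter a2 of Inverse-Gamma prior on variance sigma2e</span><span class="w">
</span><span class="cd">#' @param theta: parameter of prior over mixture weight</span><span class="w">
</span><span class="cd">#' @param a: parameter a of Beta prior on theta</span><span class="w">
</span><span class="cd">#' @param b: parameter b of Beta prior on theta</span><span class="w">
</span><span class="cd">#' @param s: parameter s of the Inverse-Gamma(1/2, s^2/2) prior on tau2</span><span class="w">
</span><span class="cd">#' @param nr_burnin: number of samples we discard ('burnin samples')</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @returns matrix of posterior samples from parameters pi, beta, sigma2e, tau2, theta</span><span class="w">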
</span><span class="n">ss_regress</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6000</span><span class="p">,</span><span class="w"> </span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">nr_samples</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="c1"># res is where we store the posterior samples</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">p</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s1">'pi'</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s1">'beta'</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)),</span><span class="w">
</span><span class="s1">'sigma2e'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'theta'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># take the MLE estimate as the values for the first sample</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">),</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute only once</span><span class="w">
</span><span class="n">XtX</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">X</span><span class="w">
</span><span class="n">Xty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="w">
</span><span class="n">var_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="w">
</span><span class="c1"># we start running the Gibbs sampler</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># first, get all the values of the previous time point</span><span class="w">
</span><span class="n">pi_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)]</span><span class="w">
</span><span class="n">beta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">p</span><span class="p">)]</span><span class="w">
</span><span class="n">sigma2e_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">tau2_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">theta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">res</span><span class="p">)]</span><span class="w">
</span><span class="c1">## Start sampling from the conditional posterior distributions</span><span class="w">
</span><span class="c1">##############################################################</span><span class="w">
</span><span class="c1"># sample theta from a Beta</span><span class="w">
</span><span class="n">theta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbeta</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">pi_prev</span><span class="p">),</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">))</span><span class="w">
</span><span class="c1"># sample sigma2e from an Inverse-Gamma</span><span class="w">
</span><span class="n">err</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_prev</span><span class="w">
</span><span class="n">sigma2e_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">err</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">err</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample tau2 from an Inverse Gamma</span><span class="w">
</span><span class="n">tau2_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="w">
</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">pi_prev</span><span class="p">),</span><span class="w">
</span><span class="n">s</span><span class="o">^</span><span class="m">2</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">beta_prev</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_prev</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample beta from multivariate Gaussian</span><span class="w">
</span><span class="n">beta_cov</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qr.solve</span><span class="p">((</span><span class="m">1</span><span class="o">/</span><span class="n">sigma2e_new</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">XtX</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">),</span><span class="w"> </span><span class="n">p</span><span class="p">))</span><span class="w">
</span><span class="n">beta_mean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">beta_cov</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">Xty</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="n">sigma2e_new</span><span class="p">)</span><span class="w">
</span><span class="n">beta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mvtnorm</span><span class="o">::</span><span class="n">rmvnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mean</span><span class="p">,</span><span class="w"> </span><span class="n">beta_cov</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample each pi_j in random order</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># get the betas for which beta_j is zero</span><span class="w">
</span><span class="n">pi0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pi_prev</span><span class="w">
</span><span class="n">pi0</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">bp0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">beta_new</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi0</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute the z variables and the conditional variance</span><span class="w">
</span><span class="n">xj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">bp0</span><span class="w">
</span><span class="n">cond_var</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">xj</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">sigma2e_new</span><span class="o">/</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">))</span><span class="w">
</span><span class="c1"># compute chance parameter of the conditional posterior of pi_j (Bernoulli)</span><span class="w">
</span><span class="n">l0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="n">l1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">theta_new</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">xj</span><span class="o">*</span><span class="n">z</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">sigma2e_new</span><span class="o">*</span><span class="n">cond_var</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma2e_new</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">cond_var</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample pi_j from a Bernoulli</span><span class="w">
</span><span class="n">pi_prev</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">l1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l0</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">pi_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pi_prev</span><span class="w">
</span><span class="c1"># add new samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">beta_new</span><span class="o">*</span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2e_new</span><span class="p">,</span><span class="w"> </span><span class="n">tau2_new</span><span class="p">,</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># remove the first nr_burnin number of samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="o">-</span><span class="n">seq</span><span class="p">(</span><span class="n">nr_burnin</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
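<p>One numerical detail of the sampler above: the Bernoulli probability for $\pi_j$ is computed as $e^{l_1} / (e^{l_1} + e^{l_0})$, which returns <em>NaN</em> once $l_1$ is large enough for the exponential to overflow. A minimal sketch of an algebraically equivalent form based on the logistic function <em>plogis</em> from base R follows; the helper name <em>incl_prob</em> is made up for illustration and is not part of <em>ss_regress</em>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper (not part of ss_regress): inclusion probability
# computed on the log scale, using the identity
# exp(l1) / (exp(l1) + exp(l0)) = 1 / (1 + exp(-(l1 - l0))) = plogis(l1 - l0)
incl_prob <- function(l1, l0) plogis(l1 - l0)

# for moderate values both forms agree
l1 <- -1.2
l0 <- -0.7
naive <- exp(l1) / (exp(l1) + exp(l0))
all.equal(naive, incl_prob(l1, l0))

# for large l1 the naive form computes Inf / Inf, which is NaN in R,
# whereas the log-scale version stays well-defined
exp(800) / (exp(800) + exp(l0))
incl_prob(800, l0)</code></pre></figure>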
<p>We might want to run not only one Markov chain, as <em>ss_regress</em> does, but several; and we might also want to run them in parallel, which is achieved by the following wrapper:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'doParallel'</span><span class="p">)</span><span class="w">
</span><span class="n">registerDoParallel</span><span class="p">(</span><span class="n">cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">
</span><span class="cd">#' Calls the ss_regress function in parallel</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @params same as ss_regress</span><span class="w">
</span><span class="cd">#' @params nr_cores: numeric, number of cores to run ss_regress in parallel</span><span class="w">
</span><span class="cd">#' @returns a matrix of posterior samples in which the nr_cores chains are stacked row-wise</span><span class="w">
</span><span class="n">ss_regressm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6000</span><span class="p">,</span><span class="w">
</span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">nr_samples</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">nr_cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">foreach</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nr_cores</span><span class="p">),</span><span class="w"> </span><span class="n">.combine</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbind</span><span class="p">)</span><span class="w"> </span><span class="o">%dopar%</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">ss_regress</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a1</span><span class="p">,</span><span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a2</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theta</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w">
</span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_burnin</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">samples</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
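<p>Note that <em>.combine = rbind</em> stacks the chains on top of each other, so the returned object does not record which row came from which chain. If the chains are needed separately, for example to check whether they agree, they can be recovered from the row order. The sketch below uses a small toy matrix in place of real sampler output and assumes that every chain contributes the same number of rows and that the chains are stacked in order (the default behaviour of <em>foreach</em>); the names <em>stacked</em>, <em>chain_id</em>, and <em>chains</em> are illustrative:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># toy stand-in for the wrapper's output (an assumption for illustration):
# 4 chains with 3 retained draws each, stacked row-wise
nr_chains <- 4
nr_kept <- 3
set.seed(1)
stacked <- matrix(
  rnorm(nr_chains * nr_kept * 2), ncol = 2,
  dimnames = list(NULL, c('beta1', 'beta2'))
)

# rows 1..nr_kept belong to chain 1, the next nr_kept rows to chain 2, etc.
chain_id <- rep(seq(nr_chains), each = nr_kept)
chains <- lapply(split(as.data.frame(stacked), chain_id), as.matrix)

# per-chain posterior means, one column per chain
sapply(chains, colMeans)</code></pre></figure>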
<h2 id="example-application-ii">Example Application II</h2>
<p>We use a data set on (aggregated) attitudes of clerical employees in a large financial organization. We want to predict the overall rating based on answers to six questions, which are our predictors:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">attitude</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">attitude</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## rating complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34</code></pre></figure>
<p>We $z$-standardize our variables, which forces the intercept to be zero. We do this because we have, for simplicity, neglected to include an intercept in our Gibbs sampling derivations.</p>
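<p>A quick way to convince oneself of this is to fit an ordinary least squares regression with an intercept to the standardized data. The sketch below uses base R’s <em>scale</em> and <em>lm</em>; the object names <em>att_z</em>, <em>fit</em>, and <em>fit0</em> are illustrative:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">data(attitude)
# center and scale every column (mean 0, standard deviation 1)
att_z <- as.data.frame(scale(attitude))

# OLS with an intercept on the standardized data
fit <- lm(rating ~ ., data = att_z)

# the estimated intercept is zero up to floating point error ...
coef(fit)[1]

# ... so dropping the intercept leaves the slopes unchanged
fit0 <- lm(rating ~ . - 1, data = att_z)
all.equal(unname(coef(fit)[-1]), unname(coef(fit0)))</code></pre></figure>
<p>With the intercept out of the way, we standardize the data and run the sampler:</p>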
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">std</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">sd</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">attitude_z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">attitude</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="p">)</span><span class="w">
</span><span class="n">yz</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">attitude_z</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">Xz</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">attitude_z</span><span class="p">[,</span><span class="w"> </span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ss_regressm</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">yz</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Xz</span><span class="p">,</span><span class="w"> </span><span class="n">a1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">post_means</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="n">res_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="w">
</span><span class="n">post_means</span><span class="p">[</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'beta'</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">post_means</span><span class="p">))],</span><span class="w">
</span><span class="n">post_means</span><span class="p">[</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'pi'</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">post_means</span><span class="p">))]</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">res_table</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">Xz</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res_table</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Post. Mean'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Post. Inclusion'</span><span class="p">)</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">res_table</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Post. Mean Post. Inclusion
## complaints 0.601 0.998
## privileges -0.011 0.319
## learning 0.211 0.692
## raises 0.058 0.425
## critical 0.007 0.286
## advance -0.079 0.418</code></pre></figure>
<p>We can also visualize these results:</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-15-1.png" title="plot of chunk unnamed-chunk-15" alt="plot of chunk unnamed-chunk-15" style="display: block; margin: auto;" /></p>
<p>The only predictor we can confidently include is <em>complaints</em>, which has a posterior inclusion probability of 0.998. For the other variables, considerable uncertainty remains as to whether they are associated with the outcome.</p>
<p>As an aside, there are also other options than specifying independent priors over the $\beta$’s, which is what we have done in our setup. The most popular prior specification is based on Zellner’s (1986) $g$-prior:</p>
<script type="math/tex; mode=display">\beta \mid g \sim \mathcal{N}\left(0, g \, \sigma_y^2 \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right) \enspace ,</script>
<p>where $g = \tau^2$ in our terminology. In contrast to our setup, the covariance matrix of this prior is not diagonal but is scaled by $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$. Liang et al. (2008) propose various ways to deal with $g$. One of them, as discussed in this blog post, is to assign $g$ an inverse Gamma distribution, which leads to a (multivariate) marginal Cauchy distribution on $\beta$. Som, Hans, & MacEachern (2016) point out an interesting problem that may arise when using, as we have done in this blog post, a single global $g$ or $\tau^2$ parameter. Li & Clyde (2018) unify various approaches in a general framework that extends to generalized linear models.<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup> In the next section, I briefly sketch some subtleties in assigning a prior to models.</p>
<h2 id="prior-on-models">Prior on Models</h2>
<p>We have seen that the Gibbs sampler with spike-and-slab priors can yield model-averaged parameter estimates as well as posterior inclusion probabilities. However, in the first section of this blog post, I have pointed out that this is only possible once we assign priors to models. Have we done so? Yes, albeit implicitly. We have $2^p$ possible models, where a model simply indexes which of the $\pi_i$’s equal 1 and which equal 0. For example, the model with zero predictors has $\pi = \mathbf{0}$, whereas the model which includes all predictors has $\pi = \mathbf{1}$. Thus, a prior assigned to $\pi_i$ constitutes a prior assigned to models. The independent spike-and-slab prior specification described above yields:</p>
<script type="math/tex; mode=display">\begin{aligned}
p(\pi) = \int \prod_{i=1}^p \theta^{\pi_i} (1 - \theta)^{1 - \pi_i} \, p(\theta) \, \mathrm{d}\theta \enspace .
\end{aligned}</script>
<p>In the next two sections, we will discuss the implications of different choices for $p(\theta)$.</p>
<h3 id="uniform-on-models-non-uniform-on-model-size">Uniform on Models, Non-uniform on Model Size</h3>
<p>Let’s focus on the special case $\theta = \frac{1}{2}$ for a moment. This yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi) &= \prod_{i=1}^p \left(\frac{1}{2}\right)^{\pi_i} \left(1 - \frac{1}{2}\right)^{1 - \pi_i} \\[.5em]
&= \left(\frac{1}{2}\right)^{\sum_{i=1}^p \pi_i} \left(\frac{1}{2}\right)^{p - \sum_{i=1}^p \pi_i} \\[.5em]
&= \frac{1}{2^p} \enspace ,
\end{aligned} %]]></script>
<p>the uniform prior over all models. It may be surprising to hear that this uniform prior over models induces a non-uniform prior on <em>model size</em>. To see this, let’s introduce the new random variable $K = \sum_{i=1}^p \pi_i$, which counts the number of active predictors and thus constitutes the <em>size</em> of a model. Now that we focus on $K$ instead of the individual $\pi_i$’s, we do not care which particular $\pi_i$’s are non-zero, but only how many of them are. As a result, there are ${p \choose k}$ possible ways of obtaining $K = k$ active predictors, and the prior distribution assigned to $K$ becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(K = k) &= {p \choose k} \left(\frac{1}{2}\right)^{k} \left(\frac{1}{2}\right)^{p - k} \\[.5em]
&= {p \choose k} \frac{1}{2^p} \enspace ,
\end{aligned} %]]></script>
<p>which is a Binomial distribution with $\theta = \frac{1}{2}$, encoding the prior expectation that half of the predictor variables will be included. To further see that a uniform prior over models leads to a non-uniform prior over model size, assume that we have $p = 2$ predictors and thus $m = 2^2 = 4$ models. The uniform prior on models assigns a probability of $\frac{1}{4}$ to each of the models, coded in terms of $\pi$ as $[(0, 0), (1, 0), (0, 1), (1, 1)]$. However, there is only ${2 \choose 0} = {2 \choose 2} = 1$ way to get a model that includes zero or both predictors, while there are ${2 \choose 1} = 2$ ways to get a model that includes one predictor. Thus, models of size one (i.e., those that include either $\beta_1$ or $\beta_2$) together get assigned <em>double</em> the probability mass of the model that includes zero predictors or the model that includes both; for a visual illustration, see the figure below.</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-18-1.png" title="plot of chunk unnamed-chunk-18" alt="plot of chunk unnamed-chunk-18" style="display: block; margin: auto;" /></p>
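<p>The counting argument can also be checked by brute force. The following is a minimal sketch rather than part of the analysis above: it enumerates all $2^p$ inclusion vectors for an illustrative $p = 3$, gives each the uniform prior $1 / 2^p$, and sums the prior mass by model size. The names <code>models</code>, <code>prior_model</code>, and <code>prior_size</code> are chosen here for illustration only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># illustrative number of predictors (an assumption, not from the text)
p <- 3

# all 2^p inclusion vectors, one model per row
models <- expand.grid(rep(list(0:1), p))

# uniform prior over models
prior_model <- rep(1 / 2^p, nrow(models))

# model size K = number of active predictors in each model
K <- rowSums(models)

# implied prior on model size: sum the prior mass of all models of equal size
prior_size <- tapply(prior_model, K, sum)
prior_size

# the result should agree with the Binomial(p, 1/2) probabilities
all.equal(as.numeric(prior_size), dbinom(0:p, size = p, prob = 0.5))</code></pre></figure>
<p>Under these assumptions the masses are $\frac{1}{8}, \frac{3}{8}, \frac{3}{8}, \frac{1}{8}$ for sizes zero to three, so intermediate model sizes are favoured even though every individual model is equally likely.</p>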
<h3 id="uniform-on-model-size-non-uniform-on-models">Uniform on Model Size, Non-uniform on Models</h3>
<p>We may be uncomfortable with the prior expectation that half of the variables are included, i.e. that $\theta = \frac{1}{2}$. In our spike-and-slab prior specification above, we have instead assigned $\theta$ a Beta prior. This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi) &= \int \prod_{i=1}^p \theta^{\pi_i} (1 - \theta)^{1 - \pi_i} \, \frac{1}{\text{B}(a, b)} \theta^{a - 1}(1 - \theta)^{b - 1}\, \mathrm{d}\theta \\[.5em]
&= \frac{1}{\text{B}(a, b)} \int \theta^{\sum_{i=1}^p \pi_i + a - 1} (1 - \theta)^{\sum_{i=1}^p (1 - \pi_i) + b - 1} \, \mathrm{d}\theta \\[.5em]
&= \frac{\text{B}\left(a + \sum_{i=1}^p \pi_i, b + \sum_{i=1}^p (1 - \pi_i)\right)}{\text{B}(a, b)} \\[.5em]
&= \frac{\text{B}\left(a + \sum_{i=1}^p \pi_i, b + p - \sum_{i=1}^p \pi_i\right)}{\text{B}(a, b)} \enspace ,
\end{aligned} %]]></script>
<p>where we have recognized the integrand as the kernel of a Beta distribution.</p>
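<p>If you would like to convince yourself of this closed form, one option is to compare it against numerical integration over $\theta$. The sketch below does this for an arbitrary inclusion vector and arbitrary hyperparameters $a = 2$, $b = 3$; the function names <code>prior_closed</code> and <code>prior_numeric</code> and the example values are assumptions made purely for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># closed form: B(a + k, b + p - k) / B(a, b), where k = number of included predictors
prior_closed <- function(incl, a = 1, b = 1) {
  p <- length(incl)
  k <- sum(incl)
  beta(a + k, b + p - k) / beta(a, b)
}

# same quantity, integrating the Bernoulli product against the Beta(a, b) density
prior_numeric <- function(incl, a = 1, b = 1) {
  p <- length(incl)
  k <- sum(incl)
  integrand <- function(theta) theta^k * (1 - theta)^(p - k) * dbeta(theta, a, b)
  integrate(integrand, lower = 0, upper = 1)$value
}

# hypothetical inclusion vector with p = 5 predictors, three of them active
incl_example <- c(1, 0, 1, 1, 0)

c(
  closed  = prior_closed(incl_example, a = 2, b = 3),
  numeric = prior_numeric(incl_example, a = 2, b = 3)
)</code></pre></figure>
<p>The two numbers should agree up to the tolerance of <code>integrate</code>.</p>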
<p>We can again study the implied prior on model size. Using the same intuition as above, the distribution assigned to $K$ becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(K = k) &= {p \choose k} \, \frac{\text{B}(a + k, b + p - k)}{\text{B}(a, b)} \enspace ,
\end{aligned} %]]></script>
<p>which is not a Binomial but a <em>Beta-binomial</em> distribution. Assuming again that we have $p = 2$ predictors and thus $m = 2^2 = 4$ models, and that $a = b = 1$ as above, this setup induces a uniform distribution over $K$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">dbetabin</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dbetabin</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 0.3333333 0.3333333 0.3333333</code></pre></figure>
<p>Conversely, this implies a non-uniform prior over models. In particular, this prior setup assigns more mass to extremely sparse or extremely dense models. To see this, note again that there is only ${2 \choose 0} = {2 \choose 2} = 1$ way to get a model that includes zero or both predictors, while there are ${2 \choose 1} = 2$ ways to get a model that includes one predictor. Because each model size receives the same total mass, each of the two models of size one (i.e., the one that includes $\beta_1$ and the one that includes $\beta_2$) gets assigned only <em>half</em> as much probability mass as the model that includes zero predictors or the model that includes both; for a visual illustration, see the figure below.</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto;" /></p>
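<p>The per-model probabilities can be tabulated directly from the closed-form expression for $p(\pi)$. This is a small sketch for the $p = 2$, $a = b = 1$ case discussed above; the column names <code>pi1</code>, <code>pi2</code>, and <code>prior</code> are my own labels.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">a <- 1; b <- 1; p <- 2

# the four models, coded by their inclusion indicators
models <- expand.grid(pi1 = 0:1, pi2 = 0:1)

# model size of each row (computed before the prior column is added)
k <- rowSums(models)

# prior probability of each individual model under theta ~ Beta(a, b)
models$prior <- beta(a + k, b + p - k) / beta(a, b)
models

# the model probabilities sum to one ...
sum(models$prior)

# ... and aggregating by size recovers the uniform prior on model size
tapply(models$prior, k, sum)</code></pre></figure>
<p>The empty and the full model each receive $\frac{1}{3}$, while the two single-predictor models receive $\frac{1}{6}$ each.</p>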
<p>Especially with a large number of predictors, we might be wary of the assumption that the model which includes no predictor and the model which includes all predictors are the most likely models a priori.<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup> We can think of priors assigned to models and model size as formalizing how <em>sparse</em> we think the part of the world we are modeling is. The wonderful thing about using Bayesian statistics to quantify uncertainty is that these assumptions are out in the open. This by itself does not imply, however, that variable selection ceases to be a difficult and nuanced problem.</p>
<!-- # Issues with Gibbs sampling -->
<!-- Can sample from a joint without it being proper, see Hobert & Casella ([1996](https://www.tandfonline.com/doi/abs/10.1080/01621459.1996.10476714)). -->
<!-- ## Variable selection is harder than you think -->
<!-- Applying the Gibbs sampler using the spike-and-slab prior is also known under *stochastic search variable selection*. At first sight, it looks like a panacea: we get a posterior distribution over all possible $2^p$ models from which, if desired, inclusion Bayes factors can be computed. There are two remaining questions, however, whose answer will disfavour the spike-and-slab as a tool for variable selection in the general regression case. First, what prior over models do we use? Second, is the spike-and-slab prior a good prior, that is, does it fulfill a number of desiderata? [Link](https://www.tandfonline.com/doi/full/10.1080/01621459.2018.1469992) -->
<!-- # Discussion -->
<!-- One issue with the Gibbs sampler is that its efficiency decreases when the variables are correlated. -->
<!-- [data-example](https://cran.r-project.org/web/packages/BAS/vignettes/BAS-vignette.html) -->
<h1 id="conclusion">Conclusion</h1>
<p>If you have stayed with me until the bitter end, awesome! We have covered a lot in this blog post. In particular, we have tackled the problem of variable selection using a Bayesian approach which allowed us to quantify and incorporate uncertainty about parameters as well as models. We have focused on linear regression with spike-and-slab priors and derived a Gibbs sampler for the single and multiple predictor case. Applying this to simulated and real data, we have seen how this leads to model-averaged parameter estimates, as well as uncertainty estimates about whether or not to include a particular predictor variable. Lastly, we have discussed the nuances of assigning priors to models. If you want to read up on any of these topics, I encourage you to check out the references below. Otherwise, hope to see you next month!</p>
<hr />
<p><em>I would like to thank Don van den Bergh, Max Hinne, and Maarten Marsman for discussions about the Gibbs sampler, and Sophia Crüwell for comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Lindley, Dennis (<a href="https://www.amazon.com/Making-Decisions-2nd-Dennis-Lindley/dp/0471908088">1991</a>). <em>Making Decisions (2 ed.)</em>. New Jersey, US: Wiley.</li>
<li>George, E. I. (<a href="https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2000.10474336">2004</a>). The Variable Selection Problem. <em>Journal of the American Statistical Association, 95</em>(452), 1304-1308.</li>
<li>Clyde, M., & George, E. I. (<a href="https://bit.ly/2uzS91Q">2004</a>). Model uncertainty. <em>Statistical Science, 19</em>(1), 81-94.</li>
<li>Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E. J. (<a href="https://psyarxiv.com/wgb64/">2019</a>). A conceptual introduction to Bayesian Model Averaging. doi: 10.31234/osf.io/wgb64.</li>
<li>Robert, C., & Casella, G. (<a href="https://www.jstor.org/stable/23059158">2011</a>). A short history of Markov chain Monte Carlo: Subjective recollections from incomplete data. <em>Statistical Science, 26</em>(1), 102-115.</li>
<li>McElreath, R. (<a href="https://xcelab.net/rm/statistical-rethinking/">2015</a>). <em>Statistical Rethinking: A Bayesian course with examples in R and Stan</em>. London, UK: Chapman and Hall/CRC.</li>
<li>Matthews, R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9639.00013">2001</a>). Storks deliver babies (p = 0.008). <em>Teaching Statistics, 22</em>(2), 36-38.</li>
<li>Dawid, A. P. (<a href="http://www.jmlr.org/proceedings/papers/v6/dawid10a/dawid10a.pdf">2010</a>). Beware of the DAG! In <em>Proceedings of the NIPS 2008 Workshop on Causality. Journal of Machine Learning Research Workshop and Conference Proceedings, (6)</em> 59–86.</li>
<li>Dawid, A. P. (<a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1979.tb01052.x">1979</a>). Conditional independence in statistical theory. <em>Journal of the Royal Statistical Society: Series B (Methodological), 41</em>(1), 1-15.</li>
<li>Casella, G., & George, E. I. (<a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1992.10475878">1992</a>). Explaining the Gibbs sampler. <em>The American Statistician, 46</em>(3), 167-174.</li>
<li>George, E. I., & McCulloch, R. E. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1993.10476353">1993</a>). Variable selection via Gibbs Sampling. <em>Journal of the American Statistical Association, 88</em>(423), 881-889.</li>
<li>O’Hara, R. B., & Sillanpää, M. J. (<a href="https://projecteuclid.org/euclid.ba/1340370391">2009</a>). A review of Bayesian variable selection methods: what, how and which. <em>Bayesian Analysis, 4</em>(1), 85-117.</li>
<li>George, E. I., & McCulloch, R. E. (<a href="https://www.jstor.org/stable/24306083">1997</a>). Approaches for Bayesian variable selection. <em>Statistica Sinica, 7</em>(2), 339-373.</li>
<li>Geweke, J. (<a href="https://bit.ly/2Oy5wIV">1994</a>). Variable selection and model comparison in regression. In <em>Bayesian Statistics 5: Proceedings of the 5<sup>th</sup> Valencia International Meeting</em>, 1-30.</li>
<li>Zellner, A. (1986). On Assessing Prior Distributions and Bayesian Regression Analysis With <em>g</em>-Prior Distributions. In <em>Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti</em>, 233-243. Amsterdam, The Netherlands: Elsevier.</li>
<li>Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (<a href="https://amstat.tandfonline.com/doi/abs/10.1198/016214507000001337">2008</a>). Mixtures of <em>g</em>-priors for Bayesian variable selection. <em>Journal of the American Statistical Association, 103</em>(481), 410-423.</li>
<li>Som, A., Hans, C. M., & MacEachern, S. N. (<a href="https://academic.oup.com/biomet/article-abstract/103/4/993/2659028">2016</a>). A conditional Lindley paradox in Bayesian linear models. <em>Biometrika, 103</em>(4), 993-999.</li>
<li>Li, Y., & Clyde, M. A. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2018.1469992">2018</a>). Mixtures of <em>g</em>-priors in generalized linear models. <em>Journal of the American Statistical Association, 113</em>(524), 1828-1845.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>For a very concise overview of variable selection, see George (<a href="https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2000.10474336">2004</a>). For a good overview of model uncertainty, see Clyde & George (<a href="https://bit.ly/2uzS91Q">2004</a>). For a conceptual introduction to model-averaging, see Hinne, Gronau, van den Bergh, & Wagenmakers (2019). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>For mathematical details, see for example Casella & George (<a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1992.10475878">1992</a>). <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Although I usually try to provide some historical context, this blog post is already quite long. To keep it short, and if you are interested, I recommend you read Robert & Casella (<a href="https://www.jstor.org/stable/23059158">2011</a>). <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>The actual symbol for conditional independence, introduced by Dawid (<a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1979.tb01052.x">1979</a>), differs from $\perp$ in that it has two vertical lines. However, MathJax does not have the correct symbol in its library. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>Dennis Lindley also has a second paradox named after him, see <a href="https://www.bayesianspectacles.org/dennis-lindleys-second-paradox/">here</a> — which is a little tongue in cheek. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Two things. First, you really shouldn’t read these blog posts on your phone! Second, too small margins might remind you of an expression by Fermat, who used this as a justification to not give a proof of his famous last theorem. I recently read an absolutely captivating book about Fermat’s last theorem which might interest you; see <a href="https://www.goodreads.com/book/show/38412.Fermat_s_Enigma">here</a>. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>It is called <em>regressing $\mathbf{y}$ on $\mathbf{X}$</em> <a href="https://stats.stackexchange.com/questions/207425/why-do-we-say-the-outcome-variable-is-regressed-on-the-predictors">because we project the response on the predictors</a>. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>The regression implementation in the <a href="https://richarddmorey.github.io/BayesFactor/">BayesFactor</a> R package is based on the model selection approach discussed in Liang et al. (2008), while the <a href="https://merliseclyde.github.io/BAS/">BAS</a> R package and <a href="https://jasp-stats.org/">JASP</a> use the framework described in Li & Clyde (2018). You might find it insightful to compare the analysis results we have gotten here with the results when using these packages. See <a href="https://gist.github.com/fdabl/58e9a7d27623ec545cc3d1d5fc3dc600">this</a> gist for a comparison. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>It is generally unlikely that there are many large effects; Gelman uses what he calls the <a href="https://statmodeling.stat.columbia.edu/2017/12/15/piranha-problem-social-psychology-behavioral-economics-button-pushing-model-science-eats/"><em>Piranha argument</em></a> to justify this claim: if there were many large effects, then they would interfere with each other. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderTwo properties of the Gaussian distribution2019-02-28T10:30:00+00:002019-02-28T10:30:00+00:00https://fabiandablander.com/statistics/Two-Properties<!-- In a previous blog post, we looked talked about the method of least squares, a development in statistics Stigler deems as important as calculus for mathematics. -->
<p>In a <a href="https://fdabl.github.io/statistics/Curve-Fitting-Gaussian.html">previous</a> blog post, we looked at the history of least squares, how Gauss justified it using the Gaussian distribution, and how Laplace justified the Gaussian distribution using the central limit theorem. The Gaussian distribution has a number of special properties which distinguish it from other distributions and which make it easy to work with mathematically. In this blog post, I will focus on two of these properties: being closed under (a) <em>marginalization</em> and (b) <em>conditioning</em>. This means that, if one starts with a $p$-dimensional Gaussian distribution and marginalizes out or conditions on one or more of its components, the resulting distribution will still be Gaussian.</p>
<p>This blog post has two parts. First, I will introduce the joint, marginal, and conditional Gaussian distributions for the case of two random variables; an interactive Shiny app illustrates the differences between them. Second, I will show mathematically that the marginal and conditional distribution do indeed have the form I presented in the first part. I will extend this to the $p$-dimensional case, demonstrating that the Gaussian distribution is closed under marginalization and conditioning. This second part is a little heavier on the mathematics, so if you just want to get an intuition you may focus on the first part and simply skip the second part. Let’s get started!</p>
<!-- The figure below shows the *contour lines* of a bivariate Gaussian distribution in blue. This distribution assigns each configuration of the two random variables $X_1$ and $X_2$, i.e., $(x_1, x_2)$, a density. We see that it is somewhat elliptic, which indicates a positive correlation between the variables $X_1$ and $X_2$; therefore, knowing $X_1$ tells us something about $X_2$. If we ignore this information and look at the *marginal* distribution of $X_1$ (the purple line), it looks like a perfectly normal distribution. If we incorporate or *condition* on the information that, in this case, $X_1 = 1.8$, however, we get the conditional distribution (black line). -->
<h1 id="the-land-of-the-gaussians">The Land of the Gaussians</h1>
<p>In the linear regression case discussed <a href="https://fdabl.github.io/statistics/Curve-Fitting-Gaussian.html">previously</a>, we have modeled each individual data point $y_i$ as coming from a <em>univariate conditional</em> Gaussian distribution with mean $\mu = x_i^Tb$ and variance $\sigma^2$. In this blog post, we introduce the random variables $X_1$ and $X_2$ and assume that both are <em>jointly</em> normally distributed; we are going from $p = 1$ to $p = 2$ dimensions. The probability density function changes accordingly — it becomes a function mapping from two to one dimension, i.e., $f: \mathbb{R}^2 \rightarrow \mathbb{R}^+$.</p>
<p>To simplify notation, let $\mathbf{x} = (x_1, x_2)^T$ and $\mathbf{\mu} = (\mu_1, \mu_2)^T$ be two 2-dimensional vectors denoting one observation and the population means, respectively. For simplicity, we set the population means to zero, i.e. $\mathbf{\mu} = (0, 0)$. In one dimension, we had just one parameter for the variance $\sigma^2$; in two dimensions, this becomes a symmetric $2 \times 2$ covariance matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma = \begin{pmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\
\rho \sigma_1\sigma_2 & \sigma_2^2
\end{pmatrix} \enspace , %]]></script>
<p>where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the random variables $X_1$ and $X_2$, respectively, and $\rho$ is the population correlation between the two. The general form of the density function of a $p$-dimensional Gaussian distribution is</p>
<script type="math/tex; mode=display">f(\mathbf{x} \mid \mathbf{\mu}, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp \left(-\frac{1}{2} (\mathbf{x} - \mathbf{\mu})^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}) \right) \enspace ,</script>
<p>where $\mathbf{x}$ and $\mathbf{\mu}$ are $p$-dimensional vectors, $\Sigma^{-1}$ is the $(p \times p)$-dimensional inverse covariance matrix, and $|\Sigma|$ is the determinant of the covariance matrix.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> We focus on the simpler 2-dimensional, zero-mean case. Observe that</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma^{-1} = \frac{1}{|\Sigma|} \begin{pmatrix}\sigma^2_2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma^2_1\end{pmatrix} = \frac{1}{\sigma^2_1 \sigma^2_2(1 - \rho^2)} \begin{pmatrix}\sigma^2_2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma^2_1\end{pmatrix} \enspace , %]]></script>
<p>which we use to expand the bivariate Gaussian density function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) &= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^T \begin{pmatrix}\sigma^2_2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma^2_1\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) \\[1em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \begin{pmatrix} x_1 \sigma^2_2 -x_2\rho \sigma_1 \sigma_2 \\ x_2 \sigma^2_1 -x_1\rho \sigma_1 \sigma_2 \end{pmatrix}^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) \\[1em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] \right) \enspace .
\end{aligned} %]]></script>
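<p>As a sanity check on this expansion, one can evaluate the general matrix form and the expanded bivariate form at the same point and confirm that they agree. The sketch below uses only base R; the function names <code>dmvn_general</code> and <code>dbvn_expanded</code>, the parameter values, and the evaluation point are illustrative assumptions.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># general p-dimensional Gaussian density, written with solve() and det()
dmvn_general <- function(x, mu, Sigma) {
  p <- length(x)
  d <- x - mu
  quad <- as.numeric(t(d) %*% solve(Sigma) %*% d)
  (2 * pi)^(-p / 2) * det(Sigma)^(-1 / 2) * exp(-quad / 2)
}

# expanded zero-mean bivariate form (last line of the derivation)
dbvn_expanded <- function(x1, x2, s1, s2, rho) {
  det_Sigma <- s1^2 * s2^2 * (1 - rho^2)
  quad <- s2^2 * x1^2 - 2 * rho * s1 * s2 * x1 * x2 + s1^2 * x2^2
  exp(-quad / (2 * det_Sigma)) / (2 * pi * sqrt(det_Sigma))
}

# illustrative standard deviations and correlation
s1 <- 1; s2 <- 2; rho <- 0.7
Sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), nrow = 2)

# both forms evaluated at the arbitrary point (0.5, -1)
all.equal(
  dmvn_general(c(0.5, -1), mu = c(0, 0), Sigma = Sigma),
  dbvn_expanded(0.5, -1, s1, s2, rho)
)</code></pre></figure>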
<p>The figure below plots the <em>contour lines</em> of nine different bivariate normal distributions with mean zero, correlations $\rho \in [0, -0.3, 0.7]$, and standard deviations $\sigma_1, \sigma_2 \in [1, 2]$.</p>
<p><img src="/assets/img/2019-02-28-Two-Properties.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In the top row, all bivariate Gaussian distributions have $\rho = 0$ and look like a circle for standard deviations of equal size. The top middle plot is stretched along $X_2$, giving it an elliptical shape. The middle and last row show how the distribution changes for negative ($\rho = -0.3$) and positive ($\rho = 0.7$) correlations.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<p>In the remainder of this blog post, we will take a closer look at two operations: marginalization and conditioning. Marginalizing means ignoring, and conditioning means incorporating information. In the zero-mean bivariate case, marginalizing out $X_2$ results in</p>
<script type="math/tex; mode=display">f(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \text{exp} \left(-\frac{1}{2\sigma_1^2} x_1^2\right) \enspace ,</script>
<p>which is a simple univariate Gaussian distribution with mean $0$ and variance $\sigma_1^2$. On the other hand, incorporating the information that $X_2 = x_2$ results in</p>
<script type="math/tex; mode=display">f(x_1 \mid x_2) = \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \text{exp} \left(-\frac{1}{2\sigma_1^2(1 - \rho^2)} \left(x_1 - \rho \frac{\sigma_1}{\sigma_2} x_2\right)^2\right) \enspace ,</script>
<p>which has mean $\rho \frac{\sigma_1}{\sigma_2} x_2$ and variance $\sigma_1^2 (1 - \rho^2)$. The next section provides two simple examples illustrating the difference between these two types of distributions, as well as a simple Shiny app that allows you to build an intuition for conditioning in the bivariate case.</p>
<h1 id="two-examples-and-a-shiny-app">Two examples and a Shiny app</h1>
<p>Let’s illustrate the difference between marginalization and conditioning with two simple examples. First, assume that the correlation is very high with $\rho = 0.8$, and that $\sigma_1^2 = \sigma_2^2 = 1$. Then, observing for example $X_2 = 2$, our belief about $X_1$ changes such that its mean gets shifted towards the observed $x_2$ value, i.e. $\mu_1 = 0.8 \cdot 2 = 1.6$ (indicated by the dotted line in the Figure below). The variance of $X_1$ gets substantially reduced, from $1$ to $(1 - 0.8^2) = 0.36$. This is what the left part of the Figure below illustrates. If, on the other hand, $\rho = 0$ such that $X_1$ and $X_2$ are not related, then observing $X_2 = 2$ changes neither the mean of $X_1$ (it stays at zero), nor its variance (it stays at 1); see the right part of the Figure below. Note that the marginal and conditional densities are multiplied by a constant to make them more visible.</p>
<p><img src="/assets/img/2019-02-28-Two-Properties.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
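<p>As a further sketch with unequal variances (the numbers here are chosen purely for illustration), take $\rho = 0.5$, $\sigma_1 = 2$, $\sigma_2 = 1$, and suppose we observe $X_2 = 2$. Plugging into the conditional mean and variance from above gives</p>
<script type="math/tex; mode=display">\rho \frac{\sigma_1}{\sigma_2} x_2 = 0.5 \cdot \frac{2}{1} \cdot 2 = 2 \enspace , \qquad \sigma_1^2 (1 - \rho^2) = 4 \cdot (1 - 0.25) = 3 \enspace ,</script>
<p>so the conditional mean of $X_1$ moves from $0$ to $2$, while its variance shrinks from $4$ to $3$. The ratio $\sigma_1 / \sigma_2$ rescales the observation to the units of $X_1$, which is why the shift can differ from $\rho \, x_2$.</p>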
<p>To explore the relation between joint, marginal, and conditional Gaussian distributions, you can play around with a Shiny app following <a href="https://fdabl.shinyapps.io/two-properties/">this</a> link. In the remainder of the blog post, we will prove that the two distributions given above are in fact the marginal and conditional distributions in the two-dimensional case. We will also generalize these results to $p$-dimensional Gaussian distributions.</p>
<h1 id="the-two-rules-of-probability">The two rules of probability</h1>
<p>In the second part of this blog post, we need the two fundamental ‘rules’ of probability: the sum and the product rule.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> The sum rule states that</p>
<script type="math/tex; mode=display">p(x) = \int p(x, y) \, \mathrm{d}y \enspace ,</script>
<p>and the product rule states that</p>
<script type="math/tex; mode=display">p(x, y) = p(x \mid y) \, p(y) = p(y \mid x) \, p(x) \enspace .</script>
<p>In the remainder, we will see that a joint Gaussian distribution can be factorized into a conditional Gaussian and a marginal Gaussian distribution.</p>
<h1 id="property-i-closed-under-marginalization">Property I: Closed under Marginalization</h1>
<p>The first property states that if we <em>marginalize out</em> variables in a multivariate Gaussian distribution, the result is still a Gaussian distribution. The Gaussian distribution is thus <em>closed under marginalization</em>. Below, I will show this for a bivariate Gaussian distribution directly, and for Gaussian distributions of arbitrary dimension by thinking rather than computing. This illustrates that knowing your definitions can help avoid tedious calculations.</p>
<h2 id="2-dimensional-case">2-dimensional case</h2>
<p>To show that the marginalization property holds for the bivariate Gaussian distribution, we need to solve the following integration problem</p>
<script type="math/tex; mode=display">\int_{X_2} f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) \, \mathrm{d} x_2 \enspace ,</script>
<p>and check whether the result is a univariate Gaussian distribution. We tackle the problem head on and expand</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\int_{X_2} \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] \right) \mathrm{d} x_2 \\[0.5em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[\sigma_1^2 x_2^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2\bigg] \right) \mathrm{d} x_2 \\[1em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \bigg[x_2^2 - 2\rho \frac{\sigma_2}{\sigma_1} x_1 x_2\bigg] \right) \mathrm{d} x_2 \enspace .
\end{aligned} %]]></script>
<p>Putting everything that does not involve $x_2$ outside the integral, we’ve come quite far! Note that we can “complete the square”, that is, write</p>
<script type="math/tex; mode=display">x_2^2 - 2\rho\frac{\sigma_2}{\sigma_1} x_1 x_2 = \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2 \enspace .</script>
<p>This leads to</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \bigg[\left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2\bigg] \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2 + \frac{1}{2\sigma^2_2(1 - \rho^2)} \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2 \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1(1 - \rho^2)} x_1^2 + \frac{1}{2\sigma^2_1(1 - \rho^2)} \rho^2 x_1^2 \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{x_1^2 - \rho^2 x_1^2}{2\sigma^2_1(1 - \rho^2)} \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{x_1^2 (1 - \rho^2)}{2\sigma^2_1(1 - \rho^2)} \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1} x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \enspace .
\end{aligned} %]]></script>
<p>We are nearly done! What’s left is to realize that the integrand is the <em>kernel</em> of a univariate Gaussian distribution with mean $\rho \frac{\sigma_2}{\sigma_1} x_1$ and variance $\sigma_2^2 (1 - \rho^2)$; it’s an unnormalized <em>conditional</em> Gaussian distribution! The thing that makes a Gaussian distribution integrate to 1, as all distributions must, is the normalizing constant in front, the strange term involving $\pi$. For this particular distribution, the kernel integrates to $\sqrt{2\pi \sigma_2^2 (1 - \rho^2)}$, which is the reciprocal of its normalizing constant.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
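<p>As a brief sketch of where the value $\sqrt{2\pi \sigma_2^2 (1 - \rho^2)}$ comes from, write $m = \rho \frac{\sigma_2}{\sigma_1} x_1$ and $s^2 = \sigma_2^2 (1 - \rho^2)$ (shorthand introduced here only for readability) and substitute $u = (x_2 - m) / \sqrt{2 s^2}$, so that $\mathrm{d}x_2 = \sqrt{2 s^2} \, \mathrm{d}u$:</p>
<script type="math/tex; mode=display">\int_{-\infty}^{\infty} \exp \left(-\frac{(x_2 - m)^2}{2 s^2}\right) \mathrm{d}x_2 = \sqrt{2 s^2} \int_{-\infty}^{\infty} e^{-u^2} \, \mathrm{d}u = \sqrt{2 s^2} \sqrt{\pi} = \sqrt{2\pi \sigma_2^2 (1 - \rho^2)} \enspace .</script>
<p>The mean $m$ drops out entirely, which is why the value of the integral does not depend on $x_1$.</p>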
<p>Continuing, we arrive at the solution</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1} x_1^2\right) \sqrt{2\pi \sigma_2^2 (1 - \rho^2)} \\[0.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp \left(-\frac{1}{2\sigma^2_1} x_1^2\right) \enspace ,
\end{aligned} %]]></script>
<p>which is the density function of the univariate Gaussian distribution (with mean zero). With some work, we have shown that marginalizing out a variable in a bivariate Gaussian distribution leads to a univariate Gaussian distribution. This process ‘removes’ any occurrences of the correlation $\rho$ and the other variable $x_2$.</p>
<p>Granted, this process was rather tedious and not at all general (but good practice!) — does it also work when going from 3 to 2 dimensions? Will the remaining bivariate distribution be Gaussian? What if we go from 200 dimensions to 97 dimensions?</p>
<h2 id="p-dimensional-case">$p$-dimensional case</h2>
<p>A more elegant way to see that a $p$-dimensional Gaussian distribution is closed under marginalization is the following. First, we note the requirement that a random variable needs to fulfill in order to have a (multivariate) Gaussian distribution.</p>
<p><em>Definition.</em> $\mathbf{X} = (X_1, \ldots, X_p)^T$ has a multivariate Gaussian distribution if every linear combination of its components has a (multivariate) Gaussian distribution. Formally,</p>
<script type="math/tex; mode=display">\mathbf{X} \sim \mathcal{N}(\mu, \Sigma) \,\,\,\, \text{if and only if} \,\,\,\, A\mathbf{X} \sim \mathcal{N}(A\mu, A\Sigma A^T) \enspace ,</script>
<p>see for example Blitzstein & Hwang (<a href="https://projects.iq.harvard.edu/stat110">2014</a>, pp. 309-310).</p>
<p>Second, from this it immediately follows that any subset of random variables $H \subset X$ is itself normally distributed, with mean and covariance given by simply ignoring all elements that are not in $H$; this is called the <em>marginalization</em> property. In particular, we choose a linear transformation that simply ignores the components we want to marginalize out. As an example, let’s take the <em>trivariate</em> Gaussian distribution</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
X_1 \\
X_2 \\
X_3 \\
\end{pmatrix} \sim \mathcal{N}\left(
\begin{pmatrix}
\mu_1 \\
\mu_2 \\
\mu_3 \\
\end{pmatrix},
\begin{pmatrix}
\sigma_1^2 & & \\
\rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \\
\rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \\
\end{pmatrix}
\right) \enspace , %]]></script>
<p>which has a three-dimensional mean vector and adds a variance and two correlations to the (symmetric) covariance matrix. Define</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0\end{pmatrix} \enspace , %]]></script>
<p>which picks out the components $X_1$ and $X_2$ and ignores $X_3$. Putting this into the equality from the definition, we arrive at</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0\end{pmatrix}
\begin{pmatrix}
X_1 \\
X_2 \\
X_3
\end{pmatrix} &\sim \mathcal{N}\left(
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0\end{pmatrix}
\begin{pmatrix}
\mu_1 \\
\mu_2 \\
\mu_3
\end{pmatrix},
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0\end{pmatrix}
\begin{pmatrix}
\sigma_1^2 & & \\
\rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \\
\rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \\
\end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0\end{pmatrix}
\right) \\[1em]
\begin{pmatrix}
X_1 \\
X_2
\end{pmatrix} &\sim \mathcal{N}\left(
\begin{pmatrix}
\mu_1 \\
\mu_2
\end{pmatrix},
\begin{pmatrix}
\sigma_1^2 & \\
\rho_{12}\sigma_1\sigma_2 & \sigma_2^2
\end{pmatrix}
\right) \enspace .
\end{aligned} %]]></script>
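<p>Strictly speaking, multiplying by the $3 \times 3$ matrix $A$ yields the vector $(X_1, X_2, 0)^T$, and the last step drops the constant third component. A sketch of an equivalent route uses a $2 \times 3$ selection matrix, called $B$ here purely for illustration, which removes $X_3$ directly:</p>
<script type="math/tex; mode=display">% <![CDATA[
B = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \enspace , \qquad
B \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \enspace , \qquad
B \mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix} \enspace , \qquad
B \Sigma B^T = \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \enspace . %]]></script>
<p>Multiplying by $B$ from the left keeps the first two rows of $\Sigma$, and multiplying by $B^T$ from the right keeps the first two columns, so all entries involving $X_3$ disappear.</p>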
<p>Two points to wrap up. First, it helps to know your definitions. Second, in the Gaussian case, computing marginal distributions is trivial. Conditional distributions are a bit harder, unfortunately. But not by much.</p>
<h1 id="property-ii-closed-under-conditioning">Property II: Closed under Conditioning</h1>
<p>Conditioning means incorporating information. The fact that Gaussian distributions are closed under conditioning means that, if we start with a Gaussian distribution and update our knowledge given the observed value of one of its components, then the resulting distribution is still Gaussian — we never have to leave the wonderful land of the Gaussians! In the following, we prove this first for the simple bivariate case, which should also give some intuition as to how conditioning differs from marginalizing, and then provide the more general expression for $p$ dimensions.</p>
<p>Instead of ignoring information, as we did when computing marginal distributions above, we now want to incorporate information we have about the other random variable $X_2$. Conditioning implies <em>learning</em>: how does our knowledge that $X_2 = x_2$ change our knowledge about $X_1$?</p>
<h2 id="2-dimensional-case-1">2-dimensional case</h2>
<p>Let’s say we observe $X_2 = x_2$. How does that change our beliefs about $X_1$? The product rule above leads to Bayes’ rule (via simple division), which is exactly what we need:</p>
<script type="math/tex; mode=display">f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)} \enspace ,</script>
<p>where we have suppressed conditioning on the parameters $\rho, \sigma_1^2, \sigma_2^2$ to avoid cluttered notation. Let’s do some algebra! We write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(x_1 \mid x_2) &= \frac{\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] \right)}{\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp \left( -\frac{1}{2\sigma_2^2} x_2^2\right)} \\[1em]
&= \frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] + \frac{1}{2\sigma_2^2} x_2^2 \right) \enspace ,
\end{aligned} %]]></script>
<p>which already looks promising. Putting the $x_2^2$ term into the square brackets, we should see a nice quadratic expression:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \frac{2\sigma^2_1 \sigma^2_2(1 - \rho^2)}{2\sigma_2^2} x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \sigma^2_1 (1 - \rho^2) x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \sigma_1^2 x_2^2 + \sigma_1^2 \rho^2 x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma^2_1 \rho^2 x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 (1 - \rho^2)} \bigg[x_1^2 - 2\rho \frac{\sigma_1}{\sigma_2} x_1 x_2 + \frac{\sigma^2_1}{\sigma^2_2 } \rho^2 x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 (1 - \rho^2)} \left(x_1 - \rho \frac{\sigma_1}{\sigma_2} x_2 \right)^2 \right) \enspace .
\end{aligned} %]]></script>
<p>Done! The conditional distribution has a mean of $\rho \frac{\sigma_1}{\sigma_2}x_2$ and a variance of $\sigma_1^2(1 - \rho^2)$. What does this look like in $p$ dimensions?</p>
<h2 id="p-dimensional-case-1">$p$-dimensional case</h2>
<p>We need a little bit more notation for the crazy ride we’re about to embark on. Let $\mathbf{x} = (x_1, \ldots, x_n)^T$ be an $n$-dimensional vector and $\mathbf{y} = (y_1, \ldots, y_m)^T$ an $m$-dimensional vector, so that $p = n + m$, and assume that the two are jointly Gaussian distributed with mean zero and covariance matrix $\Sigma \in \mathbb{R}^{(n + m) \times (n + m)}$. Note that we can write $\Sigma$ as a block matrix, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma = \begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{yx} & \Sigma_{yy}
\end{pmatrix} \enspace , %]]></script>
<p>where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariance matrices of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\Sigma_{xy} = (\Sigma_{yx})^T$ gives the covariance between $\mathbf{x}$ and $\mathbf{y}$. We remember the density function of a multivariate Gaussian distribution from above, and take a first stab:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(\mathbf{x} \mid \mathbf{y}) &= \frac{f(\mathbf{x}, \mathbf{y})}{f(\mathbf{y})}\\[.5em]
&= \frac{(2\pi)^{-(n + m) / 2} |\Sigma|^{-1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \Sigma^{-1} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}\right]\right)}{(2\pi)^{-m/2}|\Sigma_{yy}|^{-1/2}\text{exp} \left(-\frac{1}{2} \left[\mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right)} \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{yx} & \Sigma_{yy}
\end{pmatrix}^{-1} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}\right] + \frac{1}{2} \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{yx} & \Sigma_{yy}
\end{pmatrix}^{-1} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \enspace .
\end{aligned} %]]></script>
<p>There’s only a slight problem. The inverse of the block matrix is pretty <a href="https://en.wikipedia.org/wiki/Invertible_matrix#Blockwise_inversion">ugly</a>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma^{-1} = \begin{pmatrix}
\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} \right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\
-\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}
\end{pmatrix} \enspace , %]]></script>
<p>where $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ is the <a href="https://en.wikipedia.org/wiki/Schur_complement">Schur complement</a> of $\Sigma_{yy}$ in the block matrix above. Let’s be lazy and delay computation by simply renaming the relevant parts, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
\Omega = \begin{pmatrix}
\Omega_{xx} & \Omega_{xy} \\
\Omega_{yx} & \Omega_{yy}
\end{pmatrix} = \Sigma^{-1} \enspace . %]]></script>
<p>Proceeding bravely, we write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(\mathbf{x} \mid \mathbf{y}) &= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \begin{pmatrix}
\Omega_{xx} & \Omega_{xy} \\
\Omega_{yx} & \Omega_{yy}
\end{pmatrix} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}
\Omega_{xx} \mathbf{x} + \Omega_{xy} \mathbf{y} \\ \Omega_{yx} \mathbf{x} + \Omega_{yy} \mathbf{y}
\end{pmatrix}^T \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[
\mathbf{x}^T \Omega_{xx} \mathbf{x} + \mathbf{y}^T \Omega_{yx} \mathbf{x} + \mathbf{x}^T \Omega_{xy} \mathbf{y} + \mathbf{y}^T \Omega_{yy} \mathbf{y} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[
\mathbf{x}^T \Omega_{xx} \mathbf{x} + 2\mathbf{x}^T \Omega_{xy} \mathbf{y} + \mathbf{y}^T \left(\Omega_{yy} - \Sigma_{yy}^{-1}\right) \mathbf{y}\right]\right) \enspace ,
\end{aligned} %]]></script>
<p>where we get the last line by noting that $\mathbf{y}^T \Omega_{yx} \mathbf{x} = \left(\mathbf{x}^T \Omega_{xy} \mathbf{y}\right)^T$, i.e. they give the same scalar. It is also important to keep in mind that $ \Omega_{yy} \neq \Sigma_{yy}^{-1}$.</p>
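<p>To see why, consider the zero-mean bivariate case as a quick check, with $\mathbf{x} = x_1$ and $\mathbf{y} = x_2$. Reading off the lower-right entry of the $2 \times 2$ inverse covariance matrix given earlier,</p>
<script type="math/tex; mode=display">\Omega_{yy} = \frac{\sigma_1^2}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)} = \frac{1}{\sigma_2^2 (1 - \rho^2)} \enspace , \qquad \Sigma_{yy}^{-1} = \frac{1}{\sigma_2^2} \enspace , \qquad \Omega_{yy} - \Sigma_{yy}^{-1} = \frac{\rho^2}{\sigma_2^2 (1 - \rho^2)} \enspace ,</script>
<p>so the two coincide only when $\rho = 0$.</p>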
<p>There is hope: we are in a situation analogous to the two-dimensional case described above. Somehow we must be able to “complete the square” in the more general $p$-dimensional case, too.</p>
<p>Scribbling on paper for a bit, we dare to conjecture that the conditional distribution is</p>
<script type="math/tex; mode=display">f(\mathbf{x} \mid \mathbf{y}) = (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y}\right)^T\Omega_{xx}\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right)\right) \enspace ,</script>
<p>which expands into</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left[\mathbf{x}^T\Omega_{xx}\mathbf{x} + \mathbf{x}^T \Omega_{xx} \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} + \mathbf{y}^T \Omega_{xy}^{T} \Omega_{xx}^{-T} \Omega_{xx}\mathbf{x} + \mathbf{y}^T \Omega_{xy}^T \Omega_{xx}^{-T} \Omega_{xx} \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left[\mathbf{x}^T\Omega_{xx}\mathbf{x} + 2\mathbf{x}^T \Omega_{xy} \mathbf{y} + \mathbf{y}^T \Omega_{yx} \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right]\right) \enspace .
\end{aligned} %]]></script>
<p>For our conjecture to be true, it must hold that</p>
<script type="math/tex; mode=display">\Omega_{yy} - \Sigma_{yy}^{-1} = \Omega_{yx} \Omega_{xx}^{-1} \Omega_{xy} \enspace .</script>
<p>Indeed, remember that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\Sigma^{-1} &= \begin{pmatrix}
\Omega_{xx} & \Omega_{xy} \\
\Omega_{yx} & \Omega_{yy}
\end{pmatrix} \\[1em]
&= \begin{pmatrix}
\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} \right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\
-\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}
\end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>and therefore</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\Omega_{yx} \Omega_{xx}^{-1} \Omega_{xy} &= -\Sigma_{yy}^{-1}\Sigma_{yx}\overbrace{\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}
\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)}^{I} \left(
-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\[1em]
&= -\Sigma_{yy}^{-1}\Sigma_{yx} \left(-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\[1em]
&= \Sigma_{yy}^{-1}\Sigma_{yx} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\[1em]
&= \underbrace{\Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}}_{\Omega_{yy}} - \Sigma_{yy}^{-1} \enspace .
\end{aligned} %]]></script>
<p>This means that we have correctly completed the square! To clean up the business of the determinants, note that the determinant of a block matrix <a href="https://en.wikipedia.org/wiki/Determinant#Block_matrices">factors</a> such that</p>
<script type="math/tex; mode=display">|\Sigma| = |\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}| \times |\Sigma_{yy}| \enspace .</script>
<p>Substituting this into our equation for the conditional density, as well as substituting all the $\Omega$’s with $\Sigma$’s, results in</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(\mathbf{x} \mid \mathbf{y}) &= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y}\right)^T\Omega_{xx}\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right)\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}|^{-1/2} \text{exp} \left(-\frac{1}{2}
\left(\mathbf{x} - \Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y}\right)^T \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\left(\mathbf{x} - \Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y}\right)\right) \enspace .
\end{aligned} %]]></script>
<p>Thus, if $(\mathbf{x}, \mathbf{y})$ are jointly normally distributed with mean zero, then incorporating the observed value of $\mathbf{y}$ leads to a conditional distribution $f(\mathbf{x} \mid \mathbf{y})$ that is Gaussian with conditional mean $\Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y}$ and conditional covariance $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$.</p>
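<p>As a sanity check, the bivariate result from above should drop out as a special case. Setting $\mathbf{x} = x_1$ and $\mathbf{y} = x_2$, we have $\Sigma_{xx} = \sigma_1^2$, $\Sigma_{yy} = \sigma_2^2$, and $\Sigma_{xy} = \Sigma_{yx} = \rho \sigma_1 \sigma_2$, so that</p>
<script type="math/tex; mode=display">\Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y} = \frac{\rho \sigma_1 \sigma_2}{\sigma_2^2} x_2 = \rho \frac{\sigma_1}{\sigma_2} x_2 \enspace , \qquad \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} = \sigma_1^2 - \frac{\rho^2 \sigma_1^2 \sigma_2^2}{\sigma_2^2} = \sigma_1^2 (1 - \rho^2) \enspace ,</script>
<p>which matches the conditional mean and variance derived in the 2-dimensional case.</p>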
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen that the Gaussian distribution has two important properties: it is closed under (a) <em>marginalization</em> and (b) <em>conditioning</em>. For the bivariate case, an accompanying <a href="https://fdabl.shinyapps.io/two-properties/">Shiny app</a> hopefully helped to build some intuition about the difference between these two operations.</p>
<p>For the general $p$-dimensional case, we noted that a random variable <em>by definition</em> follows a (multivariate) Gaussian distribution if and only if every linear combination of its components follows a Gaussian distribution. This made it obvious that the Gaussian distribution is closed under marginalization: we simply ignore the components we want to marginalize over in the linear combination.</p>
<p>To show that a Gaussian distribution of arbitrary dimension is closed under conditioning, we had to rely on a mathematical trick called “completing the square”, as well as certain properties of matrices few mortals can remember. In conclusion, I think we should celebrate the fact that frequent operations such as marginalizing and conditioning do not expel us from the wonderful land of the Gaussians.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
<hr />
<p><em>I would like to thank Don van den Bergh and Sophia Crüwell for helpful comments on this blog post.</em></p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>$\Sigma^{-1}$ is the main object of interest in Gaussian graphical models. This is because of another special property of the Gaussian: if the off-diagonal element $(i, j)$ in $\Sigma^{-1}$ is zero, then the variables $X_i$ and $X_j$ are <em>conditionally independent</em> given all the other variables — there is no edge between those two variables in the graph. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You might enjoy training your intuitions about correlations on <a href="http://guessthecorrelation.com/">http://guessthecorrelation.com/</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>See also Dennis Lindley’s paper <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9884.00238"><em>The philosophy of statistics</em></a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>This follows from the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2} \, \mathrm{d}x = \sqrt{\pi}$, see <a href="https://en.wikipedia.org/wiki/Gaussian_integral">here</a>. For more on why $\pi$ and $e$ feature in the Gaussian density, see <a href="https://math.stackexchange.com/questions/28558/what-do-pi-and-e-stand-for-in-the-normal-distribution-formula">this</a>. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The land of the Gaussians is vast: its inhabitants — all Gaussian distributions — are also closed under multiplication and convolution. This might make for a future blog post. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian Dablander
With some work, we have shown that marginalizing out a variable in a bivariate Gaussian distribution leads to a univariate Gaussian distribution. This process ‘removes’ any occurrences of the correlation $\rho$ and the other variable $x_2$. Granted, this process was rather tedious and not at all general (but good practice!). Does it also work when going from 3 to 2 dimensions? Will the remaining bivariate distribution be Gaussian? What if we go from 200 dimensions to 97 dimensions? $p$-dimensional case A more elegant way to see that a $p$-dimensional Gaussian distribution is closed under marginalization is the following. First, we note the requirement that a random variable needs to fulfill in order to have a (multivariate) Gaussian distribution. Definition. $\mathbf{X} = (X_1, \ldots, X_p)^T$ has a multivariate Gaussian distribution if every linear combination of its components has a (multivariate) Gaussian distribution. Formally, see for example Blitzstein & Hwang (2014, pp. 309-310). Second, from this it immediately follows that any subset of random variables $H \subset X$ is itself normally distributed, and its mean and covariance are given by simply ignoring all elements that are not in $H$; this is called the marginalisation property. In particular, we choose a linear transformation that simply ignores the components we want to marginalize out. As an example, let’s take the trivariate Gaussian distribution which has a three-dimensional mean vector and adds a variance and two correlations to the (symmetric) covariance matrix. Define which picks out the components $X_1$ and $X_2$ and ignores $X_3$. Putting this into the equality from the definition, we arrive at Two points to wrap up. First, it helps to know your definitions. Second, in the Gaussian case, computing marginal distributions is trivial. Conditional distributions are a bit harder, unfortunately. But not by much. Property II: Closed under Conditioning Conditioning means incorporating information. 
The fact that Gaussian distributions are closed under conditioning means that, if we start with a Gaussian distribution and update our knowledge given the observed value of one of its components, then the resulting distribution is still Gaussian — we never have to leave the wonderful land of the Gaussians! In the following, we prove this first for the simple bivariate case, which should also give some intuition as to how conditioning differs from marginalizing, and then provide the more general expression for $p$ dimensions. Instead of ignoring information, as we did when computing marginal distributions above, we now want to incorporate information we have about the other random variable $X_2$. Conditioning implies learning: how does our knowledge that $X_2 = x_2$ change our knowledge about $X_1$? 2-dimensional case Let’s say we observe $X_2 = x_2$. How does that change our beliefs about $X_1$? The product rule above leads to Bayes’ rule (via simple division), which is exactly what we need: where we have suppressed conditioning on the parameters $\rho, \sigma_1^2, \sigma_2^2$ to avoid cluttered notation. Let’s do some algebra! We write which already looks promising. Putting the $x_2^2$ term into the angular brackets, we should see a nice quadratic formula: Done! The conditional distribution has a mean of $\rho \frac{\sigma_1}{\sigma_2}x_2$ and a variance of $\sigma_1^2(1 - \rho^2)$. How does this look like in $p$ dimensions? $p$-dimensional case We need a little bit more notation for the crazy ride we’re about to embark on. Let $\mathbf{x} = (x_1, \ldots, x_n)^T$ be an $n$-dimensional vector and $\mathbf{y} = (y_1, \ldots, y_m)^T$ an $m$-dimensional vector which both are jointly Gaussian distributed with covariance matrix $\Sigma \in \mathbb{R}^{(n + m) \times (n + m)}$. 
Note that we can write $\Sigma$ as a block matrix, i.e., where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariance matrices of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\Sigma_{xy} = (\Sigma_{yx})^T$ gives the covariance between $\mathbf{x}$ and $\mathbf{y}$. We remember the density function of a multivariate Gaussian distribution from above, and take a first stab: There’s only a slight problem. The inverse of the block matrix is pretty ugly: where $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ is the Schur complement of $\Sigma_{yy}$ in the block matrix above. Let’s be lazy and delay computation by simply renaming the relevant parts, i.e., Proceeding bravely, we write where we get the last line by noting that $\mathbf{y}^T \Omega_{yx} \mathbf{x} = \left(\mathbf{x}^T \Omega_{xy} \mathbf{y}\right)^T$, i.e. they give the same scalar. It is also important to keep in mind that $\Omega_{yy} \neq \Sigma_{yy}^{-1}$. There is hope: we are in a situation analogous to the two-dimensional case described above. Somehow we must be able to “complete the square” in the more general $p$-dimensional case, too. Scribbling on paper for a bit, we dare to conjecture that the conditional distribution is which expands into For our conjecture to be true, it must hold that Indeed, remember that and therefore This means that we have correctly completed the square! To clean up the business of the determinants, note that the determinant of a block matrix factors such that Substituting this into our equation for the conditional density, as well as substituting all the $\Omega$’s with $\Sigma$’s, results in Thus, if $(\mathbf{x}, \mathbf{y})$ are jointly normally distributed, then incorporating the information that $Y = \mathbf{y}$ leads to a conditional distribution $f(\mathbf{x} \mid \mathbf{y})$ that is Gaussian with conditional mean $\Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y}$ and conditional covariance $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$. 
Conclusion In this blog post, we have seen that the Gaussian distribution has two important properties: it is closed under (a) marginalization and (b) conditioning. For the bivariate case, an accompanying Shiny app hopefully helped to build some intuition about the difference between these two operations. For the general $p$-dimensional case, we noted that a random variable per definition follows a (multivariate) Gaussian distribution if and only if every linear combination of its components follows a Gaussian distribution. This made it obvious that the Gaussian distribution is closed under marginalization — we simply ignore the components we want to marginalize over in the linear combination. To show that an arbitrary dimensional Gaussian distribution is closed under conditioning, we had to rely on a mathematical trick called ‘‘completing the square’’, as well as certain properties of matrices few mortals can remember. In conclusion, I think we should celebrate the fact that frequent operations such as marginalizing and conditioning do not expel us from the wonderful land of the Gaussians.5 I would like to thank Don van den Bergh and Sophia Crüwell for helpful comments on this blogpost. Footnotes $\Sigma^{-1}$ is the main object of interest in Gaussian graphical models. This is because of another special property of the Gaussian: if the off-diagonal element $(i, j)$ in $\Sigma^{-1}$ is zero, then the variables $X_i$ and $X_j$ are conditionally independent given all the other variables — there is no edge between those two variables in the graph. ↩ You might enjoy training your intuitions about correlations on http://guessthecorrelation.com/. ↩ See also Dennis Lindley’s paper The philosophy of statistics. ↩ This follows from the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2} = \sqrt{\pi}$, see here. For more on why $\pi$ and $e$ feature in the Gaussian density, see this. 
↩ The land of the Gaussians is vast: its inhabitants — all Gaussian distributions — are also closed under multiplication and convolution. This might make for a future blog post. ↩Curve fitting and the Gaussian distribution2019-01-11T16:30:00+00:002019-01-11T16:30:00+00:00https://fabiandablander.com/r/Curve-Fitting-Gaussian<p>Judea Pearl said that much of machine learning is just curve fitting<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> — but it is quite impressive how far you can get with that, isn’t it? In this blog post, we will look at the mother of all curve fitting problems: fitting a straight line to a number of points. In doing so, we will engage in some statistical detective work and discover the methods of least squares as well as the Gaussian distribution.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<h2 id="fitting-a-line">Fitting a line</h2>
<p>A straight line in the Euclidean plane is described by an intercept (<script type="math/tex">b_0</script>) and a slope ($b_1$), i.e.,</p>
<script type="math/tex; mode=display">y = b_0 + b_1x \enspace .</script>
<p>We are interested in finding the values for $(b_0, b_1)$, and so we must collect data points $d = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Data collection is often tedious, so let’s do it one point at a time. The first point is $P_1 = (x_1, y_1) = (1, 2)$, and if we plug the values into the equation for the line (i.e., set $x_1 = 1$ and $y_1 = 2$), we get</p>
<script type="math/tex; mode=display">2 = b_0 + 1b_1 \enspace ,</script>
<div style="float: left; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-1-1.png" title="Fitting a line with only one point (underdetermined)." alt="Fitting a line with only one point (underdetermined)." style="display: block; margin: auto auto auto 0;" />
</div>
<p>an equation with two unknowns. We call this system of equations <em>underdetermined</em> because we cannot uniquely solve for $b_0$ and $b_1$; instead, there are infinitely many solutions, all of which satisfy $b_1 = 2 - b_0$; see Figure 1 on the left. However, if we add another point $P_2 = (3, 1)$, the resulting system of equations becomes</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1 \\
1 &= b_0 + 3b_1 \enspace .
\end{aligned} %]]></script>
<p>We have two equations in two unknowns, and this is <em>determined</em> or <em>identified</em>: there is a unique solution for $b_0$ and $b_1$. After some rearranging, we find $b_1 = -0.5$ and $b_0 = 2.5$. This specifies exactly one line, as you can see in Figure 2 on the right.</p>
<div style="float: right; padding: 10px 0px 10px 10px">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-2-1.png" title="Fitting a line with two points (determined)." alt="Fitting a line with two points (determined)." style="display: block; margin: auto 0 auto auto;" />
</div>
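<p>As a quick numerical check, the two equations can also be solved in R. The following is a minimal sketch using base R’s <code>solve()</code>; the object names <code>A</code>, <code>y</code>, and <code>b</code> are chosen here purely for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Coefficient matrix: one row per point, columns are (1, x_i)
A <- rbind(c(1, 1),   # P1 = (1, 2)
           c(1, 3))   # P2 = (3, 1)
y <- c(2, 1)          # observed y values of P1 and P2

# Solve A %*% b = y for b = (b0, b1)
b <- solve(A, y)
b
</code></pre></figure>
<p>This should return $b_0 = 2.5$ and $b_1 = -0.5$, the values found by hand above.</p>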
<p>We could end the blog post here, but that would not be particularly insightful for data analysis problems in which we have more data points. Thus, let’s see where it takes us when we add another point, $P_3 = (2, 2)$. The resulting system of equations becomes</p>
<p><script type="math/tex">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1\\
1 &= b_0 + 3b_1 \\
2 &= b_0 + 2b_1 \enspace ,
\end{aligned} %]]></script></p>
<div style="float: left;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-3-1.png" title="Fitting lines with three points (overdetermined)." alt="Fitting lines with three points (overdetermined)." style="display: block; margin: auto 0 auto auto;" />
</div>
<p>which is <em>overdetermined</em> — we cannot fit a line that passes through all three of these points. We can, for example, fit three <em>separate</em> lines, given by two out of three of the equations; see Figure 3. But which of these lines, if any, is the “best” line?</p>
<h2 id="overfitting-a-curve">(Over)Fitting a curve</h2>
<p>Lacking justification to choose between any of these three, we could reduce the case to one that we have solved already, which is usually a good strategy in mathematics. In particular, we could try to reduce the <em>overdetermined</em> to the <em>determined</em> case. Above, we noticed that we can exactly solve for the two parameters $(b_0, b_1)$ using two data points $\{P_1, P_2\}$. This generalizes such that we can exactly solve for $p$ parameters using $p$ data points. In the problem above, we have three data points, but only two parameters. Let’s bend the notion of a <em>line</em> a bit (call it a <em>curve</em>) and introduce a third parameter $b_2$. But what multiplies this parameter $b_2$ in our equations? It seems we are missing a dimension. To remedy this, let’s add a dimension by simply squaring the $x$ coordinate such that a new point becomes $P_1' = (y_1, x_1, x_1^2)$. The resulting system of equations is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1 + 1b_2\\
1 &= b_0 + 3b_1 + 9b_2\\
2 &= b_0 + 2b_1 + 4b_2 \enspace .
\end{aligned} %]]></script>
<p>To simplify notation, we can write these equations in matrix algebra. Specifically, we write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{y} &= \mathbf{X}\mathbf{b} \\[1em]
\begin{pmatrix}
2 \\ 1 \\ 2
\end{pmatrix} &=
\begin{pmatrix}
1 & 1 & 1\\
1 & 3 & 9\\
1 & 2 & 4
\end{pmatrix} \cdot
\begin{pmatrix}
b_0 \\ b_1 \\ b_2
\end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>where we are again interested in solving for the unknown $\mathbf{b}$. Because this system is <em>determined</em>, we can arrive at the solution by inverting the matrix $\mathbf{X}$, such that $\mathbf{b} = \mathbf{X}^{-1}\mathbf{y}$, where $\mathbf{X}^{-1}$ is the inverse of $\mathbf{X}$. The resulting “line” is shown in Figure 4 on the left.</p>
<div style="float: left;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-4-1.png" title="Fitting a curve with three points" alt="Fitting a curve with three points" style="display: block; margin: auto 0 auto auto;" />
</div>
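<p>The determined three-parameter system can be sketched in a few lines of R. This is only an illustration under the setup above; the names <code>X</code>, <code>y</code>, and <code>b</code> mirror the matrix notation, and <code>solve(X, y)</code> is used in place of forming $\mathbf{X}^{-1}$ explicitly, which gives the same solution.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">x <- c(1, 3, 2)   # x coordinates of P1, P2, P3
y <- c(2, 1, 2)   # y coordinates of P1, P2, P3

# Design matrix with columns 1, x, and x^2
X <- cbind(1, x, x^2)

# Solve X %*% b = y for b = (b0, b1, b2)
b <- solve(X, y)
b

# The fitted curve reproduces all three observations exactly
X %*% b
</code></pre></figure>
<p>For these three points, the solution should be $b_0 = 1$, $b_1 = 1.5$, and $b_2 = -0.5$, and the fitted values equal the observed $y$ values, which is exactly the perfect (over)fit described above.</p>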
<p>There are two issues with this approach. First, it leads to overfitting, that is, while we explain the data at hand well (in fact, we do so perfectly), it might poorly generalize to new data. For example, this curve is so peculiar (and it would get much more peculiar if we had fitted it to more data in the same way) that it is likely that new points lie far away from it. Second, we haven’t really explained anything. In the words of the great R.A. Fisher:</p>
<blockquote>
<p>“[T]he objective of statistical methods is the reduction of data. A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent […] the relevant information contained in the original data.” (Fisher, 1922, p. 311)</p>
</blockquote>
<p>By introducing as many parameters as we have data points, no such reduction has taken place. So let’s go back to our original problem: which line should we draw through three, or more generally, any number of points $n$?</p>
<h2 id="legendre-and-the-best-fit">Legendre and the “best fit”</h2>
<p>Reducing the overdetermined to the determined case did not really work. But there is still one option: reducing it to the underdetermined case. To achieve that, we make the reasonable assumption that each observation is corrupted by noise, such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1 + \epsilon_1 \\
1 &= b_0 + 3b_1 + \epsilon_2 \\
2 &= b_0 + 2b_1 + \epsilon_3 \enspace ,
\end{aligned} %]]></script>
<p>where ($\epsilon_1$, $\epsilon_2$, $\epsilon_3$) are unobserved quantities<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. This introduces another $n$ unknowns! Therefore, we have too few equations for too many parameters — our previously overdetermined system becomes underdetermined.</p>
<p>However, as we saw above, we cannot uniquely solve an underdetermined system of equations. We have to add more constraints. Adrien-Marie Legendre, of whom <a href="https://en.wikipedia.org/wiki/Adrien-Marie_Legendre#/media/File:Legendre.jpg">favourable pictures are difficult to find</a>, proposed what has become known as the <em>method of least squares</em> to solve this problem:</p>
<blockquote>
<p>“Of all the principles which can be proposed for that purpose, I think there is none more general, more exact, and more easy of application, than that of which we made use in the preceding researches, and which consists of rendering the sum of squares of the errors a <em>minimum</em>. By this means, there is established among the errors a sort of equilibrium which, preventing the extremes from exerting an undue influence, is very well fitted to reveal that state of the system which most nearly approaches the truth.” (Legendre, 1805, p. 72-73)</p>
</blockquote>
<div style="float: right; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-5-1.png" title="Fitting the line that minimizes the sum of squared errors." alt="Fitting the line that minimizes the sum of squared errors." style="display: block; margin: auto 0 auto auto;" />
</div>
<p>There is only one line that minimizes the sum of squared errors. Thus, by adding this constraint we can uniquely solve the underdetermined system; see the Figure on the right. The development of least squares was a watershed moment in mathematical statistics — Stephen Stigler likens its importance to the development of calculus in mathematics (Stigler, 1986, p. 11).</p>
<p>We have now seen, conceptually, how to find the “best” fitting line. But how do we do it mathematically? How do we arrive at the Figure on the right? I will illustrate two approaches: the one proposed by Legendre using optimization, and another one using a geometrical insight.</p>
<h2 id="least-squares-i-optimization">Least squares I: Optimization</h2>
<p>Our goal is to find the line that minimizes the <em>sum of squared errors</em>. To simplify, we center the data by subtracting the mean from $y$ and $x$, respectively; i.e., $y' = y - \frac{1}{n} \sum_{i=1}^n y_i$ and $x' = x - \frac{1}{n} \sum_{i=1}^n x_i$. With centered data, the least squares intercept is zero, $b_0 = 0$, and we avoid the need to estimate it. In the following, to avoid cluttering notation, I will omit the prime and assume both $y$ and $x$ are mean-centered.</p>
<p>For a particular observation $y_i$, our line predicts it to be $x_i b_1$. This implies that the error is $\epsilon_i = y_i - x_i b_1$, and the sum of all squared errors is</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - x_i b_1)^2 \enspace .</script>
<p>We want to find the value for $b_1$, call it $\hat{b}_1$, that minimizes this quantity; that is, we must solve</p>
<script type="math/tex; mode=display">\hat{b}_1 = \underset{b_1}{\text{argmin}} \left (\sum_{i=1}^n (y_i - x_i b_1)^2 \right) \enspace .</script>
<p>We could use fancy algorithms like <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a>, but we can also engage in some good old high school mathematics and minimize the expression analytically. We note that the expression is quadratic in $b_1$ with a positive leading coefficient, and thus has a single minimum, which occurs where the derivative is zero. So, to work!</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= \frac{\partial}{\partial b_1} \left (\sum_{i=1}^n (y_i - x_i b_1)^2 \right) \\[0.5em]
0 &= \frac{\partial}{\partial b_1} \left(\sum_{i=1}^n y_i^2 - 2\sum_{i=1}^n y_i x_i b_1 + \sum_{i=1}^n x_i^2 b_1^2 \right) \\[0.5em]
0 &= 0 - 2 \sum_{i=1}^n y_i x_i + 2 \sum_{i=1}^n x_i^2 b_1 \\[0.5em]
2 \sum_{i=1}^n x_i^2 b_1 &= 2 \sum_{i=1}^n y_i x_i \\[0.5em]
\hat{b}_1 &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2} \enspace ,
\end{aligned} %]]></script>
<p>where $\sum_{i=1}^n y_i x_i$ is the (scaled by $n$) covariance between $x$ and $y$, and $\sum_{i=1}^n x_i^2$ is the (scaled by $n$) variance of $x$.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
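<p>The closed-form slope is easy to check numerically. The sketch below applies it to the three points used earlier and compares the result with R’s built-in <code>lm()</code>; the variable names <code>xc</code>, <code>yc</code>, and <code>b1_hat</code> are illustrative choices, not part of the derivation.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">x <- c(1, 3, 2)
y <- c(2, 1, 2)

# Mean-center both variables so that the intercept drops out
xc <- x - mean(x)
yc <- y - mean(y)

# Closed-form least squares slope: sum(x_i * y_i) / sum(x_i^2)
b1_hat <- sum(xc * yc) / sum(xc^2)
b1_hat

# Cross-check against the slope estimated by lm() on the raw data
coef(lm(y ~ x))["x"]
</code></pre></figure>
<p>Both approaches should give a slope of $-0.5$ for these data. Centering changes only the intercept, which is why the slope from <code>lm()</code> on the uncentered data agrees.</p>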
<h2 id="least-squares-ii-projection">Least squares II: Projection</h2>
<p>Another way to think about this problem is <em>geometrically</em>. This requires some linear algebra, and so we better write the system of equations in matrix form. For ease of exposition, we again mean-center the data. First, note that the errors in matrix form yield</p>
<div style="float: left; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-6-1.png" title="Figure illustrating the geometric insight." alt="Figure illustrating the geometric insight." style="display: block; margin: auto 0 auto auto;" />
</div>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{pmatrix} &=
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix} -
\begin{pmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n
\end{pmatrix} b_1 \\[1em]
\mathbf{\epsilon} &= \mathbf{y} - \mathbf{x}b_1 \enspace ,
\end{aligned} %]]></script>
<p>and that the vector of errors is <em>perpendicular</em> to the vector $\mathbf{x}$, that is, the two are at a 90 degree angle to each other; see the Figure on the left. This means that the <em>dot product</em> of the vector of errors and $\mathbf{x}$ is zero, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
x_1 & x_2 & \ldots & x_n
\end{pmatrix}
\begin{pmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{pmatrix} = 0 \enspace , %]]></script>
<p>or $\mathbf{x}^T \mathbf{\epsilon} = 0$, in short. Using this geometric insight, we can derive the least squares solution as follows</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x}^T \mathbf{\epsilon} &= 0 \\[.5em]
\mathbf{x}^T \left(\mathbf{y} - \mathbf{x} b_1 \right) &= 0 \\[.5em]
\mathbf{x}^T \mathbf{x} b_1 &= \mathbf{x}^T \mathbf{y} \\[.5em]
b_1 &= \frac{\mathbf{x}^T \mathbf{y}}{\mathbf{x}^T \mathbf{x}} \\[.5em]
b_1 &= \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \enspace ,
\end{aligned} %]]></script>
<p>which yields the same result as above.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup> As an important special case, note that the least squares solution to a system of equations with only the intercept $b_0$ as unknown, i.e., $y_i = b_0$, yields the mean of $y$. It is this fact that Gauss used to justify the Gaussian distribution as an error distribution, see below.</p>
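<p>Both claims can be verified numerically. The following sketch, again using the three example points and illustrative variable names, checks that the least squares errors are orthogonal to $\mathbf{x}$ and that an intercept-only fit recovers the mean of $y$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">x <- c(1, 3, 2)
y <- c(2, 1, 2)
xc <- x - mean(x)
yc <- y - mean(y)

# Least squares slope via projection: (x^T y) / (x^T x)
b1 <- drop(crossprod(xc, yc) / crossprod(xc))

# Vector of errors under the least squares solution
e <- yc - xc * b1

# Dot product of x and the errors; zero up to floating point error
sum(xc * e)

# Intercept-only model: the least squares estimate equals the mean of y
coef(lm(y ~ 1))
mean(y)
</code></pre></figure>
<p>The dot product should be numerically zero, and the intercept-only estimate should coincide with <code>mean(y)</code>.</p>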
<h2 id="gauss-laplace-and-how-good-is-best">Gauss, Laplace, and “how good is best?”</h2>
<!-- The methods of least squares, first published by Legendre in 1805, is an intuitive method to fit curves. *Of course you would minimize the sum of squared errors*, you might say. We could either predict too high or too low a value, leading to positive or negative errors, respectively. So as to not cancel each other out when summing them, we square the errors them. The square also penalizes large errors more than smaller errors ... *now that I think of it*, you might continue, *why not just take the absolute value? Why square?* -->
<p>The method of least squares yields the “best” fitting line in the sense that it minimizes the sum of squared errors. But without any statements about the stochastic nature of the errors $\mathbf{\epsilon}$, the question of “how good is best?” remains unanswered.</p>
<p>It was Carl Friedrich Gauss who in 1809 couched the least squares problem in <em>probabilistic terms</em>. Specifically, he assumed that each error term $\epsilon_i$ comes from some distribution $\phi$. Using this distribution, the probability (density) of a particular $\epsilon_i$ is large when $\epsilon_i$ is small, that is, when observed and predicted value are close together. Further assuming that the errors are <em>independent and identically</em> distributed, he wanted to find the parameter values which <em>maximize</em></p>
<script type="math/tex; mode=display">\Omega = \phi(\epsilon_1) \cdot \phi(\epsilon_2) \cdot \ldots \cdot \phi(\epsilon_n) = \prod_{i=1}^n \phi(\epsilon_i) \enspace ,</script>
<p>that is, maximize the probability of the errors being small (see also Stigler, 1986, p. 141).<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></p>
<p><img src="../assets/img/Gauss.jpg" align="left" style="padding: 10px 10px 10px 0px;" /></p>
<p>All that is left now is to find the distribution $\phi$. Gauss noticed that he could make some general statements about $\phi$, namely that it should be symmetric and have its maximum at 0. He then <em>assumed</em> that the mean should be the best value for summarizing $n$ measurements $(y_1, \ldots, y_n)$; that is, he assumed that maximizing $\Omega$ should lead to the same solution as minimizing the sum of squared errors when we have one unknown.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></p>
<p>With this circularity — to justify least squares, I assume least squares — he proved that the distribution must be of the form</p>
<script type="math/tex; mode=display">\phi(\epsilon_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left (-\frac{1}{2\sigma^2} \epsilon_i^2 \right) \enspace ,</script>
<p>where $\sigma^2$ is the variance<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>; see the Figure below for three examples. The distribution has become known as the Gaussian distribution, although, in the spirit of Stigler’s law of eponymy<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup>, de Moivre and Laplace had discovered it before Gauss (see also Stahl, 2006). Karl Pearson popularized the term <em>normal distribution</em>, an act for which he seems to have shown some regret:</p>
<blockquote>
<p>“Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ‘abnormal’.” (Pearson, 1920, p. 25)</p>
</blockquote>
<!-- <div style= "float: left; padding: 10px 10px 10px 0px;"> -->
<p><img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-7-1.png" title="Shows three Normal distributions with different variance." alt="Shows three Normal distributions with different variance." style="display: block; margin: auto;" />
<!-- </div> --></p>
<p>Using the Gaussian distribution, the maximization problem becomes</p>
<script type="math/tex; mode=display">\begin{aligned}
\Omega = \prod_{i=1}^n \phi(\epsilon_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left (-\frac{1}{2\sigma^2} \epsilon_i^2 \right) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp \left (-\frac{1}{2\sigma^2} \sum_{i=1}^n \epsilon_i^2 \right) \enspace .
\end{aligned}</script>
<p>Note that the value at which $\Omega$ takes its maximum does not change when we drop the constants and take logarithms. This results in $-\sum_{i=1}^n \epsilon_i^2$ as being the expression to be maximized, which is the same as minimizing its negation, that is, minimizing the sum of squared errors.</p>
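<p>This equivalence can be illustrated numerically. The sketch below maximizes the Gaussian log-likelihood of the errors over the slope with base R’s <code>optimize()</code> and compares the result with the closed-form least squares slope. The function name <code>loglik</code>, the fixed value $\sigma = 1$, and the search interval $[-10, 10]$ are assumptions made for this illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">x <- c(1, 3, 2)
y <- c(2, 1, 2)
xc <- x - mean(x)
yc <- y - mean(y)

# Gaussian log-likelihood of the errors as a function of the slope b1;
# sigma is held fixed because it does not change where the maximum lies
loglik <- function(b1, sigma = 1) {
  sum(dnorm(yc - xc * b1, mean = 0, sd = sigma, log = TRUE))
}

# Numerically maximize over an interval assumed to contain the solution
ml <- optimize(loglik, interval = c(-10, 10), maximum = TRUE)
ml$maximum

# Closed-form least squares slope for comparison
sum(xc * yc) / sum(xc^2)
</code></pre></figure>
<p>Up to the numerical tolerance of <code>optimize()</code>, the maximum likelihood estimate should agree with the least squares slope of $-0.5$.</p>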
<p><img src="../assets/img/Laplace.jpg" align="right" style="padding: 10px 0px 10px 10px" /></p>
<p>The “Newton of France”, Pierre-Simon Laplace, took notice of Gauss’ argument in 1810 and rushed to give the Gaussian error curve a much more beautiful justification. If we take the errors to be themselves aggregates of many (tiny) perturbing influences, then they will be normally distributed by the <em>central limit theorem</em>. So what is this central limit theorem, anyway?</p>
<h2 id="the-central-limit-theorem">The central limit theorem</h2>
<p>The central limit theorem is one of the most stunning theorems of statistics. In the poetic words of Francis Galton</p>
<blockquote>
<p>“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66)</p>
</blockquote>
<p>The theorem basically says that if you have a sequence of independent and identically distributed random variables (what Galton calls “the mob”), and if that sequence has finite variance, then the mean of this sequence, as $n$ grows larger and larger (“the greater the apparent anarchy”), will get closer and closer to being normally distributed. As $n \rightarrow \infty$, the suitably standardized mean in fact converges in distribution to the normal distribution.<sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></p>
<!-- Formally, let $(X_1, X_2, \ldots, X_n)$ be a sequence of $n$ independent and identically distributed random variables with mean $\mu$ and variance $\sigma^2$. Then we have that, by the law of large numbers, that the sample mean $\bar{X}_n$ converges to $\mu$ as $n \rightarrow \infty$. The central limit theorem states that, as $n \leftarrow n$ -->
<!-- $$ -->
<!-- \sqrt{n} \left(\frac{\bar{X}_n - \mu}{\sigma}\right) \rightarrow \mathcal{N}(0, 1) -->
<!-- $$ -->
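<p>Stated a little more formally, in the standard textbook formulation (the symbols $X_i$, $\mu$, $\sigma^2$, and $\bar{X}_n$ are introduced here only for illustration): let $X_1, X_2, \ldots, X_n$ be independent and identically distributed random variables with mean $\mu$ and finite variance $\sigma^2$, and let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$ denote their mean. Then, as $n \rightarrow \infty$,</p>
<script type="math/tex; mode=display">\sqrt{n} \left(\frac{\bar{X}_n - \mu}{\sigma}\right) \xrightarrow{d} \mathcal{N}(0, 1) \enspace .</script>
<p>Put differently, for large $n$ the mean $\bar{X}_n$ is approximately normally distributed with mean $\mu$ and variance $\sigma^2 / n$.</p>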
<p>Laplace realized that, if one takes the errors in the least squares problem to be themselves aggregates (i.e., means) of small influences, then they will be normally distributed. This provides an elegant justification for the least squares solution.</p>
<p>To illustrate, assume that a particular error $\epsilon_i$ is in fact the average of $m = 500$ small irregularities that are independent and identically distributed; for instance, assume these influences follow a uniform distribution. Let’s say we have $n = 200$ observations, thus 200 individual errors. The R code and Figure below illustrate that the error distribution will tend to be Gaussian.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1776</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">200</span><span class="w"> </span><span class="c1"># number of errors</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">500</span><span class="w"> </span><span class="c1"># number of influences one error is made of</span><span class="w">
</span><span class="c1"># compute errors which are themselves aggregates of smaller influences</span><span class="w">
</span><span class="n">errors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="m">-10</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)))</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">errors</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Central Limit Theorem'</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">epsilon</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'grey76'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Frequency'</span><span class="p">,</span><span class="w">
</span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># plot approximate Gaussian density line</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">20</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">12</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre></figure>
<p><img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-8-1.png" title="Illustrates the Central Limit Theorem." alt="Illustrates the Central Limit Theorem." style="display: block; margin: auto;" /></p>
<p>I don’t know about you, but I think this is really cool. We started out with small irregularities that are uniformly distributed. Then we took an average of a bulk ($m = 500$) of those which constitute an error $\epsilon_i$; thus, the error itself is an aggregate. Now, by some fundamental fact about how our world works, the distribution of these errors (here, $n = 200$) can be well approximated by a Gaussian distribution. I can see why, as Galton conjectures, the Greeks would have deified such a law, if only they had known of it.</p>
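<p>To go beyond the visual impression, one can compare a few summary statistics of the simulated errors with those implied by the Gaussian approximation. The sketch below repeats the simulation from above and uses the fact that a uniform distribution on $[a, b]$ has variance $(b - a)^2 / 12$. The choice of quantiles is an assumption made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1776)
n <- 200  # number of errors
m <- 500  # number of influences one error is made of
errors <- replicate(n, mean(runif(m, -10, 10)))

# theoretical standard deviation of a mean of m Uniform(-10, 10) draws
se <- sqrt(20^2 / 12) / sqrt(m)

# empirical moments next to the theoretical standard deviation
c(mean = mean(errors), sd = sd(errors), theoretical_sd = se)

# empirical quantiles next to those of the Gaussian approximation
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(
  empirical = quantile(errors, probs),
  gaussian  = qnorm(probs, mean = 0, sd = se)
)</code></pre></figure>
<p>If the approximation is good, the empirical mean should be close to zero, and the empirical standard deviation and quantiles should be close to their Gaussian counterparts.</p>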
<h2 id="linear-regression">Linear regression</h2>
<p>One neat feature of the Gaussian distribution is that any <em>linear combination</em> of independent normally distributed random variables is itself normally distributed. We may write the linear regression problem in matrix form, which makes apparent that $\mathbf{y}$ is a weighted linear combination of the columns of $\mathbf{X}$ plus the errors. Specifically, if we have $n$ data points, we have a system of $n$ equations, which we can write more concisely in matrix notation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{pmatrix} &=
\begin{pmatrix}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_n
\end{pmatrix} \cdot
\begin{pmatrix}
b_0 \\
b_1
\end{pmatrix} +
\begin{pmatrix}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_n \\
\end{pmatrix} \\[1em]
\mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{\epsilon}
\end{aligned} %]]></script>
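<p>The matrix form translates almost directly into code. The following sketch builds the design matrix $\mathbf{X}$ with a column of ones for the intercept, solves the least squares problem via $\mathbf{b} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$, and compares the result with R’s <code>lm</code>. The simulated data and the true values $b_0 = 2$ and $b_1 = 0.5$ are assumptions made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 1)  # hypothetical true values b0 = 2, b1 = 0.5

# design matrix with a column of ones for the intercept
X <- cbind(1, x)

# solve the normal equations (X'X) b = X'y instead of inverting X'X explicitly
b <- solve(crossprod(X), crossprod(X, y))

# compare with R's built-in linear model fit
fit <- lm(y ~ x)
cbind(normal_equations = drop(b), lm = coef(fit))</code></pre></figure>
<p>Solving the linear system with <code>solve(A, b)</code> rather than computing the inverse is usually the numerically more stable choice, although for a small problem like this one the difference is negligible.</p>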
<div style="float: left; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-9-1.png" title="Illustrates that linear regression assumes that the conditional distribution of the response y given the features x is a Gaussian distribution." alt="Illustrates that linear regression assumes that the conditional distribution of the response y given the features x is a Gaussian distribution." style="display: block; margin: auto;" />
</div>
<p>Due to this linearity, the assumption of normally distributed errors propagates and results in a <em>conditional normal distribution</em> of $\mathbf{y}$, that is,</p>
<script type="math/tex; mode=display">y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \mathbf{b}, \sigma^2) \enspace .</script>
<p>In other words, the probability density of a particular point $y_i$ is given by</p>
<script type="math/tex; mode=display">\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{1}{2\sigma^2} (y_i - \mathbf{x}_i^T \mathbf{b})^2\right) \enspace ,</script>
<p>which is visualized in the Figure on the left. Intuitively, the smaller the error variance, the better the fit.</p>
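<p>The density above can be evaluated directly. The sketch below writes out the formula for a single observation and checks it against R’s built-in <code>dnorm</code>. All parameter values and the observation itself are hypothetical and chosen only for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># hypothetical values, chosen only for illustration
b <- c(2, 0.5)   # intercept b0 and slope b1
sigma <- 1.5     # error standard deviation
x_i <- c(1, 4)   # feature vector with a leading 1 for the intercept
y_i <- 3.2       # an observed response

mu_i <- sum(x_i * b)  # conditional mean x_i^T b

# density written out as in the formula above
dens_manual <- 1 / sqrt(2 * pi * sigma^2) * exp(-(y_i - mu_i)^2 / (2 * sigma^2))

# the same density via dnorm
dens_dnorm <- dnorm(y_i, mean = mu_i, sd = sigma)

all.equal(dens_manual, dens_dnorm)</code></pre></figure>
<p>Summing the logarithm of this density over all observations gives the Gaussian log-likelihood, whose maximization with respect to $\mathbf{b}$ leads back to least squares.</p>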
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, we have discussed the mother of all curve fitting problems — fitting a straight line to data points — in great detail. On this journey, we have met the method of least squares, a pillar of statistical thinking. We have seen how Gauss arrived at “his” distribution, and how Laplace gave it a beautiful justification in terms of the central limit theorem. With this, it was only a small step towards linear regression, one of the most important tools in statistics and machine learning.</p>
<hr />
<p><em>I would like to thank Don van den Bergh and Jonas Haslbeck for helpful comments on this blogpost.</em></p>
<h2 id="post-scriptum-i-linking-correlation-to-regression">Post Scriptum I: Linking correlation to regression</h2>
<p>It is a <em>truism</em> that correlation does not imply causation. <a href="https://stats.stackexchange.com/questions/376920/the-book-of-why-by-judea-pearl-why-is-he-bashing-statistics">Some</a> believe that linear regression is a causal model. This is not true. To see this, we can relate the regression coefficient in simple linear regression to correlation: they differ only in standardization.</p>
<p>Assuming mean-centered data, the sample Pearson correlation is defined as</p>
<script type="math/tex; mode=display">r_{xy} = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2}} \enspace .</script>
<p>Note that correlation is symmetric: it does not matter whether we correlate $x$ with $y$, or $y$ with $x$. In contrast, regression is not symmetric. In the main text, we have used $x$ to predict $y$, which yielded</p>
<script type="math/tex; mode=display">b_{xy} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2} \enspace .</script>
<p>If we were to use $y$ to predict $x$, the coefficient would be</p>
<script type="math/tex; mode=display">b_{yx} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n y_i^2} \neq b_{xy} \neq r_{xy} \enspace .</script>
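<p>However, by <em>standardizing</em> the data, that is, by dividing the variables by their respective standard deviations, the regression coefficient becomes the sample correlation. Below, we divide by the square roots of the sums of squares instead; this rescales both variables by the same constant and therefore leaves the coefficient unchanged, i.e.,</p>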
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial L}{\partial b_{xy}} &= \frac{\partial}{\partial b_{xy}} \sum_{i=1}^n\left(\frac{y_i}{\sqrt{\sum_{i=1}^n y_i^2}} - b_{xy} \frac{x_i}{\sqrt{\sum_{i=1}^n x_i^2}} \right)^2 \\[0.5em]
&= \frac{\partial}{\partial b_{xy}} \left( \frac{\sum_{i=1}^n y_i^2}{\sum_{i=1}^n y_i^2} - 2 b_{xy} \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} + b_{xy}^2 \frac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2}\right)\\[0.5em]
&= 0 - 2 \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} + 2 b_{xy} \\[0.5em]
2 b_{xy} &= 2 \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} \\[0.5em]
b_{xy} &= \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} \enspace,
\end{aligned} %]]></script>
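<p>which is equal to $r_{xy}$ once we set the derivative to zero and solve for $b_{xy}$. This <em>standardized</em> regression coefficient can also be obtained by multiplying the <em>raw</em> regression coefficient by the ratio of the standard deviations of $x$ and $y$, i.e.,</p>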
<script type="math/tex; mode=display">b_s = b_{xy} \times \frac{\sqrt{\sum_{i=1}^n x_i^2}}{\sqrt{\sum_{i=1}^n y_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2} \times \frac{\sqrt{\sum_{i=1}^n x_i^2}}{\sqrt{\sum_{i=1}^n y_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}}.</script>
<p>$b_{yx}$ can be standardized in a similar way, such that</p>
<script type="math/tex; mode=display">b_s = b_{yx} \times \frac{\sqrt{\sum_{i=1}^n y_i^2}}{\sqrt{\sum_{i=1}^n x_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n y_i^2} \times \frac{\sqrt{\sum_{i=1}^n y_i^2}}{\sqrt{\sum_{i=1}^n x_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} \enspace .</script>
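<p>These relations are easy to verify numerically. The sketch below simulates mean-centered data, computes the correlation and both raw regression coefficients from the formulas above, and standardizes the coefficients. The simulated data and the variable names are assumptions made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100
x <- rnorm(n)
y <- 0.6 * x + rnorm(n)  # hypothetical linear relation

# mean-center both variables, as assumed in the formulas above
x <- x - mean(x)
y <- y - mean(y)

r_xy <- sum(x * y) / sqrt(sum(x^2) * sum(y^2))  # sample correlation
b_xy <- sum(x * y) / sum(x^2)                   # using x to predict y
b_yx <- sum(x * y) / sum(y^2)                   # using y to predict x

# standardizing either raw coefficient recovers the correlation
b_s_xy <- b_xy * sqrt(sum(x^2)) / sqrt(sum(y^2))
b_s_yx <- b_yx * sqrt(sum(y^2)) / sqrt(sum(x^2))

c(r_xy = r_xy, cor = cor(x, y), b_s_xy = b_s_xy, b_s_yx = b_s_yx)
c(b_xy = b_xy, b_yx = b_yx)</code></pre></figure>
<p>The first vector should contain four identical values, whereas the two raw coefficients in the second vector generally differ from each other and from the correlation.</p>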
<!-- ## Post Scriptum II: Why $\pi$? -->
<!-- You might wonder why there is a $\pi$ in the expression of the normal distribution. The mathematical reason for this is that -->
<!-- $$ -->
<!-- \int_{-\infty}^{\infty} e^{-x^2 / 2} \, \mathrm{d}x = \sqrt{2\pi} \enspace , -->
<!-- $$ -->
<!-- shown by Laplace. -->
<!-- Because the proof is so cool, I reproduce it here. As is so often the case in mathematics, writing things more complicated can help[^9], i.e., -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \left(\int_{-\infty}^{\infty} e^{-x^2 / 2} \, \mathrm{d}x \right) \left(\int_{-\infty}^{\infty} e^{-x^2 / 2} \, \mathrm{d}x \right) &= \left(\int_{-\infty}^{\infty} e^{-x^2 / 2} \, \mathrm{d}x \right) \left(\int_{-\infty}^{\infty} e^{-y^2 / 2} \, \mathrm{d}y \right) \\[1em] -->
<!-- &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\frac{x^2 + y^2}{2}} \, \mathrm{d}x \mathrm{d}y \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- Note that we can describe a point in the plane using polar coordinates, i.e., $x = r \, \text{cos}\,\theta$ and $y = r \, \text{sin}\,\theta$, where $r$ is the distance of $(x, y)$ to the origin and $\theta \in [0, 2\pi)$ is the angle. The Jacobian matrix of this transformation is -->
<!-- $$ -->
<!-- \frac{d(x, y)}{d(r, \theta)} = \begin{pmatrix} \text{cos}\,\theta & -r \, \text{sin}\, \theta\\ \text{sin}\,\theta & r \, \text{cos}\,\theta \end{pmatrix} \enspace , -->
<!-- $$ -->
<!-- which has determinant $r$. Noting that $x^2 + y^2 = r^2$, we continue with -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \int_{0}^{2\pi} \int_{0}^{\infty} r \cdot e^{-r^2 / 2} \, \mathrm{d}r \mathrm{d}\theta = \int_{0}^{\infty} -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- ## Post Scriptum II: Curves and regularization -->
<!-- In the blog post, I have hinted at the fact that adding polynomials in order to fit the data perfectly generalizes poorly to unseen data. We can show this with a simple example. Assume we observe $n = 10$ data points from the following polynomial function -->
<!-- $$ -->
<!-- y = 2 + 0.25 x -0.75 x^2 \enspace . -->
<!-- $$ -->
<!-- ```{r, echo = FALSE} -->
<!-- gen_y <- function(x, sd_err = NULL) { -->
<!-- y <- 2 + 3 * x - 0.2 * x^2 -->
<!-- if (!is.null(sd_err)) { y <- y + rnorm(length(x), 0, sd_err) } -->
<!-- y -->
<!-- } -->
<!-- get_coefs <- function(y, X) solve(t(X) %*% X) %*% t(X) %*% y -->
<!-- get_prederr <- function(test_set, train_set) { -->
<!-- } -->
<!-- n_train <- 50 -->
<!-- n_test <- 150 -->
<!-- x <- seq(0, 10, length.out = n_train + n_test) -->
<!-- data_sets <- t(replicate(n = 1000, gen_y(x, sd_err = 1))) -->
<!-- train_sets <- data_sets[, seq(n_train)] -->
<!-- test_sets <- data_sets[, -seq(n_train)] -->
<!-- plot(1, type = "n", xlim = c(0, 10), ylim = c(0, 15), -->
<!-- bty = "n", xlab = "x", ylab = "y", main = 'True function' -->
<!-- ) -->
<!-- lines(x, gen_y(x), pch = 20, col = 'skyblue') -->
<!-- x <- seq(0, 10) -->
<!-- ``` -->
<hr />
<h2 id="references">References</h2>
<ul>
<li>
<p>Blitzstein, J. K., & Hwang, J. (2014). <em>Introduction to Probability</em>. Chapman and Hall/CRC. [<a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573">Link</a>]</p>
</li>
<li>
<p>Ford, M. (2018). <em>Architects of Intelligence: The truth about AI from the people building it</em>. Packt Publishing. [<a href="https://www.amazon.com/Architects-Intelligence-truth-people-building/dp/1789131510/ref=sr_1_1?s=books&ie=UTF8&qid=1546765292&sr=1-1&keywords=martin+ford">Link</a>]</p>
</li>
<li>
<p>Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. <em>Phil. Trans. R. Soc. Lond. A</em>, <em>222(594-604)</em>, 309-368. [<a href="https://royalsocietypublishing.org/doi/abs/10.1098/rsta.1922.0009">Link</a>]</p>
</li>
<li>
<p>Galton, F. (1889). <em>Natural Inheritance</em>. London, UK: Richard Clay and Sons. [<a href="http://galton.org/books/natural-inheritance/pdf/galton-nat-inh-1up-clean.pdf">Link</a>]</p>
</li>
<li>
<p>Pearson, K. (1920). Notes on the history of correlation. <em>Biometrika</em>, <em>13(1)</em>, 25-45. [<a href="https://www.jstor.org/stable/2331722?seq=1#metadata_info_tab_contents">Link</a>]</p>
</li>
<li>
<p>Stahl, S. (2006). The evolution of the normal distribution. <em>Mathematics Magazine</em>, <em>79(2)</em>, 96-113. [<a href="https://www.tandfonline.com/doi/abs/10.1080/0025570X.2006.11953386?journalCode=umma20">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (1980). Stigler’s Law of Eponymy. <em>Transactions of the New York Academy of Sciences</em>, <em>39(1 Series II)</em>, 147-157. [<a href="https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/j.2164-0947.1980.tb02775.x">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (1981). Gauss and the invention of least squares. <em>The Annals of Statistics</em>, <em>9(3)</em>, 465-474. [<a href="https://projecteuclid.org/download/pdf_1/euclid.aos/1176345451">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (1986). <em>The history of statistics: The measurement of uncertainty before 1900</em>. Harvard University Press. [<a href="http://www.hup.harvard.edu/catalog.php?isbn=9780674403413">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (2007). The epic story of maximum likelihood. <em>Statistical Science</em>, <em>22(4)</em>, 598-620. [<a href="https://www.jstor.org/stable/27645865?casa_token=QqFTvsgYX0MAAAAA:VfdvDgUOdMH95y5V-d9YQ4P1SemlxCU7Xrx-9OIEG4EN69iIU3L7yU5q4XIewzYjPhpDzKFh-LbJk6X6RiogDo_2fw4kI0Q_Tl5GSgBvaTdzwGGHTj_xQQ&seq=1#metadata_info_tab_contents">Link</a>]</p>
</li>
<li>
<p>Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. <em>Perspectives on Psychological Science</em>, <em>12(6)</em>, 1100-1122. [<a href="https://journals.sagepub.com/doi/abs/10.1177/1745691617693393?casa_token=FaEkfz8xxLMAAAAA%3AxO7ygcT8h8GVYPqizcJ8Mt3spZ8vinhA4yGQ_j1w_-HwjqZ04-yphCnCsC0j0S2xghh5DR69ppb3od4">Link</a>]</p>
</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>See <a href="https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/">this</a> interview and Martin Ford’s <a href="https://www.amazon.com/Architects-Intelligence-truth-people-building/dp/1789131510/ref=sr_1_1?s=books&ie=UTF8&qid=1546765292&sr=1-1&keywords=martin+ford">new book</a> in which he interviews 23 leading voices in AI, one of them being Pearl. (The comment about curve fitting is on p. 366). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This blog post was in part inspired by Neil Lawrence’s talk at the Gaussian Process summer school last year. You can view it <a href="http://gpss.cc/gpss18/program">here</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I think it is fair to say that this ‘error’ is a catch-all term, expressing our ignorance or lack of knowledge. <a href="https://www.bayesianspectacles.org/a-galton-board-demonstration-of-why-all-statistical-models-are-misspecified/">This</a> blog post argues that, by virtue of introducing error, all statistical models are misspecified. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>The generalization to higher dimensions is straightforward. Instead of simple derivatives, we have to take partial derivatives, one for each parameter $b$. When expressed in matrix form, the solution yields $\mathbf{b} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$, where $\mathbf{x}$ is an $n \times p$ matrix and $\mathbf{b}$ is a $p \times 1$ vector. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The generalization to higher dimensions is straightforward. Simply let $\mathbf{x}$ be an $n \times p$ matrix and $\mathbf{b}$ be a $p \times 1$ vector, where $p$ is the number of dimensions. The derivation is exactly the same, except that because $\mathbf{x}^T \mathbf{x}$ is not a scalar anymore but a matrix, we have to left-multiply by its inverse. This yields $\mathbf{b} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>To arrive at this maximization criterion, Gauss used Bayes’ rule. Inspired by Laplace, he assumed a uniform prior over the parameters and chose the mode of the posterior distribution over the parameters as his estimator; this is equivalent to maximum likelihood estimation, see also Stigler (2007). <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Note that while I have mostly talked about “fitting lines” or “predicting values”, the words used during the discovery of least squares were “summarizing” or “aggregating” observations. For instance, they would say that, absent other information, the mean summarizes the data best while I would, using the language of this blog post, be forced to say the mean predicts the data best. I think that “summarizing” is more adequate than “predicting”, especially since we are not predicting out of sample (see also Yarkoni & Westfall, <a href="https://journals.sagepub.com/doi/abs/10.1177/1745691617693393?casa_token=HBqivCFyDcUAAAAA%3ABMKLq2EDzASwBuP5yNRBXk45iblKe1RJ9-lBSI3sR70ATw28R7gilW1s30iDIgW8QYonpDqxs14J9w">2017</a>). <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Gauss introduced a <a href="https://en.wikipedia.org/wiki/Normal_distribution#History">different parameterization</a> using <em>precision</em> instead of variance. The parameterization using variance was introduced by Fisher. (Karl Pearson used the standard deviation before.) <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>The law states that “no scientific discovery is named after its original discoverer” (Stigler, <a href="https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/j.2164-0947.1980.tb02775.x">1980</a>, p. 147). <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>The proof is not difficult, and can be found in Blitzstein & Hwang (<a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573">2014</a>, p. 436), which is an amazing book on probability theory full of <a href="https://twitter.com/fdabl/status/981106227143954432">gems</a>. I wholeheartedly recommend working through Blitzstein’s <a href="https://projects.iq.harvard.edu/stat110/youtube">Stat 110</a> class — it’s one of the best classes I ever took (online, and in general). <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderJudea Pearl said that much of machine learning is just curve fitting1 — but it is quite impressive how far you can get with that, isn’t it? In this blog post, we will look at the mother of all curve fitting problems: fitting a straight line to a number of points. In doing so, we will engage in some statistical detective work and discover the methods of least squares as well as the Gaussian distribution.2 Fitting a line A straight line in the Euclidean plane is described by an intercept () and a slope ($b_1$), i.e., We are interested in finding the values for $(b_0, b_1)$, and so we must collect data points $d = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Data collection is often tedious, so let’s do it one point at a time. The first point is $P_1 = (x_1, y_1) = (1, 2)$, and if we plug the values into the equation for the line (i.e., set $x_1 = 1$ and $y_1 = 2$), we get an equation with two unknowns. We call this system of equations underdetermined because we cannot uniquely solve for $b_0$ and $b_1$, but we will have a number of solutions all for which $b_1 = 2 - b_0$; see Figure 1 on the left. However, if we add another point $P_2 = (3, 1)$, the resulting system of equations becomes We have two equations in two unknowns, and this is determined or identified: there is a unique solution for $b_0$ and $b_1$. After some rearranging, we find $b_1 = -0.5$ and $b_0 = 2.5$. This specifies exactly one line, as you can see in Figure 2 on the right. We could end the blog post here, but that would not be particularly insightful for data analysis problems in which we have more data points. Thus, let’s see where it takes us when we add another point, $P_3 = (2, 2)$. The resulting system of equations becomes which is overdetermined — we cannot fit a line that passes through all three of these points. We can, for example, fit three separate lines, given by two out of three of the equations; see Figure 3. 
But which of these lines, if any, is the “best” line? (Over)Fitting a curve Lacking justification to choose between any of these three, we could reduce the case to one that we have solved already, which is usually a good strategy in mathematics. In particular, we could try to reduce the overdetermined to the determined case. Above, we noticed that we can exactly solve for the two parameters $(b_0, b_1)$ using two data points ${P_1, P_2}$. This generalizes such that we can exactly solve for $p$ parameters using $p$ data points. In the problem above, we have three data points, but only two parameters. Let’s bend the notion of a line a bit — call it curve — and introduce a third parameter $b_2$. But what multiplies this parameter $b_2$ in our equations? It seems we are missing a dimension. To amend this, let’s add a dimension by simply squaring the $x$ coordinate such that a new point becomes $P_1’ = (y_1, x_1, x_1^2)$. The resulting system of equations is To simplify notation, we can write these equations in matrix algebra. Specifically, we write where we are again interested in solving for the unknown $\mathbf{b}$. Because this system is determined, we can arrive at the solution by inverting the matrix $\mathbf{X}$, such that $\mathbf{b} = \mathbf{X}^{-1}\mathbf{y}$, where $\mathbf{X}^{-1}$ is the inverse of $\mathbf{X}$. The resulting “line” is shown in Figure 4 on the left. There are two issues with this approach. First, it leads to overfitting, that is, while we explain the data at hand well (in fact, we do so perfectly), it might poorly generalize to new data. For example, this curve is so peculiar (and it would get much more peculiar if we had fitted it to more data in the same way) that it is likely that new points lie far away from it. Second, we haven’t really explained anything. In the words of the great R.A. Fisher: “[T]he objective of statistical methods is the reduction of data. 
A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent […] the relevant information contained in the original data.” (Fisher, 1922, p. 311) By introducing as many parameters as we have data points, no such reduction has taken place. So let’s go back to our original problem: which line should we draw through three, or more generally, any number of points $n$? Legendre and the “best fit” Reducing the overdetermined to the determined case did not really work. But there is still one option: reducing it to the underdetermined case. To achieve that, we make the reasonable assumption that each observation is corrupted by noise, such that where ($\epsilon_1$, $\epsilon_2$, $\epsilon_3$) are unobserved quantities3. This introduces another $n$ unknowns! Therefore, we have too few equations for too many parameters — our previously overdetermined system becomes underdetermined. However, as we saw above, we cannot uniquely solve an underdetermined system of equations. We have to add more constraints. Adrien-Marie Legendre, of whom favourable pictures are difficult to find, proposed what has become known as the methods of least squares to solve this problem: “Of all the principles which can be proposed for that purpose, I think there is none more general, more exact, and more easy of application, that of which we made use in the preceding researches, and which consists of rendering the sum of squares of the errors a minimum. By this means, there is established among the errors a sort of equilibrium which, preventing the extremes from exerting an undue influence, is very well fitted to reveal that state of the system which most nearly approaches the truth. (Legendre, 1805, p. 72-73) There is only one line that minimizes the sum of squared errors. Thus, by adding this constraint we can uniquely solve the underdetermined system; see the Figure on the right. 
The development of least squares was a watershed moment in mathematical statistics — Stephen Stigler likens its importance to the development of calculus in mathematics (Stigler, 1986, p. 11). We have now seen, conceptually, how to find the “best” fitting line. But how do we do it mathematically? How do we arrive at the Figure on the right? I will illustrate two approaches: the one proposed by Legendre using optimization, and another one using a geometrical insight. Least squares I: Optimization Our goal is to find the line that minimizes the sum of squared errors. To simplify, we center the data by subtracting the mean from $y$ and $x$, respectively; i.e., $y’ = y - \frac{1}{n} \sum_{i=1}^n y_i$ and $x’ = x - \frac{1}{n} \sum_{i=1}^n x_i$. This makes it such that the intercept is zero, $b_0 = 0$, and we avoid the need to estimate it. In the following, to avoid cluttering notation, I will omit the apostrophe and assume both $y$ and $x$ are mean-centered. For a particular observation $y_i$, our line predicts it to be $x_i b_1$. This implies that the error is $\epsilon_i = y_i - x_i b_1$, and the sum of all squared errors is We want to find the value for $b_1$, call it $\hat{b}_1$, that minimizes this quantity; that is, we must solve We could use fancy algorithms like gradient descent, but we can also engage in some good old high school mathematics and minimize the expression analytically. We note that the expression is quadratic, and thus has a single minimum, and this happens when the derivative is zero. Alas, to work! where $\sum_{i=1}^n y_i x_i$ is the (scaled by $n$) covariance between x and y, and $\sum_{i=1}^n x_i^2$ is the (scaled by $n$) variance of x.4 Least squares II: Projection Another way to think about this problem is geometrically. This requires some linear algebra, and so we better write the system of equations in matrix form. For ease of exposure, we again mean-center the data. 
First, note that the errors in matrix form yield and that the errors are perpendicular to the x-axis, that is, they are at a 90 degree angle of each other; see the Figure on the left. This means that the dot product of the vector of errors and the x-axis points is zero, i.e., or $\mathbf{x}^T \mathbf{\epsilon} = 0$, in short. Using this geometric insight, we can derive the least squares solution as follows which yields the same result as above.5 As an important special case, note the least square solution to a system of equations with only the intercept $b_0$ as unknown, i.e., $y_i = b_0$, yields the mean of $y$. It is this fact that Gauss used to justify the Gaussian distribution as an error distribution, see below. Gauss, Laplace, and “how good is best?” The method of least squares yields the “best” fitting line in the sense that it minimizes the sum of squared errors. But without any statements about the stochastic nature of the errors $\mathbf{\epsilon}$, the question of “how good is best?” remains unanswered. It was Carl Friedrich Gauss who in 1809 couched the least squares problem in probabilistic terms. Specifically, he assumed that each error term $\epsilon_i$ comes from some distribution $\phi$. Using this distribution, the probability (density) of a particular $\epsilon_i$ is large when $\epsilon_i$ is small, that is, when observed and predicted value are close together. Further assuming that the errors are independent and identically distributed, he wanted to find the parameter values which maximize that is, maximize the probability of the errors being small (see also Stigler, 1986, p. 141).6 All that is left now is to find the distribution $\phi$. Gauss noticed that he could make some general statements about $\phi$, namely that it should be symmetric and have its maximum at 0. 
He then assumed that the mean should be the best value for summarizing $n$ measurements $(y_1, \ldots, y_n)$; that is, he assumed that maximizing $\Omega$ should lead to the same solution as minimizing the sum of squared errors when we have one unknown.[7] With this circularity (to justify least squares, I assume least squares) he proved that the distribution must be of the form

$$\phi(\epsilon_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right) \enspace ,$$

where $\sigma^2$ is the variance[8]; see the Figure below for three examples. The distribution has become known as the Gaussian distribution, although, in the spirit of Stigler’s law of eponymy[9], de Moivre and Laplace had discovered it before Gauss (see also Stahl, 2006). Karl Pearson popularized the term normal distribution, an act for which he seems to have shown some regret:

“Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ‘abnormal’.” (Pearson, 1920, p. 25)

Using the Gaussian distribution, the maximization problem becomes

$$\Omega = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{\epsilon_i^2}{2\sigma^2}\right) = \left(2\pi\sigma^2\right)^{-\frac{n}{2}} \exp\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n \epsilon_i^2\right) \enspace .$$

Note that the value at which $\Omega$ takes its maximum does not change when we drop the constants and take logarithms. This leaves $-\sum_{i=1}^n \epsilon_i^2$ as the expression to be maximized, which is the same as minimizing its negation, that is, minimizing the sum of squared errors.

The “Newton of France”, Pierre-Simon de Laplace, took notice of Gauss’ argument in 1810 and rushed to give the Gaussian error curve a much more beautiful justification. If we take the errors to be themselves aggregates of many (tiny) perturbing influences, then they will be normally distributed by the central limit theorem. So what is this central limit theorem, anyway?

The central limit theorem

The central limit theorem is one of the most stunning theorems of statistics.
In the poetic words of Francis Galton:

“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the ‘Law of Frequency of Error’. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66)

The theorem basically says that if you have a sequence of independent and identically distributed random variables (what Galton calls “the mob”), and if these variables have finite variance, then the mean of this sequence, as $n$ grows larger and larger (“the greater the apparent anarchy”), will get closer and closer to a normal distribution. As $n \rightarrow \infty$, the (appropriately standardized) mean in fact converges in distribution to the normal distribution.[10]

Laplace realized that, if one takes the errors in the least squares problem to be themselves aggregates (i.e., means) of small influences, then they will be normally distributed. This provides an elegant justification for the least squares solution. To illustrate, assume that a particular error $\epsilon_i$ is in fact the average of $m = 500$ small irregularities that are independent and identically distributed; for instance, assume these influences follow a uniform distribution. Let’s say we have $n = 200$ observations, thus 200 individual errors. The R code and Figure below illustrate that the error distribution will tend to be Gaussian.

[1] See this interview and Martin Ford’s new book in which he interviews 23 leading voices in AI, one of them being Pearl. (The comment about curve fitting is on p. 366).
[2] This blog post was in part inspired by Neil Lawrence’s talk at the Gaussian Process summer school last year. You can view it here.

[3] I think it is fair to say that this ‘error’ is a catch-all term, expressing our ignorance or lack of knowledge. This blog post argues that, by virtue of introducing error, all statistical models are misspecified.

[4] The generalization to higher dimensions is straightforward. Instead of simple derivatives, we have to take partial derivatives, one for each parameter $b$. When expressed in matrix form, the solution is $\mathbf{b} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$, where $\mathbf{x}$ is an $n \times p$ matrix and $\mathbf{b}$ is a $p \times 1$ vector.

[5] The generalization to higher dimensions is straightforward. Simply let $\mathbf{x}$ be an $n \times p$ matrix and $\mathbf{b}$ be a $p \times 1$ vector, where $p$ is the number of dimensions. The derivation is exactly the same, except that because $\mathbf{x}^T \mathbf{x}$ is not a scalar anymore but a matrix, we have to left-multiply by its inverse. This yields $\mathbf{b} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T \mathbf{y}$.

[6] To arrive at this maximization criterion, Gauss used Bayes’ rule. Inspired by Laplace, he assumed a uniform prior over the parameters and chose the mode of the posterior distribution over the parameters as his estimator; this is equivalent to maximum likelihood estimation, see also Stigler (2008).

[7] Note that while I have mostly talked about “fitting lines” or “predicting values”, the words used during the discovery of least squares were “summarizing” or “aggregating” observations. For instance, they would say that, absent other information, the mean summarizes the data best while I would, using the language of this blog post, be forced to say the mean predicts the data best. I think that “summarizing” is more adequate than “predicting”, especially since we are not predicting out of sample (see also Yarkoni & Westfall, 2017).
[8] Gauss introduced a different parameterization using precision instead of variance. The parameterization using variance was introduced by Fisher. (Karl Pearson used the standard deviation before.)

[9] The law states that “no scientific discovery is named after its original discoverer” (Stigler, 1980, p. 147).

[10] The proof is not difficult, and can be found in Blitzstein & Hwang (2014, p. 436), which is an amazing book on probability theory full of gems. I wholeheartedly recommend working through Blitzstein’s Stat 110 class; it’s one of the best classes I ever took (online, and in general).
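The central limit theorem illustration described in the main text (each error as the average of $m = 500$ uniform influences, with $n = 200$ observations) could be simulated roughly as follows. This is a minimal sketch, not the original code of the post: the seed, the range of the uniform distribution, the plot settings, and the names `errors` and `sd_clt` are all assumptions made for illustration.

```r
set.seed(2019)  # arbitrary seed, for reproducibility only

n <- 200  # number of observations, i.e., individual errors
m <- 500  # number of small influences averaged into each error

# Each error is the mean of m independent draws from a uniform
# distribution on [-1, 1] (the range is an assumption)
errors <- replicate(n, mean(runif(m, min = -1, max = 1)))

# Normal approximation implied by the central limit theorem:
# a Uniform(-1, 1) variable has mean 0 and variance 1/3, so the
# mean of m such variables has standard deviation sqrt(1/3) / sqrt(m)
sd_clt <- sqrt(1 / 3) / sqrt(m)

# Histogram of the simulated errors with the normal density overlaid
hist(errors, breaks = 20, freq = FALSE,
     main = "Errors as averages of uniform influences",
     xlab = "Error")
curve(dnorm(x, mean = 0, sd = sd_clt), add = TRUE, lwd = 2)
```

With these settings the histogram should look approximately bell-shaped and track the overlaid normal density, even though every individual influence is uniformly distributed.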