<h1 id="infectious-diseases-and-nonlinear-differential-equations">Infectious diseases and nonlinear differential equations</h1>
<p><em>Fabian Dablander, 2020-03-22</em></p>
<p>Last summer, I wrote about <a href="https://fabiandablander.com/r/Linear-Love.html">love affairs and linear differential equations</a>. While the topic is cheerful, linear differential equations are severely limited in the types of behaviour they can model. In this blog post, which I wrote in self-quarantine to prevent further spread of SARS-CoV-2 — take that, cheerfulness — I introduce nonlinear differential equations as a means to model infectious diseases. In particular, we will discuss the simple SIR and SIRS models, the building blocks of many of the more complicated models used in epidemiology.</p>
<p>Before doing so, however, I discuss some of the basic tools of nonlinear dynamics applied to the logistic equation as a model for population growth. If you are already familiar with this, you can skip ahead. If you have had no prior experience with differential equations, I suggest you first check out my <a href="https://fabiandablander.com/r/Linear-Love.html">earlier post</a> on the topic.</p>
<p>I should preface this by saying that I am not an epidemiologist, and that no analysis I present here is specifically related to the current SARS-CoV-2 pandemic, nor should anything I say be interpreted as giving advice or making predictions. I am merely interested in differential equations, and as with love affairs, infectious diseases make a good illustrating case. So without further ado, let’s dive in!</p>
<h1 id="modeling-population-growth">Modeling Population Growth</h1>
<p>Before we start modeling infectious diseases, it pays to introduce the tools required to analyze nonlinear differential equations using a simple example: modeling population growth. Let $N > 0$ denote the size of a population and assume that its growth depends on itself:</p>
<script type="math/tex; mode=display">\frac{dN}{dt} = \dot{N} = r N \enspace .</script>
<p>As shown in a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>, this leads to exponential growth for $r > 0$:</p>
<script type="math/tex; mode=display">N(t) = N_0 e^{r t} \enspace ,</script>
<p>where $N_0 = N(0)$ is the initial population size at time $t = 0$. The figure below visualizes the differential equation (left panel) and its solution (right panel) for $r = 1$ and an initial population of $N_0 = 2$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>This is clearly not a realistic model since the growth of a population depends on resources, which are finite. To model finite resources, we write:</p>
<script type="math/tex; mode=display">\dot{N} = rN \left(1 - \frac{N}{K}\right) \enspace ,</script>
<p>where $r > 0$ and $K$ is the so-called <em>carrying capacity</em>, that is, the maximum population size that can be sustained by the available resources. Observe that as long as $N < K$, the factor $(1 - N / K)$ shrinks as $N$ grows, slowing down the growth rate $\dot{N}$. If on the other hand $N > K$, the population needs more resources than are available, and the growth rate becomes negative, resulting in population decline.</p>
<p>For simplicity, let $K = 1$ and interpret $N \in [0, 1]$ as the proportion with respect to the carrying capacity; that is, $N = 1$ implies that we are at carrying capacity. The figure below visualizes the differential equation and its solution for $r = 1$ and an initial condition $N_0 = 0.10$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In contrast to exponential growth, the logistic equation leads to sigmoidal growth which approaches the carrying capacity. This is much more interesting behaviour than the linear differential equation above allows. In particular, the logistic equation has two <em>fixed points</em> — points at which the population neither increases nor decreases but stays fixed, that is, where $\dot{N} = 0$. These occur at $N = 0$ and at $N = 1$, as can be inferred from the left panel in the figure above.</p>
<h2 id="analyzing-the-stability-of-fixed-points">Analyzing the Stability of Fixed Points</h2>
<p>What is the stability of these fixed points? Intuitively, $N = 0$ should be unstable; if there are individuals, then they procreate and the population increases. Similarly, $N = 1$ should be stable: if $N < 1$, then $\dot{N} > 0$ and the population grows towards $N = 1$, and if $N > 1$, then $\dot{N} < 0$ and individuals die until $N = 1$.</p>
<p>To make this argument more rigorous, and to get a more quantitative assessment of how quickly perturbations move away from or towards a fixed point, we derive a differential equation for these small perturbations close to the fixed point (see also Strogatz, 2015, p. 24). Let $N^{\star}$ denote a fixed point and define $\eta(t) = N(t) - N^{\star}$ to be a small perturbation close to the fixed point. We derive a differential equation for $\eta$ by writing:</p>
<script type="math/tex; mode=display">\frac{d\eta}{dt} = \frac{d}{dt}\left(N(t) - N^{\star}\right) = \frac{dN}{dt} \enspace ,</script>
<p>since $N^{\star}$ is a constant. This implies that the dynamics of the perturbation equal the dynamics of the population. Let $f(N)$ denote the right-hand side of the differential equation for $N$ and observe that, since $N = N^{\star} + \eta$, we have $\dot{\eta} = \dot{N} = f(N) = f(N^{\star} + \eta)$. Recall that $f$ is a nonlinear function, and nonlinear functions are messy to deal with. Thus, we simply pretend that the function is linear close to the fixed point. More precisely, we approximate $f$ around the fixed point using a Taylor series (see <a href="https://www.youtube.com/watch?v=3d6DsjIBzJ4">this excellent video</a> for details) by writing:</p>
<script type="math/tex; mode=display">f(N^{\star} + \eta) = f(N^{\star}) + \eta f'(N^{\star}) + \mathcal{O}(\eta^2) \enspace ,</script>
<p>where we have ignored higher-order terms. Note that, by definition, there is no change at the fixed point, that is, $f(N^{\star}) = 0$. Assuming that $f’(N^{\star}) \neq 0$ (otherwise the higher-order terms dominate, since there is nothing else left), we have that, close to a fixed point,</p>
<script type="math/tex; mode=display">\dot{\eta} \approx \eta f'(N^{\star}) \enspace ,</script>
<p>which is a linear differential equation with solution:</p>
<script type="math/tex; mode=display">\eta(t) = \eta_0 e^{f'(N^{\star})t} \enspace .</script>
<p>Using this trick, we can assess the stability of $N^{\star}$ as follows. If $f’(N^{\star}) < 0$, the small perturbation $\eta(t)$ around the fixed point decays towards zero, and so the system returns to the fixed point — the fixed point is stable. On the other hand, if $f’(N^{\star}) > 0$, then the small perturbation $\eta(t)$ close to the fixed point grows, and so the system does not return to the fixed point — the fixed point is unstable. Applying this to our logistic equation, we see that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f'(N) &= \frac{d}{dN} \left(rN(1 - N)\right) \\[0.50em]
&= \frac{d}{dN} \left(rN - rN^2\right) \\[0.50em]
& = r - 2rN \\[0.50em]
&= r(1 - 2N) \enspace .
\end{aligned} %]]></script>
<p>Plugging in our two fixed points $N^{\star} = 0$ and $N^{\star} = 1$, we find that $f’(0) = r$ and $f’(1) = -r$. Since $r > 0$, this confirms our suspicion that $N^{\star} = 0$ is unstable and $N^{\star} = 1$ is stable. In addition, this analysis tells us how quickly the perturbations grow or decay; for the logistic equation, this is given by $r$.</p>
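This can be checked numerically. The sketch below (the function names are my own, not from the post) approximates $f’(N)$ with a central difference and recovers $f’(0) = r$ and $f’(1) = -r$ for $r = 1$:

```r
r <- 1
f <- function(N) r * N * (1 - N)  # logistic equation with K = 1

# Central-difference approximation to the derivative f'(N)
f_prime <- function(N, h = 1e-6) (f(N + h) - f(N - h)) / (2 * h)

f_prime(0)  # approximately  r: perturbations grow,  N = 0 is unstable
f_prime(1)  # approximately -r: perturbations decay, N = 1 is stable
```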
<p>In sum, we have linearized a nonlinear system close to fixed points in order to assess the stability of these fixed points, and how quickly perturbations close to these fixed points grow or decay. This technique is called <em>linear stability analysis</em>. In the next two sections, we discuss two ways to solve differential equations using the logistic equation as an example.</p>
<h2 id="analytic-solution">Analytic Solution</h2>
<p>In contrast to linear differential equations, which were the topic of a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>, nonlinear differential equations usually cannot be solved analytically; that is, we generally cannot derive an expression that, given an initial condition, tells us the state of the system at any time point $t$. The logistic equation can, however, be solved analytically, and it is instructive to see how. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dN}{dt} &= rN (1 - N) \\
\frac{dN}{N(1 - N)} &= r dt \\
\int \frac{1}{N(1 - N)} dN &= r t \enspace .
\end{aligned} %]]></script>
<p>Staring at this for a bit, we realize that we can use partial fractions to split the integral. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\int \frac{1}{N(1 - N)} dN &= r t \\[0.50em]
\int \frac{1}{N} dN + \int \frac{1}{1 - N}dN &= rt \\[0.50em]
\text{log}N - \text{log}(1 - N) + Z &= rt \\[0.50em]
e^{\text{log}N - \text{log}(1 - N) + Z} &= e^{rt} \enspace .
\end{aligned} %]]></script>
<p>The exponents and the logs cancel each other nicely. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{e^{\text{log}N}}{e^{\text{log}(1 - N)}}e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} &= e^{rt - Z} \\[0.50em]
N &= e^{rt - Z} - N e^{rt - Z} \\[0.50em]
N\left(1 + e^{rt - Z}\right) &= e^{rt - Z} \\[0.50em]
N &= \frac{e^{rt - Z}}{1 + e^{rt - Z}} \enspace .
\end{aligned} %]]></script>
<p>One last trick is to multiply both the numerator and the denominator by $e^{-rt + Z}$, which yields:</p>
<script type="math/tex; mode=display">N = \frac{\left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)}{\left(e^{-rt + Z}\right) + \left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)} = \frac{1}{1 + e^{-rt + Z}} \enspace ,</script>
<p>where $Z$ is the constant of integration. To solve for it, we need the initial condition. Suppose that $N(0) = N_0$, which, using the third line in the derivation above and the fact that $t = 0$, leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{log}N_0 - \text{log}(1 - N_0) + Z &= 0 \\[0.50em]
\text{log}N_0 - \text{log}(1 - N_0) &= -Z \\[0.50em]
\frac{N_0}{1 - N_0} &= e^{-Z} \\[0.50em]
\frac{1 - N_0}{N_0} &= e^{Z} \enspace .
\end{aligned} %]]></script>
<p>Plugging this into our solution from above yields:</p>
<script type="math/tex; mode=display">N(t) = \frac{1}{1 + e^{-rt + Z}} = \frac{1}{1 + \frac{1 - N_0}{N_0} e^{-rt}} \enspace .</script>
<p>While this was quite a hassle, other nonlinear differential equations are much, much harder to solve, and most do not admit a closed-form solution — or at least if they do, the resulting expression is generally not very intuitive. Luckily, we can compute the time-evolution of the system using numerical methods, as illustrated in the next section.</p>
<h2 id="numerical-solution">Numerical Solution</h2>
<p>A differential equation implicitly encodes how the system we model changes over time. Specifically, given a particular (potentially high-dimensional) state of the system at time point $t$, $\mathbf{x}_t$, we know in which direction and how quickly the system will change because this is exactly what is encoded in the differential equation $f = \frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t}$. This suggests the following numerical approximation: Assume we know the state of the system at a (discrete) time point $n$, denoted $\mathbf{x}_n$, and that the change in the system is constant over a small interval $\Delta t$. Then, the state of the system at time point $n + 1$ is given by:</p>
<script type="math/tex; mode=display">\mathbf{x}_{n + 1} = \mathbf{x}_n + \Delta t \cdot f(\mathbf{x}_n) \enspace .</script>
<p>$\Delta t$ is an important parameter, encoding over what time period we assume the change $f$ to be constant. We can code this up in R for the logistic equation:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_logistic</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N0</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">N0</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w">
</span><span class="n">dN</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># Euler</span><span class="w">
</span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dN</span><span class="p">(</span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">])</span><span class="w">
</span><span class="c1"># Improved Euler</span><span class="w">
</span><span class="c1"># k <- N[i-1] + delta_t * dN(N[i-1])</span><span class="w">
</span><span class="c1"># N[i] <- N[i-1] + 1 /2 * delta_t * (dN(N[i-1]) + dN(k))</span><span class="w">
</span><span class="c1"># Runge-Kutta 4th order</span><span class="w">
</span><span class="c1"># k1 <- dN(N[i-1]) * delta_t</span><span class="w">
</span><span class="c1"># k2 <- dN(N[i-1] + k1/2) * delta_t</span><span class="w">
</span><span class="c1"># k3 <- dN(N[i-1] + k2/2) * delta_t</span><span class="w">
</span><span class="c1"># k4 <- dN(N[i-1] + k3) * delta_t</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># N[i] <- N[i-1] + 1/6 * (k1 + 2*k2 + 2*k3 + k4)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">N</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Clearly, the accuracy of this approximation depends on $\Delta t$. To see how, the left panel of the figure below shows the approximation for various values of $\Delta t$, while the right panel shows the (log) absolute error as a function of (log) $\Delta t$. The error is defined as:</p>
<script type="math/tex; mode=display">E = |N(10) - \hat{N}(10)| \enspace ,</script>
<p>where $\hat{N}$ is the Euler approximation.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>The right panel approximately shows the relationship:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{log } E &\propto \text{log } \Delta t \\[0.50em]
E &\propto \Delta t \enspace .
\end{aligned} %]]></script>
<p>Therefore, the error goes down linearly with $\Delta t$. Other methods, such as the improved Euler method or <a href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods">Runge-Kutta solvers</a> (see the commented-out code above), do better. However, it is ill-advised to choose $\Delta t$ extremely small: this increases computation time, and floating-point round-off errors can accumulate over time.</p>
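A quick way to see this linear scaling is to halve $\Delta t$ and check that the error roughly halves as well. The sketch below is self-contained (the helper names are my own) and measures the error during the growth phase at $t = 2$ rather than at $t = 10$, simply because the sign of the Euler error is unambiguous there:

```r
# Analytic logistic solution, used as ground truth
analytic <- function(t, N0 = 0.10, r = 1) 1 / (1 + (1 - N0) / N0 * exp(-r * t))

# Absolute Euler error at t_end for a given step size delta_t
euler_error <- function(delta_t, N0 = 0.10, r = 1, t_end = 2) {
  N <- N0
  for (i in seq_len(round(t_end / delta_t))) {
    N <- N + delta_t * r * N * (1 - N)
  }
  abs(N - analytic(t_end, N0, r))
}

euler_error(0.2) / euler_error(0.1)  # ratio close to 2: halving delta_t halves E
```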
<p>In summary, we have seen that nonlinear differential equations can model interesting behaviour such as multiple fixed points; how to classify the stability of these fixed points using linear stability analysis; and how to numerically solve nonlinear differential equations. In the remainder of this post, we study coupled nonlinear differential equations — the SIR and SIRS models — as a way to model the spread of infectious diseases.</p>
<h1 id="modeling-infectious-diseases">Modeling Infectious Diseases</h1>
<p>Many models have been proposed as tools to understand epidemics. In the following sections, I focus on the two simplest ones: the SIR and the SIRS model (see also Hirsch, Smale, Devaney, 2013, ch. 11).</p>
<h2 id="the-sir-model">The SIR Model</h2>
<p>We use the SIR model to understand the spread of infectious diseases. The SIR model is the most basic <em>compartmental</em> model, meaning that it groups the overall population into distinct sub-populations: a susceptible population $S$, an infected population $I$, and a recovered population $R$. We make a number of further simplifying assumptions. First, we assume that the overall population is $1 = S + I + R$ so that $S$, $I$, and $R$ are proportions. We further assume that the overall population does not change, that is,</p>
<script type="math/tex; mode=display">\frac{d}{dt} \left(S + I + R\right) = 0 \enspace .</script>
<p>Second, the SIR model assumes that once a person has been infected and has recovered, the person cannot become infected again — we will relax this assumption later on. Third, the model assumes that the rate of transmission of the disease is proportional to the number of encounters between susceptible and infected persons. We model this by setting</p>
<script type="math/tex; mode=display">\frac{dS}{dt} = - \beta IS \enspace ,</script>
<p>where $\beta > 0$ is the rate of infection. Fourth, the model assumes that the growth of the recovered population is proportional to the proportion of people that are infected, that is,</p>
<script type="math/tex; mode=display">\frac{dR}{dt} = \gamma I \enspace ,</script>
<p>where $\gamma > 0$ is the recovery rate. Since the overall population is constant, these two equations naturally lead to the following equation for the infected:</p>
<script type="math/tex; mode=display">\begin{aligned}
\frac{d}{dt} \left(S + I + R\right) = 0 \\[0.50em]
\frac{dI}{dt} = - \frac{dS}{dt} - \frac{dR}{dt} \\[0.50em]
\frac{dI}{dt} = \beta IS - \gamma I \enspace .
\end{aligned}</script>
<p>where $\beta I S$ gives the proportion of newly infected individuals and $\gamma I$ gives the proportion of newly recovered individuals. Observe that, since we assumed the overall population does not change, we only need to track two of these subgroups, because $R(t) = 1 - S(t) - I(t)$. The system is therefore fully characterized by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dS}{dt} &= - \beta IS \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace .
\end{aligned} %]]></script>
<p>Before we analyze this model mathematically, let’s implement Euler’s method and visualize some trajectories.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">dimnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">)))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I0</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dS</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s1">'Set1'</span><span class="p">)</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="w">
</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Subpopulations(t)'</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Time t'</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">4000</span><span class="p">),</span><span class="w">
</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.75</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">
</span><span class="n">font.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">xaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">yaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="w">
</span><span class="m">3000</span><span class="p">,</span><span class="w"> </span><span class="m">0.65</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">),</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$ and an increasing rate of infection $\beta$ for the initial condition $S_0 = 0.95$, $I_0 = 0.05$, and $R_0 = 0$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>For $\beta = 1/8$, no outbreak occurs (left panel). Instead, the proportions of susceptible and infected people monotonically decrease, while the proportion of recovered people monotonically increases. The middle panel, on the other hand, shows a small outbreak: the proportion of infected people rises, but then falls again. The right panel shows an outbreak as well, but a more severe one, as the proportion of infected people rises more sharply before it eventually decreases again.</p>
<p>How do things change when we change the recovery rate $\gamma$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller recovery rate $\gamma = 1/12$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>We again observe no outbreak in the left panel, and outbreaks of increasing severity in the middle and right panels. Compared to the results for $\gamma = 1/8$, the outbreaks are more severe, as we would expect, since the recovery rate $\gamma = 1/12$ is lower. In fact, whether an outbreak occurs, and how severe it will be, depends not on $\beta$ and $\gamma$ alone but on their ratio. This ratio is known as $R_0 = \beta / \gamma$, pronounced “R-naught”. (Note the unfortunate clash with established notation: $R_0$ also denotes the initial proportion of recovered people; which one is meant should be clear from context.) We can think of $R_0$ as the average number of people an infected person infects before she recovers. If $R_0 > 1$, an outbreak occurs. In the next section, we look for the fixed points of this system and assess their stability.</p>
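As a quick numerical sanity check (my own sketch, using only the parameter values stated in the text), we can compute $R_0$ for the settings discussed so far:

```r
# R0 = beta / gamma: an outbreak occurs if R0 > 1
R0 <- function(beta, gamma) beta / gamma

R0(beta = 1/8, gamma = 1/8)   # 1:   no outbreak
R0(beta = 3/8, gamma = 1/8)   # 3:   outbreak
R0(beta = 3/8, gamma = 1/12)  # 4.5: more severe outbreak
```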
<h2 id="analyzing-fixed-points">Analyzing Fixed Points</h2>
<p>A glance at the above figures suggests that the SIR model allows for multiple stable states. The left panels, for example, show that if there is no outbreak, the proportion of susceptible people stays above the proportion of recovered people. If there is an outbreak, however, then it always fades and the proportion of recovered people will be higher than the proportion of susceptible people; how much higher depends on the severity of the outbreak.</p>
<p>While we could play around some more with visualisations, it pays to do a formal analysis. Note that in contrast to the logistic equation, which only modelled a single variable — population size — an analysis of the SIR model requires us to handle two variables, $S$ and $I$; the third one, $R$, follows from the assumption of a constant population size. At the fixed points, nothing changes, that is, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= - \beta IS \\[0.50em]
0 &= \beta IS - \gamma I \enspace .
\end{aligned} %]]></script>
<p>This can only happen when $I = 0$, irrespective of the value of $S$. In other words, all points $(S^{\star}, I^{\star}) = (S, 0)$ are fixed points; if nobody is infected, the disease cannot spread — and so everybody stays either susceptible or recovered. To assess the stability of these fixed points, we again derive a differential equation for the perturbations close to the fixed point. However, note that in contrast to the one-dimensional case studied above, perturbations can now be with respect to $I$ or to $S$. Let $u = S - S^{\star}$ and $v = I - I^{\star}$ be the respective perturbations, and let $\dot{S} = f(S, I)$ and $\dot{I} = g(S, I)$. We first derive a differential equation for $u$, writing:</p>
<script type="math/tex; mode=display">\dot{u} = \frac{d}{dt}\left(S - S^{\star}\right) = \dot{S} \enspace ,</script>
<p>since $S^{\star}$ is a constant. This implies that $u$ behaves as $S$. In contrast to the one-dimensional case above, we have two <em>coupled</em> differential equations, and so we have to take into account how $u$ changes as a function of both $S$ and $I$. We Taylor expand at the fixed point $(S^{\star}, I^{\star})$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\dot{u} &= f(u + S^{\star}, v + I^{\star}) \\[0.50em]
&= f(S^{\star}, I^{\star}) + u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} + \mathcal{O}(u^2, v^2, uv) \\[0.50em]
&\approx u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} \enspace ,
\end{aligned} %]]></script>
<p>since $f(S^{\star}, I^{\star}) = 0$ and we drop higher-order terms. Note that taking the partial derivative of $f$ with respect to $S$ (or $I$) yields a function, and the subscripts $(S^{\star}, I^{\star})$ mean that we evaluate this function at the fixed point $(S^{\star}, I^{\star})$. We can similarly derive a differential equation for $v$:</p>
<script type="math/tex; mode=display">\dot{v} \approx u \frac{\partial g}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial g}{\partial I}_{(S^{\star}, I^{\star})} \enspace .</script>
<p>We can write all of this concisely using matrix algebra:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
\dot{u} \\
\dot{v}
\end{pmatrix} =
\begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})}
\begin{pmatrix}
u \\
v
\end{pmatrix} \enspace , %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
J = \begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})} %]]></script>
<p>is called the <em>Jacobian matrix</em> at the fixed point $(S^{\star}, I^{\star})$. The Jacobian gives the linearized dynamics close to a fixed point, and therefore tells us how perturbations will evolve close to a fixed point.</p>
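To convince ourselves that this linearization is computed correctly, the following sketch (my own addition, not part of the original derivation; the function names are made up) approximates the Jacobian of the SIR system by central finite differences and compares it with the analytic Jacobian derived in the text:

```r
# Right-hand sides of the SIR system: f = dS/dt, g = dI/dt
beta  <- 3/8
gamma <- 1/8
f <- function(S, I) -beta * I * S
g <- function(S, I)  beta * I * S - gamma * I

# Central finite-difference approximation of the Jacobian at (S, I)
num_jacobian <- function(S, I, h = 1e-6) {
  matrix(c(
    (f(S + h, I) - f(S - h, I)) / (2 * h), (f(S, I + h) - f(S, I - h)) / (2 * h),
    (g(S + h, I) - g(S - h, I)) / (2 * h), (g(S, I + h) - g(S, I - h)) / (2 * h)
  ), nrow = 2, byrow = TRUE)
}

# Analytic Jacobian from the derivation in the text
ana_jacobian <- function(S, I) {
  matrix(c(-beta * I, -beta * S,
            beta * I,  beta * S - gamma), nrow = 2, byrow = TRUE)
}

# The two should agree up to finite-difference error
max(abs(num_jacobian(0.6, 0.2) - ana_jacobian(0.6, 0.2)))
```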
<p>In contrast to unidimensional systems, where we simply check whether the slope is positive or negative, that is, whether $f'(x^\star) < 0$ or $f'(x^\star) > 0$, the test for whether a fixed point is stable is slightly more complicated in multidimensional settings. In fact, and not surprisingly, since we have <em>linearized</em> this nonlinear differential equation, the check is the same as in <a href="https://fabiandablander.com/r/Linear-Love.html">linear systems</a>: we compute the eigenvalues $\lambda_1$ and $\lambda_2$ of $J$, observing that negative eigenvalues mean exponential decay and positive eigenvalues mean exponential growth along the directions of the respective eigenvectors. (Note that this does not work for all types of fixed points, see Strogatz (2015, p. 152).)</p>
<p>What does this mean for our SIR model? First, let’s derive the Jacobian:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
J &= \begin{pmatrix}
-\frac{\partial}{\partial S} \beta I S & -\frac{\partial }{\partial I} \beta I S \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
& =
\begin{pmatrix}
-\beta I & -\beta S \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>Evaluating this at the fixed point $(S^{\star}, I^{\star}) = (S, 0)$ results in:</p>
<script type="math/tex; mode=display">% <![CDATA[
J_{(S, 0)} = \begin{pmatrix} 0 & -\beta S \\ 0 & \beta S - \gamma \end{pmatrix} \enspace . %]]></script>
<p>Since this matrix is upper triangular — all entries below the diagonal are zero — the eigenvalues are given by the diagonal, that is, $\lambda_1 = 0$ and $\lambda_2 = \beta S - \gamma$. $\lambda_1 = 0$ implies a constant solution, while $\lambda_2 > 0$ implies exponential growth and $\lambda_2 < 0$ exponential decay of the perturbations close to the fixed point. Observe that $\lambda_2$ is not only a function of the parameters $\beta$ and $\gamma$, but also of the proportion of susceptible individuals $S$. We find that $\lambda_2 > 0$ for $S > \gamma / \beta$, which results in an unstable fixed point. On the other hand, we have that $\lambda_2 < 0$ for $S < \gamma / \beta$, which results in a stable fixed point. In the next section, we will use vector fields in order to get more intuition for the dynamics of the system.</p>
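We can check this numerically with R's built-in eigen function (a small sketch of my own, with arbitrarily chosen values for $\beta$, $\gamma$, and $S$):

```r
# Jacobian at the fixed point (S, 0)
beta  <- 3/8
gamma <- 1/8
S     <- 0.95  # here S > gamma / beta = 1/3, so the fixed point is unstable

J <- matrix(c(0, -beta * S,
              0,  beta * S - gamma), nrow = 2, byrow = TRUE)

eigen(J)$values  # beta * S - gamma and 0, matching the diagonal
```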
<h2 id="vector-field-and-nullclines">Vector Field and Nullclines</h2>
<p>A vector field shows, for any position $(S, I)$, in which direction the system moves, indicated by the direction of an arrow, and how quickly, indicated by its length. We use the R code below to visualize such a vector field and selected trajectories on it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'fields'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_vectorfield_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">))</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">SI</span><span class="p">[</span><span class="n">apply</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="c1"># S + I <= 1 must hold</span><span class="w">
</span><span class="n">dSI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">dS</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w">
</span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.2</span><span class="p">,</span><span class="w"> </span><span class="m">1.2</span><span class="p">),</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">arrow.plot</span><span class="p">(</span><span class="w">
</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w">
</span><span class="n">arrow.ex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.075</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.05</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray82'</span><span class="p">,</span><span class="w"> </span><span class="n">xpd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">cx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="n">cn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.05</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">-.075</span><span class="p">,</span><span class="w"> </span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cn</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.05</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cn</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.03</span><span class="p">,</span><span class="w"> </span><span class="m">-.04</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.03</span><span class="p">,</span><span class="w"> </span><span class="m">.975</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.995</span><span class="p">,</span><span class="w"> </span><span class="m">-0.04</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>For $\beta = 1/8$ and $\gamma = 1/8$, we know from above that no outbreak occurs. The vector field shown in the left panel below further illustrates that, since $S \leq \gamma / \beta = 1$, all fixed points $(S^{\star}, I^{\star}) = (S, 0)$ are stable. In contrast, we know that $\beta = 3/8$ and $\gamma = 1/8$ result in an outbreak. The vector field shown in the right panel below indicates that fixed points with $S > \gamma / \beta = 1/3$ are unstable, while fixed points with $S < 1/3$ are stable; the dotted line is $S = 1/3$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
<p>Can we find some structure in such vector fields? One way to “organize” them is by drawing so-called <em>nullclines</em>. In our case, the $I$-nullcline gives the set of points for which $\dot{I} = 0$, and the $S$-nullcline gives the set of points for which $\dot{S} = 0$. We find these points in a similar manner to finding fixed points, but instead of setting both $\dot{S}$ and $\dot{I}$ to zero, we tackle them one at a time.</p>
<p>The $S$-nullclines are given by the $S$- and the $I$-axes, because $\dot{S} = 0$ when $S = 0$ or when $I = 0$. Along the $I$-axis we have $\dot{I} = - \gamma I$ since $S = 0$, resulting in exponential decay of the infected population; this is indicated by the grey arrows along the $I$-axis, which become progressively shorter as they approach the origin.</p>
<p>The $I$-nullclines are given by $I = 0$ and by $S = \gamma / \beta$. For $I = 0$, we also have $\dot{S} = 0$, and so these points are fixed points. For $S = \gamma / \beta$ we have $\dot{S} = - \gamma I$, resulting in exponential decay of the susceptible population; but since $\dot{I} = 0$, the proportion of infected people does not change at that instant. This is indicated in the right vector field above, where we have horizontal arrows at the dashed line given by $S = \gamma / \beta$. However, this only holds for the briefest of moments: since $S$ decreases, we soon have $S < \gamma / \beta$ and thus $\dot{I} < 0$, and so the proportion of infected people goes down to the left of the line. Similarly, to the right of the line we have $S > \gamma / \beta$, which results in $\dot{I} > 0$, and so the proportion of infected people grows.</p>
<p>In summary, we have seen that the SIR model allows for outbreaks whenever the rate of infection exceeds the rate of recovery, that is, whenever $R_0 = \beta / \gamma > 1$. If this occurs, the proportion of infected people grows as long as $S > \gamma / \beta$. As illustrated by the vector field, the proportion of susceptible people $S$ decreases over time. At some point, therefore, we have $S < \gamma / \beta$, resulting in a decrease in the proportion of infected people until finally $I = 0$. Observe that, in the SIR model, infections always die out. In the next section, we extend the SIR model to allow diseases to become established in the population.</p>
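To see this numerically, the sketch below re-implements Euler integration for the SIR system (a minimal sketch of my own; the solve_SIR implementation used earlier in this post differs in detail) and confirms that an outbreak eventually dies out, with the proportion of susceptible people ending below $\gamma / \beta$:

```r
# Minimal Euler integration of the SIR system (re-implementation sketch)
solve_SIR_euler <- function(S0, I0, beta, gamma, delta_t = 0.01, times = 100000) {
  S <- S0
  I <- I0
  for (t in seq_len(times)) {
    dS <- -beta * I * S
    dI <-  beta * I * S - gamma * I
    S  <- S + delta_t * dS
    I  <- I + delta_t * dI
  }
  c(S = S, I = I, R = 1 - S - I)  # constant population: R = 1 - S - I
}

res <- solve_SIR_euler(S0 = 0.95, I0 = 0.05, beta = 3/8, gamma = 1/8)
res  # the infection has died out: I is (numerically) zero and S < gamma / beta
```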
<h2 id="the-sirs-model">The SIRS Model</h2>
<p>The SIR model assumes that people, once recovered, are immune to the disease forever, and so any outbreak occurs only once and then never comes back. More interesting dynamics occur when we allow for the reinfection of recovered people; we can then ask, for example, under what circumstances the disease becomes established in the population. The SIRS model extends the SIR model by allowing the recovered population to become susceptible again (hence the extra ‘S’). It assumes that the susceptible population increases proportionally to the recovered population such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \\[0.50em]
\frac{dR}{dt} &= \gamma I - \mu R\enspace ,
\end{aligned} %]]></script>
<p>where, since we added $\mu R$ to the change in the proportion of susceptible people, we had to subtract $\mu R$ from the change in the proportion of recovered people. We again make the simplifying assumption that the overall population does not change, and so it suffices to study the following system:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace ,
\end{aligned} %]]></script>
<p>since $R(t) = 1 - S(t) - I(t)$. We adjust our implementation of Euler’s method:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">dimnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">)))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I0</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">R</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">R</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dS</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s1">'Set1'</span><span class="p">)</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="w">
</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Subpopulations(t)'</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Time t'</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.75</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="n">font.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">4000</span><span class="p">),</span><span class="w"> </span><span class="n">font.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">xaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">yaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="w">
</span><span class="m">3000</span><span class="p">,</span><span class="w"> </span><span class="m">0.95</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">),</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$, a fixed reinfection rate of $\mu = 1/8$, and an increasing rate of infection $\beta$ for the initial condition $S_0 = 0.95$, $I_0 = 0.05$, and $R_0 = 0$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>As for the SIR model, we again find that no outbreak occurs for $R_0 = \beta / \gamma < 1$, which is the case for the left panel. Most interestingly, however, we find that the proportion of infected people <em>does not</em>, in contrast to the SIR model, decrease to zero for the other panels. Instead, the disease becomes established in the population when $R_0 > 1$, and the middle and the right panel show different fixed points.</p>
<p>How do things change when we vary the reinfection rate $\mu$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller reinfection rate $\mu$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>We again find no outbreak in the left panel, and outbreaks of increasing severity in the middle and right panel. Both these outbreaks are less severe compared to the outbreaks in the previous figures, as we would expect given a decrease in the reinfection rate. Similarly, the system seems to stabilize at different fixed points. In the next section, we provide a more formal analysis of the fixed points and their stability.</p>
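To see this stabilization numerically, here is a compact, self-contained restatement of the Euler scheme implemented above. The parameter values $\beta = 3/8$ and $\gamma = \mu = 1/8$, as well as the step size and number of steps, are illustrative choices of mine, not necessarily those used in the figures.

```r
# Self-contained Euler integration of the SIRS model (a compact version of
# solve_SIRS above), checking that the trajectory settles at the endemic
# fixed point for beta = 3/8 and gamma = mu = 1/8.
beta <- 3/8; gamma <- 1/8; mu <- 1/8
dt <- 0.01
S <- 0.95; I <- 0.05
for (step in seq_len(100000)) {  # integrate up to t = 1000
  R <- 1 - S - I
  S_new <- S + dt * (-beta * I * S + mu * R)
  I_new <- I + dt * (beta * I * S - gamma * I)
  S <- S_new
  I <- I_new
}
c(S = S, I = I)  # settles near (1/3, 1/3)
```

For these parameters the trajectory spirals into the endemic fixed point $(S^{\star}, I^{\star}) = (1/3, 1/3)$ derived in the next section.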
<h2 id="analyzing-fixed-points-1">Analyzing Fixed Points</h2>
<p>To find the fixed points of the SIRS model, we again seek solutions for which:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= - \beta IS + \mu (1 - S - I) \\[0.50em]
0 &= \beta IS - \gamma I \enspace ,
\end{aligned} %]]></script>
<p>where we have substituted $R = 1 - S - I$ and from which it follows that also $\dot{R} = 0$ since we assume that the overall population does not change. We immediately see that, in contrast to the SIR model, $I = 0$ cannot be a fixed point for <em>any</em> $S$ because of the added term which depends on $\mu$. Instead, it is a fixed point only for $S = 1$. To get the other fixed point, note that the last equation gives $S = \gamma / \beta$, which plugged into the first equation yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= -I\gamma + \mu\left(1 - \frac{\gamma}{\beta} - I\right) \\[0.50em]
I\gamma &= \mu\left(1 - \frac{\gamma}{\beta}\right) - \mu I \\[0.50em]
I(\gamma + \mu) &= \mu\left(1 - \frac{\gamma}{\beta}\right) \\[0.50em]
I &= \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} \enspace .
\end{aligned} %]]></script>
<p>Therefore, the fixed points are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(S^{\star}, I^{\star}) &= (1, 0) \\[0.50em]
(S^{\star}, I^{\star}) &= \left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \enspace .
\end{aligned} %]]></script>
<p>Note that the second fixed point does not exist when $\gamma / \beta > 1$, since the proportion of infected people cannot be negative. Another, more intuitive perspective on this is to write $\gamma / \beta > 1$ as $R_0 = \beta / \gamma < 1$. This allows us to see that the second fixed point, which would have a non-zero proportion of infected people in the population, does not exist when $R_0 < 1$, as then no outbreak occurs. We will come back to this in a moment.</p>
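As a quick sanity check, we can plug the second fixed point back into the differential equations and confirm that both derivatives vanish. The parameter values below are illustrative choices, not values prescribed by the text.

```r
# Check that the endemic fixed point annihilates both derivatives for
# illustrative parameter values beta = 3/8, gamma = mu = 1/8.
beta <- 3/8; gamma <- 1/8; mu <- 1/8
S_star <- gamma / beta
I_star <- mu * (1 - gamma / beta) / (gamma + mu)
dS <- -beta * I_star * S_star + mu * (1 - S_star - I_star)
dI <- beta * I_star * S_star - gamma * I_star
c(S_star = S_star, I_star = I_star, dS = dS, dI = dI)
```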
<p>To assess the stability of the fixed points, we derive the Jacobian matrix for the SIRS model:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
J &= \begin{pmatrix}
\frac{\partial}{\partial S} \left(-\beta I S + \mu(1 - S - I)\right) & \frac{\partial }{\partial I} \left(-\beta I S + \mu(1 - S - I)\right) \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
&=
\begin{pmatrix}
-\beta I - \mu & -\beta S - \mu \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>For the fixed point $(S^{\star}, I^{\star}) = (1, 0)$ we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
J_{(1, 0)} = \begin{pmatrix}
- \mu & -\beta - \mu \\
0 & \beta - \gamma
\end{pmatrix} \enspace , %]]></script>
<p>which is again upper-triangular and therefore has eigenvalues $\lambda_1 = -\mu$ and $\lambda_2 = \beta - \gamma$. This means it is unstable whenever $\beta > \gamma$ since then $\lambda_2 > 0$, and any infected individual spreads the disease. The Jacobian at the second fixed point is:</p>
<script type="math/tex; mode=display">% <![CDATA[
J_{\left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right)} = \begin{pmatrix}
-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu & -\gamma - \mu \\
\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} & 0
\end{pmatrix} \enspace , %]]></script>
<p>where the bottom-right entry $\beta S^{\star} - \gamma$ vanishes because $S^{\star} = \gamma / \beta$. This looks more daunting. However, we know from the previous blog post that to classify the stability of the fixed point, it suffices to look at the trace $\tau$ and determinant $\Delta$ of the Jacobian, which are given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\tau &= -\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu \\[0.50em]
\Delta &= \left(-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu\right) \cdot 0 - \left(- \gamma - \mu\right)\left(\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \\[0.50em]
&= \beta\mu\left(1 - \frac{\gamma}{\beta}\right) \enspace .
\end{aligned} %]]></script>
<p>The trace can be written as $\tau = \lambda_1 + \lambda_2$ and the determinant can be written as $\Delta = \lambda_1 \lambda_2$, as shown in a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>. Here, we have that $\tau < 0$ because both of its terms are negative, and $\Delta > 0$ because $\gamma / \beta < 1$ whenever this fixed point exists. This constrains $\lambda_1$ and $\lambda_2$ to have negative real parts, and thus the fixed point is stable.</p>
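We can corroborate this numerically by filling in the general Jacobian derived above with concrete values and inspecting its eigenvalues. The parameters $\beta = 3/8$ and $\gamma = \mu = 1/8$ are again illustrative choices.

```r
# Eigenvalues of the SIRS Jacobian at the endemic fixed point, using the
# general entries J = ((-beta*I - mu, -beta*S - mu), (beta*I, beta*S - gamma))
# derived above; beta = 3/8 and gamma = mu = 1/8 are illustrative choices.
beta <- 3/8; gamma <- 1/8; mu <- 1/8
S_star <- gamma / beta
I_star <- mu * (1 - gamma / beta) / (gamma + mu)
J <- matrix(c(
  -beta * I_star - mu, -beta * S_star - mu,
   beta * I_star,       beta * S_star - gamma
), nrow = 2, byrow = TRUE)
eigen(J)$values  # both real parts negative: the fixed point is stable
```

For these values the eigenvalues come out complex with negative real part, so trajectories spiral into the fixed point, consistent with the oscillatory approach visible in the trajectory plots.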
<h2 id="vector-fields-and-nullclines">Vector Fields and Nullclines</h2>
<p>As previously done for the SIR model, we can again visualize the directions in which the system changes at any point using a vector field.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_vectorfield_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">))</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">SI</span><span class="p">[</span><span class="n">apply</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="c1"># S + I <= 1 must hold</span><span class="w">
</span><span class="n">dSI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">dS</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below visualizes the vector field for the SIRS model, several trajectories, and the nullclines for $\gamma = 1/8$ and $\mu = 1/8$ for $\beta = 1/8$ (left panel) and $\beta = 3/8$ (right panel). The left panel shows that there exists only one stable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$ to which all trajectories converge.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>The right panel, on the other hand, shows <em>two</em> fixed points: one unstable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$, which we only reach when $I_0 = 0$, and a stable one at</p>
<script type="math/tex; mode=display">(S^{\star}, I^{\star}) = \left(\frac{1/8}{3/8}, \frac{1/8\left(1 - \frac{1/8}{3/8}\right)}{1/8 + 1/8}\right) = (1/3, 1/3) \enspace .</script>
<p>In contrast to the SIR model, therefore, there exists a stable fixed point constituting a population which includes infected people, and so the disease is not eradicated but stays in the population.</p>
<p>The dashed lines give the nullclines. The $I$-nullcline gives the set of points where $\dot{I} = 0$, which are — as in the SIR model above — given by $I = 0$ and $S = \gamma / \beta$. The $S$-nullcline is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= - \beta I S + \mu(1 - S - I) \\[0.50em]
\beta I S &= \mu(1 - S) - \mu I \\[0.50em]
I \left(\beta S + \mu\right) &= \mu(1 - S) \\[0.50em]
I &= \frac{\mu(1 - S)}{\beta S + \mu} \enspace ,
\end{aligned} %]]></script>
<p>which is a nonlinear function in $S$. The nullclines again help us in “organizing” the vector field. This can be seen best in the right panel above. In particular, and similar to the SIR model, we have a decrease in the proportion of infected people to the left of the line given by $S = \gamma / \beta$, that is, when $S < \gamma / \beta$, and an increase to the right of the line, that is, when $S > \gamma / \beta$. Similarly, the proportion of susceptible people increases when the system is “below” the $S$-nullcline, while it decreases when the system is “above” the $S$-nullcline.</p>
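As a check on the $S$-nullcline formula, we can evaluate $\dot{S}$ along it and confirm that it vanishes everywhere. The values of $\beta$ and $\mu$ below are arbitrary illustrative choices.

```r
# dS/dt should vanish everywhere along the S-nullcline
# I = mu * (1 - S) / (beta * S + mu); beta and mu are arbitrary choices.
beta <- 3/8; mu <- 1/8
S <- seq(0, 1, by = 0.05)
I_null <- mu * (1 - S) / (beta * S + mu)
dS <- -beta * I_null * S + mu * (1 - S - I_null)
max(abs(dS))  # zero up to floating-point error
```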
<h2 id="bifurcations">Bifurcations</h2>
<p>In the vector fields above we have seen that the system can go from having only one fixed point to having two fixed points. Whenever a fixed point is destroyed or created or changes its stability as an internal parameter is varied — here the ratio of $\gamma / \beta$ — we speak of a <em>bifurcation</em>.</p>
<p>As pointed out above, the second equilibrium point only exists for $\gamma / \beta \leq 1$. As long as $\gamma / \beta < 1$, we have two distinct fixed points. At $\gamma / \beta = 1$, the second fixed point becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(S^{\star}, I^{\star}) &= \left(1, \frac{\mu\left(1 - 1\right)}{\gamma + \mu}\right) = (1, 0) \enspace ,
\end{aligned} %]]></script>
<p>which equals the first fixed point. Thus, at $\gamma / \beta = 1$, the two fixed points merge into one; this is the bifurcation point. This makes sense: if $\gamma / \beta < 1$, we have that $\beta / \gamma > 1$, and so an outbreak occurs, which establishes the disease in the population since we allow for reinfections.</p>
<p>We can visualize this change in fixed points in a so-called <em>bifurcation diagram</em>. A bifurcation diagram shows how the fixed points and their stability change as we vary an internal parameter. Since we deal with two-dimensional fixed points, we split the bifurcation diagram into two: the left panel shows how the $I^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$, and the right panel shows how the $S^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-15-1.png" title="plot of chunk unnamed-chunk-15" alt="plot of chunk unnamed-chunk-15" style="display: block; margin: auto;" /></p>
<p>The left panel shows that as long as $\gamma / \beta < 1$, which implies that $\beta / \gamma > 1$, we have two fixed points where the stable fixed point is the one with a non-zero proportion of infected people — the disease becomes established. These fixed points lie on the diagonal line, indicated by black dots. Interestingly, this shows that the proportion of infected people can never be stable at a value larger than $1/2$. There also exist unstable fixed points for which $I^{\star} = 0$. These fixed points are unstable because if even a single person is infected, she will spread the disease, resulting in more infected people. At the point where $\beta = \gamma$, the two fixed points merge: the disease can no longer be established in the population, and the proportion of infected people always goes to zero.</p>
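The stable branch in the left panel can be traced numerically from the expression $I^{\star} = \mu(1 - \gamma/\beta) / (\gamma + \mu)$. The diagram appears to use $\mu = \gamma$ — an assumption on my part — in which case the branch simplifies to $I^{\star} = (1 - \gamma/\beta)/2$, which never exceeds $1/2$:

```r
# The endemic branch I* = mu * (1 - ratio) / (gamma + mu) as a function of
# ratio = gamma / beta. Setting mu = gamma (an assumption about the diagram)
# gives I* = (1 - ratio) / 2, which never exceeds 1/2 and reaches 0 at the
# bifurcation point ratio = 1.
gamma <- 1/8; mu <- 1/8
ratio <- seq(0, 1, by = 0.25)  # gamma / beta
I_star <- mu * (1 - ratio) / (gamma + mu)
rbind(ratio, I_star)
```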
<p>Similarly, the right panel shows how the fixed points $S^{\star}$ change as a function of $\gamma / \beta$. Since the infection spreads for $\beta > \gamma$, the fixed point $S^{\star} = 1$ is unstable, as the proportion of susceptible people must decrease since they become infected. For outbreaks that become increasingly mild as $\gamma / \beta \rightarrow 1$, the stable proportion of susceptible people increases, reaching $S^{\star} = 1$ when at last $\gamma = \beta$.</p>
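<p>These fixed points can be verified numerically. Below is a minimal sketch (in Python rather than the post's R) that integrates the standard SIRS equations with Euler's method; the parameter values $\beta = 2$, $\gamma = 1$, $\mu = 0.5$ are illustrative choices, not taken from the post, and the endemic fixed point is $S^{\star} = \gamma / \beta$, $I^{\star} = \mu (1 - \gamma / \beta) / (\gamma + \mu)$ as derived above.</p>

```python
# Standard SIRS equations (assumed form):
#   dS/dt = -beta*S*I + mu*R
#   dI/dt =  beta*S*I - gamma*I
#   dR/dt =  gamma*I  - mu*R
beta, gamma, mu = 2.0, 1.0, 0.5

def euler_step(state, dt):
    S, I, R = state
    dS = -beta * S * I + mu * R
    dI = beta * S * I - gamma * I
    dR = gamma * I - mu * R
    return (S + dt * dS, I + dt * dI, R + dt * dR)

state = (0.99, 0.01, 0.0)  # start with 1% infected
for _ in range(20000):     # integrate up to t = 200 with dt = 0.01
    state = euler_step(state, 0.01)

S_star = gamma / beta                      # 0.5
I_star = mu * (1 - S_star) / (gamma + mu)  # 1/6
print(round(state[0], 3), round(state[1], 3))  # converges to (0.5, 0.167)
```

<p>Since $\beta > \gamma$ here, the disease becomes established: the trajectory settles into the endemic fixed point instead of dying out.</p>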
<p>In summary, we have seen how the SIRS model extends the SIR model by allowing reinfections. This resulted in the possibility of more interesting fixed points, including ones with a non-zero proportion of infected people. In the SIRS model, then, a disease can become established in the population. In contrast to the SIR model, we have also seen that the SIRS model allows for bifurcations, going from two fixed points in times of outbreaks ($\beta > \gamma$) to one fixed point in times of no outbreaks ($\beta < \gamma$).</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen that nonlinear differential equations are a powerful tool to model real-world phenomena. They allow us to model vastly more complicated behaviour than is possible with linear differential equations, yet they rarely admit a closed-form solution. Luckily, the time-evolution of a system can be straightforwardly computed with basic numerical techniques such as Euler’s method. Using the simple logistic equation, we have seen how to analyze the stability of fixed points — simply pretend the system is linear close to a fixed point.</p>
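<p>Euler's method itself fits in a few lines. The sketch below (Python rather than the post's R, with an arbitrary step size and initial value) applies it to the logistic equation $\dot{N} = r N (1 - N)$ with $r = 1$ and carrying capacity $K = 1$:</p>

```python
def euler(f, x0, dt, n_steps):
    """Euler's method: x_{n+1} = x_n + dt * f(x_n)."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] + dt * f(xs[-1]))
    return xs

r = 1.0
logistic = lambda N: r * N * (1 - N)  # dN/dt with carrying capacity K = 1

traj = euler(logistic, 0.10, 0.01, 2000)  # from N0 = 0.10 up to t = 20
print(round(traj[-1], 4))  # sigmoidal growth settles at the stable fixed point N = 1
```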
<p>The logistic equation has only one state variable — the size of the population. More interesting dynamics occur when variables interact, and we have seen how the simple SIR model can help us understand the spread of infectious disease. With only two parameters, the SIR model shows that an outbreak occurs only when $R_0 = \beta / \gamma > 1$. Moreover, the stable fixed points always included $I = 0$, implying that the disease always gets eradicated. This is not true for all diseases because recovered people might become reinfected. The SIRS model amends this by introducing a parameter $\mu$ that quantifies how quickly recovered people can become susceptible again. As expected, this led to stable states in which the disease becomes established in the population.</p>
<p>On our journey to understand these systems, we have seen how to quantify the stability of a fixed point using linear stability analysis, how to visualize the dynamics of a system using vector fields, how nullclines give structure to such vector fields, and how bifurcations can drastically change the dynamics of a system.</p>
<p>The SIR and the SIRS models discussed here are without a doubt crude approximations of the real dynamics of the spread of infectious diseases. There exist <a href="https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#Elaborations_on_the_basic_SIR_model">several ways to extend them</a>. One way, for example, is to add an <em>exposed</em> compartment of people who are infected but not yet infectious; see <a href="https://gabgoh.github.io/COVID/index.html">here</a> for a visualization of an elaborated version of this model in the context of SARS-CoV-2. These basic compartmental models assume spatial homogeneity, which is a substantial simplification. There are various ways to include spatial structure (e.g., Watts, 2005; Riley, 2007), but that is for another blog post.</p>
<hr />
<p>I would like to thank <a href="https://twitter.com/theBonferroni">Adam Finnemann</a>, <a href="https://twitter.com/AnToniPichler">Anton Pichler</a>, and <a href="https://twitter.com/Oisin_Ryan_">Oisín Ryan</a> for very helpful comments on this blog post.</p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Strogatz, S. H. (<a href="http://www.stevenstrogatz.com/books/nonlinear-dynamics-and-chaos-with-applications-to-physics-biology-chemistry-and-engineering">2015</a>). Nonlinear Dynamics and Chaos: With Applications to Physics, Biology, Chemistry, and Engineering. Colorado, US: Westview Press.</li>
<li>Hirsch, M. W., Smale, S., & Devaney, R. L. (<a href="https://books.google.nl/books?hl=en&lr=&id=rly1AAmAXh8C&oi=fnd&pg=PP1&dq=differential+equations+hirsch+smale&ots=pbe8hf2vQS&sig=XAweKN9n_n00ph33V7heYNjtjbI#v=onepage&q=differential%20equations%20hirsch%20smale&f=false">2013</a>). Differential equations, dynamical systems, and an introduction to chaos. Boston, US: Academic Press.</li>
<li>Riley, S. (<a href="https://science.sciencemag.org/content/316/5829/1298?casa_token=6o-2ffWgMtoAAAAA:N5r-4nxfob2OhYutIaFKh4n5kxTeTMNkiAxLdipRtmFrlIhkLL69NOYUBXdYcUPG_pT8LCiGXFLpY4DI">2007</a>). Large-scale spatial-transmission models of infectious disease. <em>Science, 316</em>(5829), 1298-1301.</li>
<li>Watts, D. J., Muhamad, R., Medina, D. C., & Dodds, P. S. (<a href="https://www.pnas.org/content/102/32/11157">2005</a>). Multiscale, resurgent epidemics in a hierarchical metapopulation model. <em>Proceedings of the National Academy of Sciences, 102</em>(32), 11157-11162.</li>
</ul>Fabian DablanderReviewing one year of blogging2019-12-27T12:00:00+00:002019-12-27T12:00:00+00:00https://fabiandablander.com/r/Reviewing-2019<p>Writing blog posts has been one of the most rewarding experiences for me over the last year. Some posts turned out quite long, others I could keep more concise. Irrespective of length, however, I have managed to publish one post every month, and you can infer the occasional frenzy that ensued from the distribution of the dates the posts appeared on — nine of them saw the light within the last three days of a month.</p>
<p>Some births were easier than others, yet every post evokes distinct memories: of perusing history books in the library and the Saturday sun; of writing down Gaussian integrals in overcrowded trains; of solving differential equations while singing; of hunting down typos before hurrying to parties. So to end this very productive year of blogging, below I provide a teaser of each previous post, summarizing one or two key takeaways. Let’s go!</p>
<h1 id="an-introduction-to-causal-inference">An introduction to Causal inference</h1>
<p>Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. It dethrones randomized controlled trials as the only tool to license causal statements, describing the conditions under which this feat is possible even in observational data.</p>
<p>One key takeaway is to think about causal inference in a hierarchy. Association is at the most basic level, merely allowing us to say that two variables are somehow related. Moving upwards, the <em>do</em>-operator allows us to model interventions, answering questions such as “what would happen if we force every patient to take the drug”? Directed Acyclic Graphs (DAGs), as visualized in the figure below, allow us to visualize associations and causal relations.</p>
<center>
<img src="../assets/img/Seeing-vs-Doing-II.png" align="center" style="padding: 00px 00px 00px 00px;" width="750" height="500" />
</center>
<p>On the third and final level we find counterfactual statements. These follow from so-called <em>Structural Causal Models</em> — the building block of this approach to causal inference. Counterfactuals allow us to answer questions such as “would the patient have recovered had she been given the drug, even though she has not received the drug and did not recover”? Needless to say, this requires strong assumptions; yet if we want to endow machines with human-level reasoning or formalize concepts such as fairness, we need to make such strong assumptions.</p>
<p>One key practical takeaway from this blog post is the definition of confounding: an effect is confounded if $p(Y \mid X) \neq p(Y \mid do(X = x))$. This means that blindly entering all variables into a regression to “control” for them is misguided; instead, one should carefully think about the underlying causal relations between variables so as to not induce spurious associations. You can read the full blog post <a href="https://fabiandablander.com/r/Causal-Inference.html">here</a>.</p>
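<p>The inequality that defines confounding can be made concrete with a toy example in which a binary confounder $Z$ causes both $X$ and $Y$, while $X$ has no causal effect on $Y$ at all. The probabilities in the sketch below are illustrative numbers, not taken from the post:</p>

```python
pz = {0: 0.5, 1: 0.5}    # p(Z = z)
px_z = {0: 0.2, 1: 0.8}  # p(X = 1 | Z = z)
py_z = {0: 0.2, 1: 0.8}  # p(Y = 1 | Z = z); X has no effect on Y

# Observational: p(Y = 1 | X = 1), obtained by summing over the confounder
px1 = sum(pz[z] * px_z[z] for z in (0, 1))
obs = sum(pz[z] * px_z[z] * py_z[z] for z in (0, 1)) / px1

# Interventional: p(Y = 1 | do(X = 1)); the intervention cuts the arrow Z -> X
do = sum(pz[z] * py_z[z] for z in (0, 1))

print(round(obs, 2), round(do, 2))  # 0.68 vs 0.5: the effect is confounded
```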
<h1 id="a-brief-primer-on-variational-inference">A brief primer on Variational Inference</h1>
<p>Bayesian inference using Markov chain Monte Carlo can be notoriously slow. The key idea behind variational inference is to recast Bayesian inference as an optimization problem. In particular, we try to find a distribution $q^\star(\mathbf{z})$ that best approximates the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$ in terms of the Kullback-Leibler divergence:</p>
<script type="math/tex; mode=display">q^\star(\mathbf{z}) = \underset{q(\mathbf{z}) \in \mathrm{Q}}{\text{argmin}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace .</script>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In this blog post, I explain how a particular form of variational inference — <em>coordinate ascent mean-field variational inference</em> — leads to fast computations. Specifically, I walk you through deriving the variational inference scheme for a simple linear regression example. One key takeaway from this post is that Bayesians can use optimization to speed up computation. However, variational inference requires problem-specific, often tedious calculations. Black-box variational inference schemes can alleviate this issue, but Stan’s implementation — <em>automatic differentiation variational inference</em> — seems to work poorly, as detailed in the post (see also Ben Goodrich’s comment). You can read the full blog post <a href="https://fabiandablander.com/r/Variational-Inference.html">here</a>.</p>
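<p>For intuition about the objective being minimized, the KL divergence between two univariate Gaussians is available in closed form; the short sketch below (an illustration, not code from the post) evaluates it:</p>

```python
import math

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) in closed form."""
    return math.log(sigma2 / sigma1) + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2 * sigma2 ** 2) - 0.5

print(kl_gauss(0, 1, 0, 1))  # identical distributions: 0.0
print(kl_gauss(0, 1, 2, 1))  # shifting the mean by two standard deviations: 2.0
```

<p>Note that the KL divergence is not symmetric; variational inference minimizes $\text{KL}(q \,\lvert\lvert\, p)$, not $\text{KL}(p \,\lvert\lvert\, q)$.</p>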
<h1 id="harry-potter-and-the-power-of-bayesian-constrained-inference">Harry Potter and the Power of Bayesian Constrained Inference</h1>
<p>Are you a Gryffindor, Slytherin, Hufflepuff, or Ravenclaw? In this blog post, I explain a <em>prior predictive</em> perspective on model selection by having Harry, Ron, and Hermione — three subjective Bayesians — engage in a small prediction contest. There are two key takeaways. First, the prior does not completely constrain a model’s prediction, as these are being made by combining the prior with the likelihood. For example, even though Ron has a point prior on $\theta = 0.50$ in the figure below, his prediction is not that $y = 5$ always; instead, he predicts a distribution that is centered around $y = 5$. Similarly, while Hermione believes that $\theta > 0.50$, she puts probability mass on values $y < 5$.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>The second takeaway is computational. In particular, one can compute the Bayes factor of the <em>unconstrained</em> model ($\mathcal{M}_1$) — in which the parameter $\theta$ is free to vary — against a <em>constrained</em> model ($\mathcal{M}_r$) — in which $\theta$ is order-constrained (e.g., $\theta > 0.50$) — as:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace .</script>
<p>In words, this Bayes factor is given by the ratio of the posterior probability of $\theta$ being in line with the restriction compared to the prior probability of $\theta$ being in line with the restriction. You can read the full blog post <a href="https://fabiandablander.com/r/Bayes-Potter.html">here</a>.</p>
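<p>This ratio is easy to compute for a binomial example. The sketch below uses hypothetical data, $y = 8$ successes in $n = 10$ trials (numbers not taken from the post), with a uniform Beta(1, 1) prior, so that the posterior is Beta($1 + y$, $1 + n - y$); the tail probabilities are obtained by simple trapezoidal integration of the Beta density:</p>

```python
import math

def beta_tail(a, b, lo, grid=100_000):
    """P(theta > lo) under a Beta(a, b) distribution, via trapezoidal integration."""
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    f = lambda t: t ** (a - 1) * (1 - t) ** (b - 1) / norm
    h = (1 - lo) / grid
    ts = [lo + i * h for i in range(grid + 1)]
    return h * (sum(f(t) for t in ts) - 0.5 * (f(ts[0]) + f(ts[-1])))

y, n = 8, 10
posterior = beta_tail(1 + y, 1 + n - y, 0.5)  # P(theta > 0.5 | y) under Beta(9, 3)
prior = beta_tail(1, 1, 0.5)                  # P(theta > 0.5) = 0.5 under Beta(1, 1)
print(round(posterior / prior, 2))            # Bayes factor for theta > 0.5: 1.93
```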
<h1 id="love-affairs-and-linear-differential-equations">Love affairs and linear differential equations</h1>
<blockquote>
When you can fall for chains of silver, you can fall for chains of gold <br />
You can fall for pretty strangers and the promises they hold <br />
You promised me everything, you promised me thick and thin, yeah <br />
Now you just say "Oh, Romeo, yeah, you know I used to have a scene with him"
</blockquote>
<p>Differential equations are the sine qua non of modeling how systems change. This blog post provides an introduction to <em>linear</em> differential equations, which admit closed-form solutions, and analyzes the stability of fixed points.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>The key takeaways are that the natural basis of analysis is the basis spanned by the eigenvectors, and that the stability of fixed points depends directly on the eigenvalues. A system with imaginary eigenvalues can exhibit oscillating behaviour, as shown in the figure above.</p>
<p>I think I rarely had more fun writing than when writing this blog post. Inspired by Strogatz (1988), it playfully introduces linear differential equations by classifying the types of relationships Romeo and Juliet might find themselves in. While writing it, I also listened to a lot of Dire Straits, Bob Dylan, Daft Punk, and others, whose lyrics decorate the post’s sections. You can read the full blog post <a href="https://fabiandablander.com/r/Linear-Love.html">here</a>.</p>
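<p>The role of the eigenvalues can be illustrated with a tiny computation. The matrix below is an illustrative choice, not one of the post's Romeo-and-Juliet systems: its eigenvalues are purely imaginary, which is the signature of the oscillating behaviour mentioned above.</p>

```python
import cmath

def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]], computed from the trace and determinant."""
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)
    return (tr + disc) / 2, (tr - disc) / 2

# dx/dt = A x with A = [[0, 1], [-1, 0]] has eigenvalues +i and -i,
# so its solutions circle the fixed point forever rather than decay or explode
print(eigenvalues_2x2(0, 1, -1, 0))  # (1j, -1j)
```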
<h1 id="the-fibonacci-sequence-and-linear-algebra">The Fibonacci sequence and linear algebra</h1>
<p>1, 1, 2, 3, 5, 8, 13, 21, … The Fibonacci sequence might well be the most widely known mathematical sequence. In this blog post, I discuss how Leonardo Bonacci derived it as a solution to a puzzle about procreating rabbits, and how linear algebra can help us find a closed-form expression of the $n^{\text{th}}$ Fibonacci number.</p>
<div style="text-align:center;">
<img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" />
</div>
<p>The key insight is to realize that the $n^{\text{th}}$ Fibonacci number can be computed by repeatedly performing matrix multiplications. If one <em>diagonalizes</em> this matrix, changing basis to — again! — the eigenbasis, then the repeated application of this matrix can be expressed as a scalar power, yielding a closed-form expression of the $n^{\text{th}}$ Fibonacci number. That’s a mouthful; you can read the blog post which explains things much better <a href="https://fabiandablander.com/r/Fibonacci.html">here</a>.</p>
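<p>Carrying out that diagonalization yields Binet's formula, with the golden ratio as the dominant eigenvalue of the Fibonacci matrix. A small sketch, checked against plain iteration:</p>

```python
import math

def fib_binet(n):
    """n-th Fibonacci number via the eigendecomposition of [[1, 1], [1, 0]]."""
    phi = (1 + math.sqrt(5)) / 2  # dominant eigenvalue: the golden ratio
    psi = (1 - math.sqrt(5)) / 2  # second eigenvalue
    return round((phi ** n - psi ** n) / math.sqrt(5))

def fib_iter(n):
    """Reference implementation by iteration."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

print([fib_binet(n) for n in range(1, 9)])  # [1, 1, 2, 3, 5, 8, 13, 21]
```

<p>The rounding absorbs floating-point error, which stays small because $\lvert \psi \rvert < 1$, so the $\psi^n$ term vanishes as $n$ grows.</p>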
<h1 id="spurious-correlations-and-random-walks">Spurious correlations and random walks</h1>
<p>I was at the Santa Fe Complex Systems Summer School — the experience of a lifetime — when Anton Pichler and Andrea Bacilieri, two economists, told me that two independent random walks can be correlated substantially. I was quite shocked, to be honest. This blog post investigates this issue, concluding that regressing one random walk onto another is <em>nonsensical</em>, that is, leads to an inconsistent parameter estimate.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>As the figure above shows, such spurious correlation also occurs for independent AR(1) processes with increasing autocorrelation $\phi$, even though the resulting estimate is consistent. The key takeaway is therefore to be careful when correlating time-series. You can read the full blog post <a href="https://fabiandablander.com/r/Spurious-Correlation.html">here</a>.</p>
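<p>The phenomenon is easy to reproduce. The sketch below, an illustration in Python rather than the post's R code, correlates twenty pairs of independent Gaussian random walks; despite the independence, large sample correlations are common.</p>

```python
import random

random.seed(1)

def random_walk(n):
    """Cumulative sum of independent standard normal increments."""
    x, xs = 0.0, []
    for _ in range(n):
        x += random.gauss(0, 1)
        xs.append(x)
    return xs

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Correlations between pairs of *independent* random walks
cors = [pearson(random_walk(200), random_walk(200)) for _ in range(20)]
print(round(max(abs(c) for c in cors), 2))  # typically far from zero
```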
<h1 id="bayesian-modeling-using-stan-a-case-study">Bayesian modeling using Stan: A case study</h1>
<p>Model selection is a difficult problem. In Bayesian inference, we may distinguish between two approaches to model selection: a <em>(prior) predictive</em> perspective based on marginal likelihoods, and a <em>(posterior) predictive</em> perspective based on leave-one-out cross-validation.</p>
<p><img src="../assets/img/prediction-perspectives.png" align="center" style="padding: 10px 10px 10px 10px;" /></p>
<p>A prior predictive perspective — illustrated in the left part of the figure above — evaluates models based on their predictions about the data actually observed. These predictions are made by combining likelihood and prior. In contrast, a posterior predictive perspective — illustrated in the right panel of the figure above — evaluates models based on their predictions about data that we have not observed. These predictions cannot be directly computed, but can be approximated by combining likelihood and posterior in a leave-one-out cross-validation scheme. The key takeaway of this blog post is to appreciate this distinction, noting that not all Bayesians agree on how to select among models.</p>
<p>The post illustrates these two perspectives with a case study: does the relation between practice and reaction time follow a power law or an exponential function? You can read the full blog post <a href="https://fabiandablander.com/r/Law-of-Practice.html">here</a>.</p>
<h1 id="two-perspectives-on-regularization">Two perspectives on regularization</h1>
<p>Regularization is the process of adding information to an estimation problem so as to avoid extreme estimates. This blog post explores regularization both from a Bayesian and from a classical perspective, using the simplest example possible: estimating the bias of a coin.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>The key takeaway is the observation that Bayesians have a natural tool for regularization at their disposal: the prior. In contrast to the left panel in the figure above, which shows a flat prior, the right panel illustrates that using a weakly informative prior that peaks at $\theta = 0.50$ shifts the resulting posterior distribution towards that value. In classical statistics, one usually uses penalized maximum likelihood approaches — think lasso and ridge regression — to achieve regularization. You can read the full blog post <a href="https://fabiandablander.com/r/Regularization.html">here</a>.</p>
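<p>The Bayesian mechanics fit in one line: with a symmetric Beta($a$, $a$) prior and $y$ heads in $n$ flips, the posterior is Beta($a + y$, $a + n - y$), whose mean shrinks the maximum likelihood estimate towards $0.5$. The numbers below are illustrative, not taken from the post.</p>

```python
def posterior_mean(y, n, a):
    """Posterior mean of theta under a Beta(a, a) prior with y heads in n flips."""
    return (a + y) / (2 * a + n)

y, n = 9, 10
print(y / n)                              # maximum likelihood estimate: 0.9
print(round(posterior_mean(y, n, 1), 3))  # flat Beta(1, 1) prior: 0.833
print(posterior_mean(y, n, 5))            # weakly informative Beta(5, 5) prior: 0.7
```

<p>The stronger the prior (larger $a$), the more the estimate is pulled towards $0.5$, which is exactly the regularizing behaviour shown in the figure above.</p>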
<h1 id="variable-selection-using-gibbs-sampling">Variable selection using Gibbs sampling</h1>
<p>“Which variables are important?” is a key question in science and statistics. In this blog post, I focus on linear models and discuss a Bayesian solution to this problem using spike-and-slab priors and the Gibbs sampler, a computational method to sample from a joint distribution using only conditional distributions.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>Parameter estimation is almost always conditional on a specific model. One key takeaway from this blog post is that there is uncertainty associated with the model itself. The approach outlined in the post accounts for this uncertainty by using spike-and-slab priors, yielding posterior distributions not only for parameters but also for models. To incorporate this model uncertainty into parameter estimation, one can average across models; the figure above shows the <em>model-averaged</em> posterior distribution for six variables discussed in the post. You can read the full blog post <a href="https://fabiandablander.com/r/Spike-and-Slab.html">here</a>.</p>
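<p>As a toy illustration of the Gibbs sampler itself, unrelated to the spike-and-slab model and not taken from the post, the sketch below samples from a standard bivariate Gaussian with correlation $\rho = 0.8$ using only the two full conditionals:</p>

```python
import random

random.seed(0)
rho = 0.8
sd = (1 - rho ** 2) ** 0.5  # conditional standard deviation

# Gibbs sampling: alternately draw each coordinate from its full conditional
x, y, samples = 0.0, 0.0, []
for _ in range(20000):
    x = random.gauss(rho * y, sd)  # x | y ~ N(rho * y, 1 - rho^2)
    y = random.gauss(rho * x, sd)  # y | x ~ N(rho * x, 1 - rho^2)
    samples.append((x, y))

draws = samples[1000:]  # discard burn-in
n = len(draws)
mx = sum(p[0] for p in draws) / n
my = sum(p[1] for p in draws) / n
cov = sum((p[0] - mx) * (p[1] - my) for p in draws) / n
vx = sum((p[0] - mx) ** 2 for p in draws) / n
vy = sum((p[1] - my) ** 2 for p in draws) / n
corr = cov / (vx * vy) ** 0.5
print(round(corr, 1))  # the sample correlation recovers rho = 0.8
```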
<h1 id="two-properties-of-the-gaussian-distribution">Two properties of the Gaussian distribution</h1>
<p>The Gaussian distribution is special for a number of reasons. In this blog post, I focus on two such reasons, namely the fact that it is closed under marginalization and conditioning. This means that if you start out with a <em>p</em>-dimensional Gaussian distribution, and you either <em>marginalize over</em> or <em>condition on</em> one of its components, the resulting distribution will again be Gaussian.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>The figure above illustrates the difference between marginalization and conditioning in the two-dimensional case. The left panel shows a bivariate Gaussian distribution with a high correlation $\rho = 0.80$ (blue contour lines). Conditioning means incorporating information, and observing that $X_2 = 2$ shifts the distribution of $X_1$ towards this value (purple line). If we do not observe $X_2$, we can incorporate our uncertainty about its likely values by marginalizing it out. This results in a Gaussian distribution that is centered on zero (black line). The right panel shows that conditioning on $X_2 = 2$ does not change the distribution of $X_1$ in the case of no correlation $\rho = 0$. You can read the full blog post <a href="https://fabiandablander.com/statistics/Two-Properties.html">here</a>.</p>
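<p>In formulas: for a standard bivariate Gaussian with correlation $\rho$, conditioning gives $X_1 \mid X_2 = x_2 \sim \mathcal{N}(\rho x_2, 1 - \rho^2)$, while marginalizing gives $X_1 \sim \mathcal{N}(0, 1)$. A two-line sketch of the figure's two panels:</p>

```python
def conditional(rho, x2):
    """Mean and variance of X1 | X2 = x2 for a standard bivariate Gaussian."""
    return rho * x2, 1 - rho ** 2

print(conditional(0.8, 2))  # high correlation: mean shifts to 1.6, variance shrinks
print(conditional(0.0, 2))  # no correlation: conditioning leaves N(0, 1) unchanged
```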
<h1 id="curve-fitting-and-the-gaussian-distribution">Curve fitting and the Gaussian distribution</h1>
<p>In this blog post, we take a look at the mother of all curve fitting problems — fitting a straight line to a number of points. The figure below shows that one point in the Euclidean plane is insufficient to define a line (left), two points constrain it perfectly (middle), and three is too much (right). In science we usually deal with more than two data points which are corrupted by noise. How do we fit a line to such noisy observations?</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto 0 auto auto;" /></p>
<p>The method of least squares provides an answer. In addition to an explanation of least squares, a key takeaway of this post is an understanding of the historical context in which least squares arose. Statistics is fascinating in part because of its rich history. On our journey through time we meet Legendre, Gauss, Laplace, and Galton. The latter describes the central limit theorem — one of the most stunning theorems in statistics — in beautifully poetic words:</p>
<blockquote>
<p>“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66)</p>
</blockquote>
<p>You can read the full blog post <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">here</a>.</p>
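<p>For readers who want to see the method in action, here is a minimal least-squares sketch (the true line $y = 2 + 3x$ and the noise level are invented for illustration):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# Fifty noisy observations around a hypothetical true line y = 2 + 3x.
x = np.linspace(0, 1, 50)
y = 2 + 3 * x + rng.normal(scale=0.5, size=x.size)

# Least squares picks the intercept and slope that minimize the sum of
# squared vertical distances from the points to the line.
X = np.column_stack([np.ones_like(x), x])
(intercept, slope), *_ = np.linalg.lstsq(X, y, rcond=None)
print(intercept, slope)   # close to 2 and 3
```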
<p>I hope that you enjoyed reading some of these posts at least a quarter as much as I enjoyed writing them. I am committed to making 2020 a successful year of blogging, too. However, I will most likely decrease the output frequency by half, aiming to publish one post every two months. It is a truth universally acknowledged that a person in want of a PhD must be in possession of publications, and so I will have to shift my focus accordingly (at least a little bit). At the same time, I also want to further increase my involvement in the “data for the social good” scene. Life certainly is one complicated optimization problem. I wish you all the best for the new year!</p>
<hr />
<p><em>I would like to thank Don van den Bergh, Sophia Crüwell, Jonas Haslbeck, Oisín Ryan, Lea Jakob, Quentin Gronau, Nathan Evans, Andrea Bacilieri, and Anton Pichler for helpful comments on (some of) these blog posts.</em></p>Fabian DablanderAn introduction to Causal inference2019-11-30T12:00:00+00:002019-11-30T12:00:00+00:00https://fabiandablander.com/r/Causal-Inference<p><em>An extended version of this blog post is available from <a href="https://psyarxiv.com/b3fkw">here</a>.</em></p>
<p>Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewell Wright, Judea Pearl, and others.</p>
<p>We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the “ladder of causal inference”, from association (<em>seeing</em>) to intervention (<em>doing</em>) to counterfactuals (<em>imagining</em>). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the <em>do</em>-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let’s dive in!<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<h1 id="correlation-and-causation">Correlation and Causation</h1>
<p>Messerli (2012) published a paper entitled “Chocolate Consumption, Cognitive Function, and Nobel Laureates” in <em>The New England Journal of Medicine</em> showing a strong positive relationship between chocolate consumption and the number of Nobel Laureates. I have found an even stronger relationship using updated data<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>, as visualized in the figure below.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>Now, except for <a href="https://www.confectionerynews.com/Article/2012/10/11/Chocolate-creates-Nobel-prize-winners-says-study">people in the chocolate business</a>, it would be quite a stretch to suggest that increasing chocolate consumption would increase the number of Nobel Laureates. Correlation does not imply causation because it does not constrain the possible causal relations enough. Hans Reichenbach (1956) formulated the <em>common cause principle</em> which speaks to this fact:</p>
<blockquote>
<p>If two random variables $X$ and $Y$ are statistically dependent ($X \not \perp Y$), then either (a) $X$ causes $Y$, (b) $Y$ causes $X$, or (c) there exists a third variable $Z$ that causes both $X$ and $Y$. Further, $X$ and $Y$ become independent given $Z$, i.e., $X \perp Y \mid Z$.</p>
</blockquote>
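<p>Reichenbach’s principle is easy to see in simulation. The sketch below (an invented example, with $Z$ a common cause of $X$ and $Y$) shows that the two are marginally correlated, but that the correlation vanishes once the linear effect of $Z$ is partialled out:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Common cause: Z influences both X and Y; X and Y share no direct link.
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = z + rng.normal(size=n)

# Marginally, X and Y are dependent ...
print(np.corrcoef(x, y)[0, 1])     # about 0.5

# ... but conditioning on Z removes the dependence: correlate the
# residuals of X and Y after regressing each on Z.
rx = x - z * np.cov(x, z)[0, 1] / np.var(z)
ry = y - z * np.cov(y, z)[0, 1] / np.var(z)
print(np.corrcoef(rx, ry)[0, 1])   # close to 0
```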
<p>An in principle straightforward way to break this uncertainty is to conduct an experiment: we could, for example, force the citizens of Austria to consume more chocolate, and study whether this increases the number of Nobel laureates in the following years. Such experiments are clearly unfeasible, but even in less extreme settings it is frequently unethical, impractical, or impossible — think of smoking and lung cancer — to study an association experimentally.</p>
<p>Causal inference provides us with tools that license causal statements even in the absence of a true experiment. This comes with strong assumptions. In the next section, we discuss the “causal hierarchy”.</p>
<h1 id="the-causal-hierarchy">The Causal Hierarchy</h1>
<p>Pearl (2019a) introduces a causal hierarchy with three levels — association, intervention, and counterfactuals — as well as three prototypical actions corresponding to each level — <em>seeing</em>, <em>doing</em>, and <em>imagining</em>. In the remainder of this blog post, we will tackle each level in turn.</p>
<h1 id="seeing">Seeing</h1>
<p>Association is on the most basic level; it makes us see that two or more things are somehow related. Importantly, we need to distinguish between <em>marginal</em> associations and <em>conditional</em> associations. The latter are the key building block of causal inference. The figure below illustrates these two concepts.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>If we look at the whole, aggregated data on the left we see that the continuous variables $X$ and $Y$ are positively correlated: an increase in values for $X$ co-occurs with an increase in values for $Y$. This relation describes the <em>marginal</em> association of $X$ and $Y$ because we do not care whether $Z = 0$ or $Z = 1$. On the other hand, if we condition on the binary variable $Z$, we find that there is no relation: $X \perp Y \mid Z$. (For more on marginal and conditional associations in the case of Gaussian distributions, see <a href="https://fabiandablander.com/statistics/Two-Properties.html">this</a> blog post). In the next section, we discuss a powerful tool that allows us to visualize such dependencies.</p>
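<p>A small simulation in the spirit of the figure (the group shift of $2$ is an arbitrary choice) makes the distinction concrete: pooling the two groups induces a correlation between $X$ and $Y$, while within each level of $Z$ the correlation is essentially zero.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# A binary Z shifts both X and Y; within each group they are independent.
z = rng.integers(0, 2, size=n)
x = 2 * z + rng.normal(size=n)
y = 2 * z + rng.normal(size=n)

# Marginal association: pooling both groups induces a correlation.
print(np.corrcoef(x, y)[0, 1])                    # clearly positive

# Conditional association: within each group the correlation vanishes.
for k in (0, 1):
    mask = z == k
    print(k, np.corrcoef(x[mask], y[mask])[0, 1])  # both close to 0
```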
<h2 id="directed-acyclic-graphs">Directed Acyclic Graphs</h2>
<p>We can visualize the statistical dependencies between the three variables using a graph. A graph is a mathematical object that consists of nodes and edges. In the case of <em>Directed Acyclic Graphs</em> (DAGs), these edges are directed. We take our variables ($X$, $Y$, $Z$) to be nodes in such a DAG and draw (or omit) edges between them so that the conditional (in)dependence structure in the data is reflected in the graph. For now, let’s focus on the relationship between the three variables. We have seen that $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. It turns out that we can draw <em>three</em> DAGs that encode this fact; these are the first three DAGs in the figure below. In these graphs, $X$ and $Y$ are dependent through $Z$, and conditioning on $Z$ <em>blocks</em> the path between them. We state this more formally shortly.</p>
<center>
<img src="../assets/img/Seeing-II.png" align="center" style="padding: 0px 0px 0px 0px;" width="750" height="375" />
</center>
<p>While it is natural to interpret the arrows causally, we do not do so here. For now, the arrows are merely tools that help us describe associations between variables.</p>
<p>The figure above also shows a fourth DAG, which encodes a different set of conditional (in)dependence relations between $X$, $Y$, and $Z$. The figure below illustrates this: looking at the aggregated data we do not find a relation between $X$ and $Y$ — they are <em>marginally independent</em> — but we do find one when looking at the disaggregated data — $X$ and $Y$ are <em>conditionally dependent</em> given $Z$.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>A real-world example might help build intuition: in the general population (pooling people who are single and people who are in a relationship), being attractive ($X$) and being intelligent ($Y$) are two independent traits. This is what we see in the left panel in the figure above. Let’s make the reasonable assumption that both being attractive and being intelligent are positively related with being in a relationship. What does this imply? First, it implies that, on average, single people are less attractive and less intelligent (see red data points). Second, and perhaps counter-intuitively, it implies that in the population of single people (and people in a relationship, respectively), being attractive and being intelligent are <em>negatively correlated</em>. After all, if the handsome person you met at the bar were also intelligent, then he would most likely be in a relationship!</p>
<p>In this example, visualized in the fourth DAG, $Z$ is commonly called a <em>collider</em>. Suppose we want to estimate the association between $X$ and $Y$ in the whole population. Conditioning on a collider (for example, by only analyzing data from people who are not in a relationship) while computing the association between $X$ and $Y$ will lead to a different estimate, and the induced bias is known as <em>collider bias</em>. It is a serious issue not only in dating, but also for example in medicine.</p>
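<p>The dating example is easy to reproduce in a short simulation (the selection rule below is a made-up illustration): attractiveness and intelligence are generated independently, yet among singles they come out negatively correlated.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Attractiveness (x) and intelligence (y) are independent in the population.
x = rng.normal(size=n)
y = rng.normal(size=n)

# Being in a relationship (the collider) depends positively on both traits.
in_relationship = (x + y + rng.normal(size=n)) > 1

print(np.corrcoef(x, y)[0, 1])   # close to 0 in the full population

# Conditioning on the collider, here by looking only at singles,
# induces a spurious negative correlation.
singles = ~in_relationship
print(np.corrcoef(x[singles], y[singles])[0, 1])   # clearly negative
```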
<p>The simple graphs shown above are the building blocks of more complicated graphs. In the next section, we describe a tool that can help us find (conditional) independencies between sets of variables.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h2 id="d-separation">$d$-separation</h2>
<p>For large graphs, it is not obvious how to conclude that two nodes are (conditionally) independent. <em>d</em>-separation is a tool that allows us to check this algorithmically. We need to define some concepts:</p>
<ul>
<li>A <em>path</em> from $X$ to $Y$ is a sequence of nodes and edges such that the start and end nodes are $X$ and $Y$, respectively.</li>
<li>A conditioning set $\mathcal{L}$ is the set of nodes we condition on (it can be empty).</li>
<li>A non-collider along a path blocks that path if and only if it is in $\mathcal{L}$.</li>
<li>A collider along a path blocks that path. However, conditioning on a collider (or any of its descendants) unblocks that path.</li>
</ul>
<p>With these definitions out of the way, we call two nodes $X$ and $Y$ $d$-separated by $\mathcal{L}$ if conditioning on all members in $\mathcal{L}$ blocks all paths between the two nodes.</p>
<p>If this is your first encounter with $d$-separation, then this is a lot to wrap your head around. To get some practice, look at the graph on the left side. First, note that there are no <em>marginal</em> independencies; this means that, without conditioning on any nodes, any two nodes are connected by an unblocked path. For example, there is a path going from $X$ to $Y$ through $Z$, and there is a path from $V$ to $U$ going through $Y$ and $W$.</p>
<div style="float: left;">
<center>
<img src="../assets/img/Large-DAG.png" align="center" style="margin-right: 10px;" width="350" height="125" />
</center>
</div>
<p>However, there are a number of <em>conditional</em> independencies. For example, $X$ and $Y$ are conditionally independent given $Z$. Why? There are two paths from $X$ to $Y$: one through $Z$ and one through $W$. However, since $W$ is a collider on the path from $X$ to $Y$, the path is already blocked. The only unblocked path from $X$ to $Y$ is through $Z$, and conditioning on it therefore blocks all remaining open paths. Additionally conditioning on $W$ would unblock one path, and $X$ and $Y$ would again be associated.</p>
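<p>The reasoning above can be automated. A common way to check $d$-separation is the ancestral-graph criterion: restrict the DAG to the ancestors of the nodes involved, “marry” parents of a common child, drop edge directions, delete the conditioning set, and test whether the two nodes are still connected. The sketch below implements this; the edge directions for $U$ and $V$ are assumptions, since the text only fixes the paths $X \rightarrow Z \rightarrow Y$, $X \rightarrow W \leftarrow Y$, and a path from $V$ to $U$ through $Y$ and $W$.</p>

```python
from collections import defaultdict

def d_separated(parents, x, y, given=frozenset()):
    """Check whether x and y are d-separated by `given` in a DAG.

    `parents` maps each node to the set of its parents. Uses the
    ancestral-graph + moralization criterion.
    """
    # 1. Keep only the ancestors of x, y, and the conditioning set.
    relevant, stack = set(), [x, y, *given]
    while stack:
        n = stack.pop()
        if n not in relevant:
            relevant.add(n)
            stack.extend(parents.get(n, ()))
    # 2. Moralize: undirect all edges and marry parents of a common child.
    adj = defaultdict(set)
    for child in relevant:
        ps = [p for p in parents.get(child, ()) if p in relevant]
        for p in ps:
            adj[p].add(child)
            adj[child].add(p)
        for i, p in enumerate(ps):
            for q in ps[i + 1:]:
                adj[p].add(q)
                adj[q].add(p)
    # 3. Delete the conditioning set and test whether x still reaches y.
    seen, stack = set(given), [x]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n])
    return y not in seen

# The example graph, written as child -> set of parents.
dag = {"Z": {"X"}, "Y": {"Z"}, "W": {"X", "Y"}, "U": {"W"}, "V": {"Y"}}
print(d_separated(dag, "X", "Y", {"Z"}))       # True: Z blocks, W already blocks
print(d_separated(dag, "X", "Y", {"Z", "W"}))  # False: conditioning on W unblocks
```

<p>On this graph the function confirms the claims above: conditioning on $Z$ $d$-separates $X$ and $Y$, while additionally conditioning on the collider $W$ reconnects them.</p>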
<p>So far, we have implicitly assumed that conditional (in)dependencies in the graph correspond to conditional (in)dependencies between variables. We make this assumption explicit now. In particular, note that <em>d</em>-separation provides us with an independence model $\perp_{\mathcal{G}}$ defined on graphs. To connect this to our standard probabilistic independence model $\perp_{\mathcal{P}}$ defined on random variables, we assume the following <em>Markov property</em>:</p>
<script type="math/tex; mode=display">X \perp_{\mathcal{G}} Y \mid Z \implies X \perp_{\mathcal{P}} Y \mid Z \enspace .</script>
<p>In words, we assume that if the nodes $X$ and $Y$ are <em>d</em>-separated by $Z$ in the graph $\mathcal{G}$, the corresponding random variables $X$ and $Y$ are conditionally independent given $Z$. This implies that all conditional independencies in the data are represented in the graph.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> Moreover, the statement above implies (and is implied by) the following factorization:</p>
<script type="math/tex; mode=display">p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,</script>
<p>where $\text{pa}^{\mathcal{G}}(X_i)$ denotes the parents of the node $X_i$ in graph $\mathcal{G}$ (see Peters, Janzing, & Schölkopf, p. 101). A node is a parent of another node if it has an outgoing arrow to that node; for example, $X$ is a parent of $Z$ and $W$ in the graph above. The above factorization implies that a node $X$ is independent of its non-descendants given its parents.</p>
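<p>The factorization is easy to verify numerically on the chain $X \rightarrow Z \rightarrow Y$ with hypothetical conditional probability tables: building the joint as $p(x) \, p(z \mid x) \, p(y \mid z)$ automatically makes $X$ and $Y$ conditionally independent given $Z$.</p>

```python
import itertools

# A discrete chain X -> Z -> Y with made-up conditional tables.
p_x = {0: 0.3, 1: 0.7}
p_z_given_x = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}    # p_z_given_x[x][z]
p_y_given_z = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.25, 1: 0.75}}  # p_y_given_z[z][y]

# The joint factorizes according to the DAG: p(x, z, y) = p(x) p(z|x) p(y|z).
joint = {(x, z, y): p_x[x] * p_z_given_x[x][z] * p_y_given_z[z][y]
         for x, z, y in itertools.product((0, 1), repeat=3)}

# The factorization implies X ⊥ Y | Z: check p(x, y | z) = p(x | z) p(y | z).
for z in (0, 1):
    p_z = sum(v for (x_, z_, y_), v in joint.items() if z_ == z)
    for x, y in itertools.product((0, 1), repeat=2):
        p_xy_z = joint[(x, z, y)] / p_z
        p_x_z = sum(joint[(x, z, yy)] for yy in (0, 1)) / p_z
        p_y_z = sum(joint[(xx, z, y)] for xx in (0, 1)) / p_z
        assert abs(p_xy_z - p_x_z * p_y_z) < 1e-12
print("X and Y are conditionally independent given Z")
```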
<p><em>d</em>-separation is an extremely powerful tool. Until now, however, we have only looked at DAGs to visualize (conditional) independencies. In the next section, we go beyond <em>seeing</em> to <em>doing</em>.</p>
<h1 id="doing">Doing</h1>
<p>We do not merely want to see the world, but also change it. From this section on, we are willing to interpret DAGs causally. As Dawid (2009a) warns, this is a serious step. In merely describing conditional independencies — <em>seeing</em> — the arrows in the DAG played a somewhat minor role, being nothing but “incidental construction features supporting the $d$-separation semantics” (Dawid, 2009a, p. 66). In this section, we endow the DAG with a causal meaning and interpret the arrows as denoting <em>direct causal effects</em>.</p>
<p>What is a causal effect? Following Pearl and others, we take an <em>interventionist</em> position and say that a variable $X$ has a causal influence on $Y$ if changing $X$ leads to changes in $Y$. This position is a very useful one in practice, but not everybody agrees with it (e.g., Cartwright, 2007).</p>
<p>The figure below shows the observational DAGs from above (top row) as well as the manipulated DAGs (bottom row) where we have intervened on the variable $X$, that is, set the value of the random variable $X$ to a constant $x$. Note that setting the value of $X$ cuts all incoming causal arrows since its value is thereby determined only by the intervention, not by any other factors.</p>
<center>
<img src="../assets/img/Seeing-vs-Doing-II.png" align="center" style="padding: 00px 00px 00px 00px;" width="750" height="500" />
</center>
<p>As is easily verified with $d$-separation, the first three graphs in the top row encode the same conditional independence structure. This implies that we cannot distinguish them using only observational data. Interpreting the edges causally, however, we see that the DAGs have a starkly different interpretation. The bottom row makes this apparent by showing the result of an intervention on $X$. In the leftmost causal DAG, $Z$ is on the causal path from $X$ to $Y$, and intervening on $X$ therefore influences $Y$ through $Z$. In the DAG next to it, $Z$ is on the causal path from $Y$ to $X$, and so intervening on $X$ does not influence $Y$. In the third DAG, $Z$ is a common cause and — since there is no other path from $X$ to $Y$ — intervening on $X$ does not influence $Y$. For the collider structure in the rightmost DAG, intervening on $X$ does not influence $Y$ because there is no unblocked path from $X$ to $Y$.</p>
<p>To make the distinction between seeing and doing, Pearl introduced the <em>do</em>-operator. While $p(Y \mid X = x)$ denotes the <em>observational</em> distribution, which corresponds to the process of seeing, $p(Y \mid do(X = x))$ corresponds to the <em>interventional</em> distribution, which corresponds to the process of doing. The former describes what values $Y$ would likely take on when $X$ <em>happened to be</em> $x$, while the latter describes what values $Y$ would likely take on when $X$ <em>would be set to</em> $x$.</p>
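<p>The gap between these two distributions is easy to demonstrate by simulation. The following sketch (my own illustration, not part of the original text) uses a fork $X \leftarrow Z \rightarrow Y$ with binary variables: conditioning on $X = 1$ shifts the distribution of $Y$, while intervening on $X$ leaves it untouched.</p>

```r
set.seed(1)
n <- 1e5

# Fork: Z is a common cause of X and Y; there is no arrow X -> Y
z <- rbinom(n, 1, 0.5)
x <- rbinom(n, 1, ifelse(z == 1, 0.9, 0.1))
y <- rbinom(n, 1, ifelse(z == 1, 0.9, 0.1))

# Seeing: observing X = 1 makes Z = 1 likely, which makes Y = 1 likely
p_see <- mean(y[x == 1])  # roughly 0.82

# Doing: do(X = 1) cuts the arrow Z -> X, so Z and Y are unaffected
# and the interventional distribution of Y is just its marginal
p_do <- mean(y)           # roughly 0.50
```

<p>Here $p(Y = 1 \mid X = 1) \approx 0.82$ while $p(Y = 1 \mid do(X = 1)) = p(Y = 1) = 0.5$: seeing and doing come apart precisely because the fork induces association without causation.</p>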
<h2 id="computing-causal-effects">Computing causal effects</h2>
<p>$P(Y \mid do(X = x))$ describes the causal effect of $X$ on $Y$, but how do we compute it? Actually <em>doing</em> the intervention might be unfeasible or unethical — side-stepping actual interventions and still getting at causal effects is the whole point of this approach to causal inference. We want to learn causal effects from observational data, and so all we have is the observational DAG. The causal quantity, however, is defined on the manipulated DAG. We need to build a bridge between the observational DAG and the manipulated DAG, and we do this by making two assumptions.</p>
<p>First, we assume that <em>interventions are local</em>. This means that if I set $X = x$, then this only influences the variable $X$, with no other direct influence on any other variable. Of course, intervening on $X$ will influence other variables, but only through $X$, not directly through us intervening. In colloquial terms, we do not have a “fat hand”, but act like a surgeon precisely targeting only a very specific part of the DAG; we say that the DAG is composed of <em>modular</em> parts. We can encode this using the factorization property above:</p>
<script type="math/tex; mode=display">p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,</script>
<p>which we now interpret causally. The factors in the product are sometimes called <em>causal Markov kernels</em>; they constitute the modular parts of the system.</p>
<p>Second, we assume that the mechanisms by which variables interact do not change through interventions; that is, the mechanism by which a cause brings about its effects does not depend on whether this occurs naturally or by intervention (see e.g., Pearl, Glymour, & Jewell, 2016, p. 56).</p>
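<p>Together, these two assumptions yield what is known as the <em>truncated factorization</em> — a standard result, stated here for completeness (see Pearl, Glymour, & Jewell, 2016): intervening on $X_j$ simply deletes its causal Markov kernel from the product, leaving all other factors intact. For an intervention $do(X_j = x_j')$:</p>
<script type="math/tex; mode=display">p(X_1, \ldots, X_n \mid do(X_j = x_j')) = \prod_{i \neq j} p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,</script>
<p>evaluated at $X_j = x_j'$ (and zero for any other value of $X_j$).</p>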
<p>With these two assumptions in hand, further note that $p(Y \mid do(X = x))$ can be understood as the <em>observational</em> distribution in the manipulated DAG — $p_m(Y \mid X = x)$ — that is, the DAG where we set $X = x$. This is because after <em>doing</em> the intervention (which catapults us into the manipulated DAG), all that is left for us to do is to <em>see</em> its effect. Observe that the leftmost and rightmost DAG above remain the same under intervention on $X$, and so the interventional distribution $p(Y \mid do(X = x))$ is just the conditional distribution $p(Y \mid X = x)$. The middle DAGs require a bit more work:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(Y = y \mid do(X = x)) &= p_{m}(Y = y \mid X = x) \\[.5em]
&= \sum_{z} p_{m}(Y = y, Z = z \mid X = x) \\[.5em]
&= \sum_{z} p_{m}(Y = y \mid X = x, Z = z) \, p_m(Z = z) \\[.5em]
&= \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \enspace .
\end{aligned} %]]></script>
<p>The first equality follows by definition. The second and third equalities follow from the <em>sum</em> and <em>product</em> rule of probability. The last line follows from the assumption that the mechanism through which $X$ influences $Y$ is independent of whether we set $X$ or whether $X$ naturally occurs, that is, $p_{m}(Y = y \mid X = x, Z = z) = p(Y = y \mid X = x, Z = z)$, and the assumption that interventions are local, that is, $p_m(Z = z) = p(Z = z)$. Thus, the interventional distribution we care about is equal to the conditional distribution of $Y$ given $X$ when we adjust for $Z$. Graphically speaking, this blocks the path $X \leftarrow Z \leftarrow Y$ in the left middle graph and the path $X \leftarrow Z \rightarrow Y$ in the right middle graph. If there were a path $X \rightarrow Y$ in these two latter graphs, and if we did not adjust for $Z$, then the causal effect of $X$ on $Y$ would be <em>confounded</em>. For these simple DAGs, however, it is already clear from the fact that $X$ is independent of $Y$ given $Z$ that $X$ cannot have a causal effect on $Y$. In the next section, we study a more complicated graph and look at confounding more closely.</p>
<h2 id="confounding">Confounding</h2>
<p>Confounding has been given various definitions over the decades, but usually denotes the situation where a (possibly unobserved) common cause obscures the causal relationship between two or more variables. Here, we are slightly more precise and call a causal effect of $X$ on $Y$ confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$, which also implies that collider bias is a type of confounding. This occurred in the middle two DAGs in the example above, as well as in the chocolate consumption and Nobel Laureates example at the beginning of the blog post. Confounding is the bane of observational data analysis. Helpfully, causal DAGs provide us with a tool to describe multivariate relations between variables. Once we have stated our assumptions clearly, the <em>do</em>-calculus further provides us with a means to know which variables we need to adjust for so that causal effects are unconfounded.</p>
<p>We follow Pearl, Glymour, & Jewell (2016, p. 61) and define the <em>backdoor criterion</em>:</p>
<blockquote>
<p>Given two nodes $X$ and $Y$, an adjustment set $\mathcal{L}$ fulfills the backdoor criterion if no member in $\mathcal{L}$ is a descendant of $X$ and members in $\mathcal{L}$ block all paths between $X$ and $Y$. Adjusting for $\mathcal{L}$ thus yields the causal effect of $X \rightarrow Y$.</p>
</blockquote>
<p>The key observation is that this (a) blocks all spurious, that is, non-causal paths between $X$ and $Y$, (b) leaves all directed paths from $X$ to $Y$ unblocked, and (c) creates no spurious paths.</p>
<div style="float: left;">
<center>
<img src="../assets/img/Large-DAG.png" align="center" style="margin-right: 10px;" width="350" height="125" />
</center>
</div>
<p>To see this in action, let’s again look at the DAG on the left. The causal effect of $Z$ on $U$ is confounded by $X$, because in addition to the legitimate causal path $Z \rightarrow Y \rightarrow W \rightarrow U$, there is also an unblocked path $Z \leftarrow X \rightarrow W \rightarrow U$ which confounds the causal effect. The backdoor criterion would have us condition on $X$, which blocks the spurious path and renders the causal effect of $Z$ on $U$ unconfounded. Note that conditioning on $W$ would also block this spurious path; however, it would also block the causal path $Z \rightarrow Y \rightarrow W \rightarrow U$.</p>
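<p>This can also be checked numerically. The linear simulation below is my own sketch of the DAG on the left, with every edge coefficient set to $1$, so that the true causal effect of $Z$ on $U$ along $Z \rightarrow Y \rightarrow W \rightarrow U$ equals $1$. Regressing $U$ on $Z$ alone is biased by the open backdoor path, adjusting for $X$ recovers the effect, and adjusting for $W$ destroys it.</p>

```r
set.seed(1)
n <- 1e5

# DAG: X -> Z, X -> W, Z -> Y, Y -> W, W -> U; all edge coefficients are 1
x <- rnorm(n)
z <- x + rnorm(n)
y <- z + rnorm(n)
w <- x + y + rnorm(n)
u <- w + rnorm(n)

coef(lm(u ~ z))['z']      # biased (about 1.5): backdoor Z <- X -> W -> U is open
coef(lm(u ~ z + x))['z']  # backdoor adjustment: close to the true effect of 1
coef(lm(u ~ z + w))['z']  # adjusting for W blocks the causal path: close to 0
```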
<p>Before moving on, let’s catch a quick breath. We have already discussed a number of very important concepts. At the lowest level of the causal hierarchy — association — we have discovered DAGs and $d$-separation as a powerful tool to reason about conditional (in)dependencies between variables. Moving to intervention, the second level of the causal hierarchy, we have satisfied our need to interpret the arrows in a DAG causally. Doing so required strong assumptions, but it allowed us to go beyond <em>seeing</em> and model the outcome of interventions. This hopefully clarified the notion of confounding. In particular, collider bias is a type of confounding, which has important practical implications: we should not blindly enter all variables into a regression in order to “control” for them, but think carefully about what the underlying causal DAG could look like. Otherwise, we might induce spurious associations.</p>
<p>The concepts from causal inference can help us understand methodological phenomena that have been discussed for decades. In the next section, we apply the concepts we have seen so far to make sense of one such phenomenon: <em>Simpson’s Paradox</em>.</p>
<h1 id="example-application-simpsons-paradox">Example Application: Simpson’s Paradox</h1>
<p>This section follows the example given in Pearl, Glymour, & Jewell (2016, Ch. 1) with slight modifications. Suppose you observe $N = 700$ patients who either <em>choose</em> to take a drug or not; note that this is not a randomized control trial. The table below shows the number of recovered patients split across sex.</p>
<center>
<img src="../assets/img/Simpsons-Data-I.png" align="center" style="margin-top: 20px; margin-bottom: 20px;" width="600" height="400" />
</center>
<p>We observe that more men as well as more women recover when taking the drug (93% and 73%) compared to when not taking the drug (87% and 69%). And yet, when taken together, <em>fewer</em> patients who took the drug recovered (78%) compared to patients who did not take the drug (83%). This is puzzling — should a doctor prescribe the drug or not?</p>
<p>To answer this question, we need to compute the causal effect that taking the drug has on the probability of recovery. As a first step, we draw the causal DAG. Suppose we know that women are more likely to take the drug, that being a woman has an effect on recovery more generally, and that the drug has an effect on recovery. Moreover, we know that the <em>treatment cannot cause sex</em>. This is a trivial yet crucial observation — it is impossible to express this in purely statistical language. Causal DAGs provide us with a tool to make such an assumption explicit; the graph below makes explicit that sex ($S$) is a common cause of both drug taking ($D$) and recovery ($R$). We denote $S = 1$ as being female, $D = 1$ as having chosen the drug, and $R = 1$ as having recovered. The left DAG is observational while the right DAG indicates the intervention $do(D = d)$, that is, forcing every patient to either take the drug ($d = 1$) or to not take the drug ($d = 0$).</p>
<center>
<img src="../assets/img/Simpsons-DAG-I.png" align="center" style="margin-right: 10px;" width="600" height="300" />
</center>
<p>We are interested in the probability of recovery if we were to force everybody to take, or not take, the drug; we call the difference between these two probabilities the <em>average causal effect</em>. This is key: the <em>do</em>-operator is about populations, not individuals. Using it, we cannot make statements that pertain to the recovery of an individual patient; we can only refer to the probability of recovery as defined on populations of patients. We will discuss <em>individual causal effects</em> in the section on counterfactuals at the end of the blog post.</p>
<p>Computing the average causal effect requires knowledge about the interventional distributions $p(R \mid do(D = 0))$ and $p(R \mid do(D = 1))$. As discussed above, these correspond to the conditional distribution in the manipulated DAG which is shown above on the right. The backdoor criterion tells us that the conditional distribution in the observational DAG will correspond to the interventional distribution when blocking the spurious path $D \leftarrow S \rightarrow R$. Using the adjustment formula we have derived above, we expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(R = 1 \mid do(D = 1)) &= \sum_{s} p(R = 1\mid D = 1, S = s) \, p(S = s) \\[.5em]
&= p(R = 1\mid D = 1, S = 0) \, p(S = 0) + p(R = 1\mid D = 1, S = 1) \, p(S = 1) \\[.5em]
&= \frac{81}{87} \times \frac{87 + 270}{700} + \frac{192}{263} \times \frac{263 + 80}{700} \\[.5em]
&\approx 0.832 \enspace .
\end{aligned} %]]></script>
<p>In words, we first compute the benefit of taking the drug separately for men and women, and then average the result, weighting by the fraction of men and women in the population. This tells us that, if we force everybody to take the drug, about $83\%$ of people will recover. We can similarly compute the probability of recovery given that we force all patients to not take the drug:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(R = 1\mid do(D = 0)) &= \sum_{s} p(R = 1\mid D = 0, S = s) \, p(S = s) \\[.5em]
&= p(R = 1\mid D = 0, S = 0) \, p(S = 0) + p(R = 1\mid D = 0, S = 1) \, p(S = 1) \\[.5em]
&= \frac{234}{270} \times \frac{87 + 270}{700} + \frac{55}{80} \times \frac{263 + 80}{700} \\[.5em]
&\approx 0.779 \enspace .
\end{aligned} %]]></script>
<p>Therefore, taking the drug does indeed have a positive effect on recovery on average, and the doctor should prescribe the drug.</p>
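<p>These computations are easy to reproduce in R. The sketch below hard-codes the cell counts implied by the recovery rates above ($81/87$ and $192/263$ for men and women who chose the drug; $234/270$ and $55/80$ for those who did not):</p>

```r
N <- 700

# p(S = s): fraction of men (S = 0) and women (S = 1) in the sample
p_men   <- (87 + 270) / N
p_women <- (263 + 80) / N

# Adjustment formula: average the sex-specific recovery rates over p(S)
p_do1 <- (81 / 87)   * p_men + (192 / 263) * p_women  # p(R = 1 | do(D = 1))
p_do0 <- (234 / 270) * p_men + (55 / 80)   * p_women  # p(R = 1 | do(D = 0))

c(p_do1, p_do0)  # roughly 0.832 and 0.779
```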
<p>Note that this conclusion heavily depended on the causal graph. While graphs are wonderful tools in that they make our assumptions explicit, these assumptions are — of course — not at all guaranteed to be correct. These assumptions are strong, stating that the graph must encode all causal relations between variables, and that there is no unmeasured confounding, something we can never guarantee in observational data.</p>
<p>Let’s look at a different example but with the exact same data. In particular, instead of the variable sex we look at the <em>post-treatment</em> variable blood pressure. This means we have measured blood pressure after the patients have taken the drug. Should a doctor prescribe the drug or not?</p>
<center>
<img src="../assets/img/Simpsons-Data-II.png" align="center" style="margin-top: 20px; margin-bottom: 20px;" width="700" height="550" />
</center>
<p>Since blood pressure is a post-treatment variable, it cannot influence a patient’s decision to take the drug or not. We draw the following causal DAG, which makes clear that the drug has an indirect effect on recovery through blood pressure, in addition to having a direct causal effect.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
<center>
<img src="../assets/img/Simpsons-DAG-II.png" align="center" style="margin-right: 10px;" width="600" height="300" />
</center>
<p>From this DAG, we find that the causal effect of $D$ on $R$ is unconfounded. Therefore, the two causal quantities of interest are given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(R = 1 \mid do(D = 1)) &= p(R = 1 \mid D = 1) = 0.78 \\[.5em]
p(R = 1 \mid do(D = 0)) &= p(R = 1 \mid D = 0) = 0.83 \enspace .
\end{aligned} %]]></script>
<p>This means that the drug is indeed harmful. In the general population (combined data), the drug has a negative effect. Suppose that the drug has a direct positive effect on recovery, but an indirect negative effect through blood pressure. If we only look at patients with a particular blood pressure, then only the drug’s direct positive effect on recovery remains. However, since the drug does influence recovery negatively through blood pressure, it would be misleading to take the association between $D$ and $R$ conditional on blood pressure as our estimate for the causal effect. In contrast to the previous example, using the aggregate data is the correct way to analyze these data in order to estimate the average causal effect.</p>
<p>So far, our treatment has been entirely model-agnostic. In the next section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. This will unify the previous two levels of the causal hierarchy — <em>seeing</em> and <em>doing</em> — as well as open up the third and final level: counterfactuals.</p>
<h1 id="structural-causal-models">Structural Causal Models</h1>
<p>In this section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. SCMs relate causal and probabilistic statements. As an example, we specify:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
X &:= \epsilon_X \\[.5em]
Y &:= f(X, \epsilon_Y) \enspace .
\end{aligned} %]]></script>
<p>$X$ is a direct cause of $Y$, which it influences through the function $f(\cdot)$; the noise variables $\epsilon_X$ and $\epsilon_Y$ are assumed to be independent. In an SCM, we take each equation to be a causal statement, and we stress this by using the assignment symbol $:=$ instead of the equality sign $=$. Note that this is in stark contrast to standard regression models; here, we explicitly state our causal assumptions.</p>
<p>As we will see below, Structural Causal Models imply observational distributions (<em>seeing</em>), interventional distributions (<em>doing</em>), as well as counterfactuals (<em>imagining</em>). Thus, they can be seen as the fundamental building block of this approach to causal inference. In the following, we restrict the class of Structural Causal Models by allowing only linear relationships between variables and assuming independent Gaussian error terms.<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> As an example, take the following SCM (Peters, Janzing, & Schölkopf, 2017, p. 90):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
X &:= \epsilon_X \\[.5em]
Y &:= X + \epsilon_Y \\[.5em]
Z &:= Y + \epsilon_Z \enspace ,
\end{aligned} %]]></script>
<p>where $\epsilon_X, \epsilon_Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1)$ and $\epsilon_Z \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 0.1)$. Again, each line explicates the causal link between variables. For example, we assume that $X$ has a direct causal effect on $Y$, that this effect is linear, and that it is obscured by independent Gaussian noise.</p>
<p>The assumption of Gaussian errors induces a multivariate Gaussian distribution on $(X, Y, Z)$ whose independence structure is visualized in the leftmost DAG below. The middle DAG shows an intervention on $Z$, while the rightmost DAG shows an intervention on $X$. Recall that, as discussed above, intervening on a variable cuts all incoming arrows.</p>
<center>
<img src="../assets/img/Prediction-vs-Intervention.png" align="center" style="margin-right: 10px;" width="700" height="400" />
</center>
<p>At the first level of the causal hierarchy — association — we might ask ourselves: does $X$ or $Z$ predict $Y$ better? To illustrate the answer for our example, we simulate $n = 1000$ observations from the Structural Causal Model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre></figure>
<p>The figure below shows that $Y$ has a much stronger association with $Z$ than with $X$; this is because the standard deviation of the error $\epsilon_Z$ is only a tenth of the standard deviation of the error $\epsilon_X$. For prediction, therefore, $Z$ is the more relevant variable.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
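<p>We can make this precise. From the SCM, $\text{Var}(Y) = 2$ and $\text{Cov}(Y, Z) = \text{Var}(Y) = 2$, so $\text{Cor}(X, Y) = 1/\sqrt{2} \approx 0.71$ while $\text{Cor}(Y, Z) = 2/\sqrt{2 \times 2.01} \approx 0.997$. A quick self-contained check, re-simulating the same SCM:</p>

```r
set.seed(1)
n <- 1e5

# Same SCM as above: X -> Y -> Z with error sds 1, 1, and 0.1
x <- rnorm(n, 0, 1)
y <- x + rnorm(n, 0, 1)
z <- y + rnorm(n, 0, 0.1)

c(cor(x, y), cor(y, z))  # roughly 0.71 and 0.997
```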
<p>But does $Z$ actually have a causal effect on $Y$? This is a question about intervention, which is squarely located at the second level of the causal hierarchy. With the knowledge of the underlying Structural Causal Model, we can easily simulate interventions in R and visualize their outcomes:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Simulate data from the SCM where do(Z = z)</span><span class="w">
</span><span class="n">intervene_z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Simulate data from the SCM where do(X = x)</span><span class="w">
</span><span class="n">intervene_x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">datz</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">intervene_z</span><span class="p">(</span><span class="n">z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">datx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">intervene_x</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y)'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">datz</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y | do(Z = 2))'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">datx</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y | do(X = 2))'</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>The leftmost histogram above shows the marginal distribution of $Y$ when no intervention takes place. The histogram in the middle shows the marginal distribution of $Y$ in the manipulated DAG where we set $Z = 2$. Observe that, as indicated by the causal graph, $Z$ does not have a causal effect on $Y$, such that $p(Y \mid do(Z = 2)) = p(Y)$. The histogram on the right shows the marginal distribution of $Y$ in the manipulated DAG where we set $X = 2$.</p>
<p>Clearly, then, $X$ has a causal effect on $Y$. While we have touched on it already when discussing Simpson’s paradox, we now formally define the <em>Average Causal Effect</em>:</p>
<script type="math/tex; mode=display">\text{ACE}(X \rightarrow Y) = \mathbb{E}\left[Y \mid do(X = x + 1)\right] - \mathbb{E}\left[Y \mid do(X = x)\right] \enspace ,</script>
<p>which in our case equals one, as can also be seen from the Structural Causal Model. Thus, SCMs allow us to model the outcome of interventions.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup> However, note again that this is strictly about populations, not individuals. In the next section, we see how SCMs can allow us to climb up to the final level of the causal hierarchy, moving beyond the average to define individual causal effects.</p>
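<p>To see the definition in action, we can estimate $\text{ACE}(X \rightarrow Y)$ by brute-force simulation from the manipulated SCM. The snippet below is a self-contained sketch: under $do(X = x)$, the structural equation for $Y$ reduces to $Y := x + \epsilon_Y$.</p>

```r
set.seed(1)
n <- 1e5

# E[Y | do(X = x)] in the SCM Y := X + e_Y; compare x = 2 against x = 1
y_do_2 <- 2 + rnorm(n, 0, 1)  # draws from p(Y | do(X = 2))
y_do_1 <- 1 + rnorm(n, 0, 1)  # draws from p(Y | do(X = 1))

mean(y_do_2) - mean(y_do_1)   # close to the ACE of 1
```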
<h1 id="counterfactuals">Counterfactuals</h1>
<p>In the <em>Unbearable Lightness of Being</em>, Milan Kundera has Tomáš ask himself:</p>
<blockquote>
<p>“Was it better to be with Tereza or to remain alone?”</p>
</blockquote>
<p>To which he answers:</p>
<blockquote>
<p>“There is no means of testing which decision is better, because there is no basis for comparison. We live everything as it comes, without warning, like an actor going on cold. And what can life be worth if the first rehearsal for life is life itself?”</p>
</blockquote>
<p>Kundera is describing, as Holland (1986, p. 947) put it, the “fundamental problem of causal inference”, namely that we only ever observe one realization. If Tomáš chooses to stay with Tereza, then he cannot not choose to stay with Tereza. He cannot go back in time and revert his decision, living instead “everything as it comes, without warning”. This does not mean, however, that Tomáš cannot assess afterwards whether his choice has been wise. As a matter of fact, humans constantly evaluate mutually exclusive options, only one of which ever comes true; that is, humans reason <em>counterfactually</em>.</p>
<p>To do this formally requires strong assumptions. The <em>do</em>-operator, introduced above, is too weak to model counterfactuals. This is because it operates on distributions that are defined on populations, not on individuals. We can define an average causal effect using the <em>do</em>-operator, but — unsurprisingly — it only ever refers to averages. Structural Causal Models allow counterfactual reasoning on the level of the individual. To see this, we use a simple example.</p>
<p>Suppose we want to study the causal effect of grandma’s treatment for the common cold ($T$) on the speed of recovery ($R$). Usually, people recover from the common cold in <a href="https://en.wikipedia.org/wiki/Common_cold">seven to ten days</a>, but grandma swears she can do better with a simple intervention — we agree on doing an experiment. Assume we have the following SCM:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T &:= \epsilon_T \\[.5em]
R &:= \mu + \beta T + \epsilon \enspace ,
\end{aligned} %]]></script>
<p>where $\mu$ is an intercept, $\epsilon_T \sim \text{Bern}(0.50)$ indicates random assignment to either receive the treatment ($T = 1$) or not receive it ($T = 0$), and $\epsilon \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma)$. The SCM tells us that the direct causal effect of the treatment on how quickly patients recover from the common cold is $\beta$. This causal effect is obscured by individual error terms for each patient $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_N)$, which are aggregate terms for all the things left unmodelled (see <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">this</a> blog post for some history). In particular, $\epsilon_k$ summarizes all the things that have an effect on the speed of recovery for patient $k$.</p>
<p>Once we have collected the data, suppose we find that $\mu = 5$, $\beta = -2$, and $\sigma = 2$. This speaks in favour of grandma’s treatment, since it shortens the recovery time by 2 days on average:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ACE}(T \rightarrow R) &= \mathbb{E}\left[R \mid do(T = 1)\right] - \mathbb{E}\left[R \mid do(T = 0)\right] \\[.5em]
&= \mathbb{E}\left[\mu + \beta + \epsilon\right] - \mathbb{E}\left[\mu + \epsilon\right] \\[.5em]
&= \left(\mu + \beta\right) - \mu \\[.5em]
&= \beta \enspace .
\end{aligned} %]]></script>
<p>Given the value for $\epsilon_k$, the Structural Causal Model is fully determined, and we may write $R(\epsilon_k)$ for the speed of recovery for patient $k$. To make this example more concrete, we simulate some data in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="c1"># Structural Causal Model</span><span class="w">
</span><span class="n">e_T</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w">
</span><span class="n">e</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="nb">T</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">e_T</span><span class="w">
</span><span class="n">R</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="nb">T</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">e</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">,</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## T R e
## [1,] 0 5.7962118 0.7962118
## [2,] 0 3.7759472 -1.2240528
## [3,] 1 3.6822394 0.6822394
## [4,] 1 0.7412738 -2.2587262
## [5,] 0 7.8660474 2.8660474
## [6,] 1 6.9607998 3.9607998</code></pre></figure>
<p>We see that the first patient did not receive the treatment ($T = 0$), took about $R = 5.80$ days to recover from the common cold, and has a unique value $\epsilon_1 = 0.80$. Would this particular patient have recovered more quickly if we had given him grandma’s treatment even though we did not? We denote this quantity of interest as $R_{T = 1}(\epsilon_1)$ to contrast it with the actually observed $R_{T = 0}(\epsilon_1)$. To compute this seemingly otherworldly quantity, we simply plug the values $T = 1$ and $\epsilon_1 = 0.80$ into our Structural Causal Model, which yields:</p>
<script type="math/tex; mode=display">R_{T = 1}(\epsilon_1) = 5 - 2 + 0.80 = 3.80 \enspace .</script>
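<p>The same plug-in computation can be written out programmatically. The following is a minimal sketch in Python (the post’s examples are in R), taking the first observed row of the simulated data ($T = 0$, $R = 5.7962118$) and the parameter values $\mu = 5$ and $\beta = -2$ as given: first recover $\epsilon_1$ from the observation, then evaluate the SCM with $T$ set to $1$.</p>

```python
mu, beta = 5, -2             # parameter values from the simulation above
T_obs, R_obs = 0, 5.7962118  # first row of the simulated data

# Recover the patient-specific error term from the observed data
eps_1 = R_obs - (mu + beta * T_obs)

# Set T = 1 and re-evaluate the structural equation for R
R_cf = mu + beta * 1 + eps_1

print(round(eps_1, 2), round(R_cf, 2))  # 0.8 3.8
```

<p>Recovering the error term and plugging it back in is exactly what makes the counterfactual computable from the SCM.</p>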
<p>Using this, we can define the <em>individual causal effect</em> as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ICE}(T \rightarrow R) &= R_{T = 1}(\epsilon_1) - R_{T = 0}(\epsilon_1) \\[.5em]
&= 3.80 - 5.80 \\[.5em]
&= -2 \enspace ,
\end{aligned} %]]></script>
<p>which in this example is equal to the average causal effect due to the <a href="https://stats.stackexchange.com/a/385558">linearity of the underlying SCM</a> (Pearl, Glymour, & Jewell 2016, p. 106). In general, individual causal effects are not identified, and we have to resort to average causal effects.<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></p>
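<p>Because the SCM is linear, repeating this plug-in computation for every patient yields the same constant effect, which is why the individual causal effect coincides with the average causal effect here. A small sketch in Python, assuming the same parameter values as above and simulating fresh data (Python’s RNG will not reproduce R’s exact draws):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
mu, beta, sigma = 5, -2, 2  # assumed parameter values, as in the post

# Observational data generated from the SCM
T = rng.binomial(1, 0.5, n)
R = mu + beta * T + rng.normal(0, sigma, n)

# Recover every patient's error term from the observed data
eps = R - (mu + beta * T)

# Evaluate both counterfactuals per patient and take the difference
ice = (mu + beta * 1 + eps) - (mu + beta * 0 + eps)

print(np.unique(np.round(ice, 10)))  # every individual effect equals beta = -2
```
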
<p>Answering the question of whether a particular patient would have recovered more quickly had we given him the treatment, even though we did not, seems almost fantastical. It is a <em>cross-world</em> statement: given what we have observed, we ask about what would have been had things turned out differently. It may strike you as a bit eerie to speak about different worlds. Peters, Janzing, & Schölkopf (2017, p. 106) state that it is “debatable whether this additional [counterfactual] information [encoded in the SCM] is useful.” It certainly requires strong assumptions. More broadly, Dawid (2000) argues in favour of causal inference without counterfactuals, and he does not seem to have shifted his position in <a href="https://twitter.com/fdabl/status/1110944752571158528">recent years</a>. Yet if we want to design machines that can achieve human-level reasoning, we need to endow them with counterfactual thinking (Pearl, 2019a). Moreover, many concepts that are relevant in legal and ethical domains, such as fairness (Kusner et al., 2017), require counterfactuals.</p>
<p>Before we end, note that the graphical approach to causal inference outlined in this blog post is not the only game in town. The <em>potential outcome</em> framework for causal inference, developed by <a href="https://en.wikipedia.org/wiki/Rubin_causal_model">Donald Rubin</a> and others, avoids graphical models and takes counterfactual quantities as primary. However, although it starts from counterfactual statements defined at the individual level, it is my understanding that most work using potential outcomes focuses on <em>average causal effects</em>. As outlined above, these only require the second level of the causal hierarchy — <em>doing</em> — and are therefore much less contentious than <em>individual causal effects</em>, which sit at the top of the causal hierarchy.</p>
<p>The graphical approach outlined in this blog post and the potential outcome framework are logically equivalent (Peters, Janzing, & Schölkopf, 2017, p. 125), and although there is quite some debate surrounding the two approaches, it is probably wise to be pragmatic and simply choose the tool that works best for a particular application. As Lauritzen (2004, p. 189) put it, he sees the</p>
<blockquote>
<p>“different formalisms as different ‘languages’. The French language may be best for making love whereas the Italian may be more suitable for singing, but both are indeed possible, and I have no difficulty accepting that potential responses, structural equations, and graphical models coexist as languages expressing causal concepts each with their virtues and vices.<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup>”</p>
</blockquote>
<p>For further reading, I wholeheartedly recommend the textbooks by Pearl, Glymour, & Jewell (<a href="http://bayes.cs.ucla.edu/PRIMER/">2016</a>) as well as Peters, Janzing, & Schölkopf (<a href="https://mitpress.mit.edu/books/elements-causal-inference">2017</a>). For bedtime reading, I can recommend Pearl & McKenzie (<a href="https://www.goodreads.com/book/show/36204378-the-book-of-why">2018</a>). Miguel Hernán teaches an excellent introductory online course on causal diagrams <a href="https://www.edx.org/course/causal-diagrams-draw-your-assumptions-before-your">here</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have touched on several key concepts of causal inference. We have started with the puzzling observation that chocolate consumption and the number of Nobel Laureates are strongly positively related. At the lowest level of the causal ladder — association — we have seen how directed acyclic graphs can help us visualize conditional independencies, and how <em>d</em>-separation provides us with an algorithmic tool to check such independencies.</p>
<p>Moving up to the second level — intervention — we have seen how the <em>do</em>-operator models populations under interventions. This helped us define <em>confounding</em> — the bane of observational data analysis — as occurring when $p(Y \mid X = x) \neq p(Y \mid do(X = x))$. This comes with the important observation that entering all variables into a regression in order to “control” for them is misguided; rather, we need to think carefully about the underlying causal relations, lest we introduce bias by, for example, conditioning on a collider. The <em>backdoor criterion</em> provided us with a graphical way to assess whether an effect is confounded or not.</p>
<p>Finally, we have seen that Structural Causal Models (SCMs) provide the building block from which observational and interventional distributions follow. SCMs further imply counterfactual statements, which sit at the top of the causal hierarchy. These allow us to move beyond the <em>do</em>-operator and average causal effects: they enable us to answer questions about what would have been if things had been different.</p>
<hr />
<p><em>I would like to thank <a href="https://ryanoisin.github.io/">Oisín Ryan</a> and <a href="https://cruwell.com/">Sophia Crüwell</a> for very helpful comments on this blog.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Bollen, K. A., & Pearl, J. (<a href="https://link.springer.com/chapter/10.1007/978-94-007-6094-3_15">2013</a>). Eight myths about causality and structural equation models. In <em>Handbook of Causal Analysis for Social Research</em> (pp. 301-328). Springer, Dordrecht.</li>
<li>Cartwright, N. (2007). <em>Hunting Causes and Using them: Approaches in Philosophy and Economics</em>. Cambridge University Press.</li>
<li>Dawid, A. P. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2000.10474210">2000</a>). Causal inference without counterfactuals. <em>Journal of the American Statistical Association, 95</em>(450), 407-424.</li>
<li>Hernán, M.A., & Robins J.M. (<a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/">2020</a>). <em>Causal Inference: What If</em>. Boca Raton: Chapman & Hall/CRC.</li>
<li>Holland, P. W. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1986.10478354">1986</a>). Statistics and causal inference. <em>Journal of the American statistical Association, 81</em>(396), 945-960.</li>
<li>Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (<a href="https://papers.nips.cc/paper/6995-counterfactual-fairness">2017</a>). Counterfactual fairness. In <em>Advances in Neural Information Processing Systems</em> (pp. 4066-4076).</li>
<li>Lauritzen, S. L., Aalen, O. O., Rubin, D. B., & Arjas, E. (<a href="https://www.jstor.org/stable/4616823?casa_token=aseDj2RNgjcAAAAA:iJpo1EhqcVN_89UT2AMMR0FynAC9mnake3YgBbFUoG81rNn8jbVQcQTs6NJdt3l3XDOQDRBreeILUOpvNrRglQ8CR6HQuHbg7x_F6CIIdaK_rTVfFfZMUg&seq=1#metadata_info_tab_contents">2004</a>). Discussion on Causality [with Reply]. <em>Scandinavian Journal of Statistics, 31</em>(2), 189-201.</li>
<li>Pearl, J. (<a href="https://dl.acm.org/citation.cfm?id=3241036">2019a</a>). The seven tools of causal inference, with reflections on machine learning. <em>Commun. ACM, 62</em>(3), 54-60.</li>
<li>Pearl, J. (<a href="https://www.degruyter.com/view/j/jci.2019.7.issue-1/jci-2019-2002/jci-2019-2002.xml">2019b</a>). On the interpretation of do(x). <em>Journal of Causal Inference, 7</em>(1).</li>
<li>Pearl, J. (<a href="https://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf">2012</a>). The Causal Foundations of Structural Equation Modeling.</li>
<li>Pearl, J., Glymour, M., & Jewell, N. P. (<a href="http://bayes.cs.ucla.edu/PRIMER/">2016</a>). Causal Inference in Statistics: A Primer. John Wiley & Sons.</li>
<li>Peters, J., Janzing, D., & Schölkopf, B. (<a href="https://mitpress.mit.edu/books/elements-causal-inference">2017</a>). <em>Elements of Causal Inference: Foundations and Learning Algorithms</em>. MIT Press.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The content of this blog post is an extended write-up of a one-hour lecture I gave to $3^{\text{rd}}$ year psychology undergraduate students at the University of Amsterdam. You can view the presentation, which includes exercises at the end, <a href="https://fabiandablander.com/assets/talks/Causal-Lecture">here</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Messerli (2012) was the first to look at this relationship. The data I present here are somewhat different. I include Nobel Laureates up to 2019, and I use the 2017 chocolate consumption data as reported <a href="https://www.statista.com/statistics/819288/worldwide-chocolate-consumption-by-country/">here</a>. You can download the data set <a href="https://fabiandablander.com/assets/data/nobel-chocolate.csv">here</a>. To get the data reported by Messerli (2012) into R, you can follow <a href="http://gforge.se/2012/12/chocolate-and-nobel-prize/">this</a> blogpost. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I can recommend <a href="https://www.edx.org/course/causal-diagrams-draw-your-assumptions-before-your">this</a> course on causal diagrams by Miguel Hernán to get more intuition for causal graphical models. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>If the converse implication, that is, the implication from the distribution to the graph holds, we say that the graph is <em>faithful</em> to the distribution. This is an important assumption in causal learning because it allows one to estimate causal relations from conditional independencies in the data. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>A causal effect is <em>direct</em> only at a particular level of abstraction. The drug works by inducing certain biochemical reactions that might themselves be described by DAGs. On a finer scale, then, the direct effect ceases to be direct. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Structural Causal Models are closely related to Structural Equation Models. The latter allow latent variables, but their causal content has been debated throughout the last century. For more information, see for example Pearl (2012) and Bollen & Pearl (2013). <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>For the interpretation of the <em>do</em>-operator for non-manipulable causes, see Pearl (2019b). <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Here, we have focused on <em>deterministic</em> counterfactuals, assigning a single value to the counterfactual $R_{T = 1}(\epsilon_1)$. This is in contrast to <em>stochastic</em> or <em>non-deterministic</em> counterfactuals, which follow a distribution. This distinction does not matter for average causal effects, but it does for individual ones (Hernán & Robins, 2020, p. 10). <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>One can only hope that Bayesians and Frequentists become inspired by the pragmatism expressed here so poetically by Lauritzen. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderAn extended version of this blog post is available from here. Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewell Wright, Judea Pearl, and others. We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the “ladder of causal inference”, from association (seeing) to intervention (doing) to counterfactuals (imagining). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the do-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let’s dive in!1 Correlation and Causation Messerli (2012) published a paper entitled “Chocolate Consumption, Cognitive Function, and Nobel Laureates” in The New England Journal of Medicine showing a strong positive relationship between chocolate consumption and the number of Nobel Laureates. I have found an even stronger relationship using updated data2, as visualized in the figure below. Now, except for people in the chocolate business, it would be quite a stretch to suggest that increasing chocolate consumption would increase the number Nobel Laureates. Correlation does not imply causation because it does not constrain the possible causal relations enough. Hans Reichenbach (1956) formulated the common cause principle which speaks to this fact: If two random variables $X$ and $Y$ are statistically dependent ($X \not \perp Y$), then either (a) $X$ causes $Y$, (b) $Y$ causes $X$, or (c) there exists a third variable $Z$ that causes both $X$ and $Y$. 
Further, $X$ and $Y$ become independent given $Z$, i.e., $X \perp Y \mid Z$. An in principle straightforward way to break this uncertainty is to conduct an experiment: we could, for example, force the citizens of Austria to consume more chocolate, and study whether this increases the number of Nobel laureates in the following years. Such experiments are clearly unfeasible, but even in less extreme settings it is frequently unethical, impractical, or impossible — think of smoking and lung cancer — to study an association experimentally. Causal inference provides us with tools that license causal statements even in the absence of a true experiment. This comes with strong assumptions. In the next section, we discuss the “causal hierarchy”. The Causal Hierarchy Pearl (2019a) introduces a causal hierarchy with three levels — association, intervention, and counterfactuals — as well as three prototypical actions corresponding to each level — seeing, doing, and imagining. In the remainder of this blog post, we will tackle each level in turn. Seeing Association is on the most basic level; it makes us see that two or more things are somehow related. Importantly, we need to distinguish between marginal associations and conditional associations. The latter are the key building block of causal inference. The figure below illustrates these two concepts. If we look at the whole, aggregated data on the left we see that the continuous variables $X$ and $Y$ are positively correlated: an increase in values for $X$ co-occurs with an increase in values for $Y$. This relation describes the marginal association of $X$ and $Y$ because we do not care whether $Z = 0$ or $Z = 1$. On the other hand, if we condition on the binary variable $Z$, we find that there is no relation: $X \perp Y \mid Z$. (For more on marginal and conditional associations in the case of Gaussian distributions, see this blog post). 
In the next section, we discuss a powerful tool that allows us to visualize such dependencies. Directed Acyclic Graphs We can visualize the statistical dependencies between the three variables using a graph. A graph is a mathematical object that consists of nodes and edges. In the case of Directed Acyclic Graphs (DAGs), these edges are directed. We take our variables $(X, Y, Z$) to be nodes in such DAG and we draw (or omit) edges between these nodes so that the conditional (in)dependence structure in the data is reflected in the graph. We will explain this more formally shortly. For now, let’s focus on the relationship between the three variables. We have seen that $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. It turns out that we can draw three DAGs that encode this fact; these are the first three DAGs in the figure below. $X$ and $Y$ are dependent through $Z$ in these graphs, and conditioning on $Z$ blocks the path between $X$ and $Y$. We state this more formally shortly. While it is natural to interpret the arrows causally, we do not do so here. For now, the arrows are merely tools that help us describe associations between variables. The figure above also shows a fourth DAG, which encodes a different set of conditional (in)dependence relations between $X$, $Y$, and $Z$. The figure below illustrates this: looking at the aggregated data we do not find a relation between $X$ and $Y$ — they are marginally independent — but we do find one when looking at the disaggregated data — $X$ and $Y$ are conditionally dependent given $Z$. A real-world example might help build intuition: Looking at people who are single and who are in a relationship as a separate group, being attractive ($X$) and being intelligent ($Y$) are two independent traits. This is what we see in the left panel in the figure above. Let’s make the reasonable assumption that both being attractive and being intelligent are positively related with being in a relationship. 
What does this imply? First, it implies that, on average, single people are less attractive and less intelligent (see red data points). Second, and perhaps counter-intuitively, it implies that in the population of single people (and people in a relationship, respectively), being attractive and being intelligent are negatively correlated. After all, if the handsome person you met at the bar were also intelligent, then he would most likely be in a relationship! In this example, visualized in the fourth DAG, $Z$ is commonly called a collider. Suppose we want to estimate the association between $X$ and $Y$ in the whole population. Conditioning on a collider (for example, by only analyzing data from people who are not in a relationship) while computing the association between $X$ and $Y$ will lead to a different estimate, and the induced bias is known as collider bias. It is a serious issue not only in dating, but also for example in medicine. The simple graphs shown above are the building blocks of more complicated graphs. In the next section, we describe a tool that can help us find (conditional) independencies between sets of variables.3 $d$-separation For large graphs, it is not obvious how to conclude that two nodes are (conditionally) independent. d-separation is a tool that allows us to check this algorithmically. We need to define some concepts: A path from $X$ to $Y$ is a sequence of nodes and edges such that the start and end nodes are $X$ and $Y$, respectively. A conditioning set $\mathcal{L}$ is the set of nodes we condition on (it can be empty). A collider along a path blocks that path. However, conditioning on a collider (or any of its descendants) unblocks that path. With these definitions out of the way, we call two nodes $X$ and $Y$ $d$-separated by $\mathcal{L}$ if conditioning on all members in $\mathcal{L}$ blocks all paths between the two nodes. If this is your first encounter with $d$-separation, then this is a lot to wrap your head around. 
To get some practice, look at the graph on the left side. First, note that there are no marginal dependencies; this means that without conditioning or blocking nodes, any two nodes are connected by a path. For example, there is a path going from $X$ to $Y$ through $Z$, and there is a path from $V$ to $U$ going through $Y$ and $W$. However, there are a number of conditional independencies. For example, $X$ and $Y$ are conditionally independent given $Z$. Why? There are two paths from $X$ to $Y$: one through $Z$ and one through $W$. However, since $W$ is a collider on the path from $X$ to $Y$, the path is already blocked. The only unblocked path from $X$ to $Y$ is through $Z$, and conditioning on it therefore blocks all remaining open paths. Additionally conditioning on $W$ would unblock one path, and $X$ and $Y$ would again be associated. So far, we have implicitly assumed that conditional (in)dependencies in the graph correspond to conditional (in)dependencies between variables. We make this assumption explicit now. In particular, note that d-separation provides us with an independence model $\perp_{\mathcal{G}}$ defined on graphs. To connect this to our standard probabilistic independence model $\perp_{\mathcal{P}}$ defined on random variables, we assume the following Markov property: In words, we assume that if the nodes $X$ and $Y$ are d-separated by $Z$ in the graph $\mathcal{G}$, the corresponding random variables $X$ and $Y$ are conditionally independent given $Z$. This implies that all conditional independencies in the data are represented in the graph.4 Moreover, the statement above implies (and is implied by) the following factorization: where $\text{pa}^{\mathcal{G}}(X_i)$ denotes the parents of the node $X_i$ in graph $\mathcal{G}$ (see Peters, Janzing, & Schölkopf, p. 101). A node is a parent of another node if it has an outgoing arrow to that node; for example, $X$ is a parent of $Z$ and $W$ in the graph above. 
The above factorization implies that a node $X$ is independent of its non-descendants given its parents. d-separation is an extremely powerful tool. Until now, however, we have only looked at DAGs to visualize (conditional) independencies. In the next section, we go beyond seeing to doing. Doing We do not merely want to see the world, but also change it. From this section on, we are willing to interpret DAGs causally. As Dawid (2009a) warns, this is a serious step. In merely describing conditional independencies — seeing — the arrows in the DAG played a somewhat minor role, being nothing but “incidental construction features supporting the $d$-separation semantics” (Dawid, 2009a, p. 66). In this section, we endow the DAG with a causal meaning and interpret the arrows as denoting direct causal effects. What is a causal effect? Following Pearl and others, we take an interventionist position and say that a variable $X$ has a causal influence on $Y$ if changing $X$ leads to changes in $Y$. This position is a very useful one in practice, but not everybody agrees with it (e.g., Cartwright, 2007). The figure below shows the observational DAGs from above (top row) as well as the manipulated DAGs (bottom row) where we have intervened on the variable $X$, that is, set the value of the random variable $X$ to a constant $x$. Note that setting the value of $X$ cuts all incoming causal arrows since its value is thereby determined only by the intervention, not by any other factors. As is easily verified with $d$-separation, the first three graphs in the top row encode the same conditional independence structure. This implies that we cannot distinguish them using only observational data. Interpreting the edges causally, however, we see that the DAGs have a starkly different interpretation. The bottom row makes this apparent by showing the result of an intervention on $X$. 
In the leftmost causal DAG, $Z$ is on the causal path from $X$ to $Y$, and intervening on $X$ therefore influences $Y$ through $Z$. In the DAG next, to it $Z$ is on the causal path from $Y$ to $X$, and so intervening on $X$ does not influence $Y$. In the third DAG, $Z$ is a common cause and — since there is no other path from $X$ to $Y$ — intervening on $X$ does not influence $Y$. For the collider structure in the rightmost DAG, intervening on $X$ does not influence $Y$ because there is no unblocked path from $X$ to $Y$. To make the distinction between seeing and doing, Pearl introduced the do-operator. While $p(Y \mid X = x)$ denotes the observational distribution, which corresponds to the process of seeing, $p(Y \mid do(X = x))$ corresponds to the interventional distribution, which corresponds to the process of doing. The former describes what values $Y$ would likely take on when $X$ happened to be $x$, while the latter describes what values $Y$ would likely take on when $X$ would be set to $x$. Computing causal effects $P(Y \mid do(X = x))$ describes the causal effect of $X$ on $Y$, but how do we compute it? Actually doing the intervention might be unfeasible or unethical — side-stepping actual interventions and still getting at causal effects is the whole point of this approach to causal inference. We want to learn causal effects from observational data, and so all we have is the observational DAG. The causal quantity, however, is defined on the manipulated DAG. We need to build a bridge between the observational DAG and the manipulated DAG, and we do this by making two assumptions. First, we assume that interventions are local. This means that if I set $X = x$, then this only influences the variable $X$, with no other direct influence on any other variable. Of course, intervening on $X$ will influence other variables, but only through $X$, not directly through us intervening. 
In colloquial terms, we do not have a “fat hand”, but act like a surgeon precisely targeting only a very specific part of the DAG; we say that the DAG is composed of modular parts. We can encode this using the factorization property above: which we now interpret causally. The factors in the product are sometimes called causal Markov kernels; they constitute the modular parts of the system. Second, we assume that the mechanism by which variables interact do not change through interventions; that is, the mechanism by which a cause brings about its effects does not change whether this occurs naturally or by intervention (see e.g., Pearl, Glymour, & Jewell, p. 56). With these two assumptions in hand, further note that $p(Y \mid do(X = x))$ can be understood as the observational distribution in the manipulated DAG — $p_m(Y \mid X = x)$ — that is, the DAG where we set $X = x$. This is because after doing the intervention (which catapults us into the manipulated DAG), all that is left for us to do is to see its effect. Observe that the leftmost and rightmost DAG above remain the same under intervention on $X$, and so the interventional distribution $p(Y \mid do(X = x))$ is just the conditional distribution $p(Y \mid X = x)$. The middle DAGs require a bit more work: The first equality follows by definition. The second and third equality follow from the sum and product rule of probability. The last line follows from the assumption that the mechanism through which $X$ influences $Y$ is independent of whether we set $X$ or whether $X$ naturally occurs, that is, $p_{m}(Y = y \mid X = x, Z = z) = p(Y = y \mid X = x, Z = z)$, and the assumption that interventions are local, that is, $p_m(Z = z) = p(Z = z)$. Thus, the interventional distribution we care about is equal to the conditional distribution of $Y$ given $X$ when we adjust for $Z$. 
<p>Graphically speaking, this blocks the path $X \leftarrow Z \leftarrow Y$ in the left middle graph and the path $X \leftarrow Z \rightarrow Y$ in the right middle graph. If there were a path $X \rightarrow Y$ in these two latter graphs, and if we did not adjust for $Z$, then the causal effect of $X$ on $Y$ would be confounded. For these simple DAGs, however, it is already clear from the fact that $X$ is independent of $Y$ given $Z$ that $X$ cannot have a causal effect on $Y$. In the next section, we study a more complicated graph and look at confounding more closely.</p>

<h2 id="confounding">Confounding</h2>

<p>Confounding has been given various definitions over the decades, but usually denotes the situation where a (possibly unobserved) common cause obscures the causal relationship between two or more variables. Here, we are slightly more precise and call a causal effect of $X$ on $Y$ <em>confounded</em> if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$, which also implies that collider bias is a type of confounding. This occurred in the middle two DAGs in the example above, as well as in the chocolate consumption and Nobel Laureates example at the beginning of the blog post.</p>

<p>Confounding is the bane of observational data analysis. Helpfully, causal DAGs provide us with a tool to describe multivariate relations between variables. Once we have stated our assumptions clearly, the do-calculus further provides us with a means to know what variables we need to adjust for so that causal effects are unconfounded. We follow Pearl, Glymour, & Jewell (2016, p. 61) and define the <em>backdoor criterion</em>: given two nodes $X$ and $Y$, an adjustment set $\mathcal{L}$ fulfills the backdoor criterion if no member in $\mathcal{L}$ is a descendant of $X$, and the members in $\mathcal{L}$ block all backdoor paths between $X$ and $Y$, that is, all paths that contain an arrow into $X$. Adjusting for $\mathcal{L}$ thus yields the causal effect of $X \rightarrow Y$.</p>
<p>The key observation is that this (a) blocks all spurious, that is, non-causal paths between $X$ and $Y$, (b) leaves all directed paths from $X$ to $Y$ unblocked, and (c) creates no spurious paths. To see this in action, let’s again look at the DAG on the left. The causal effect of $Z$ on $U$ is confounded by $X$, because in addition to the legitimate causal path $Z \rightarrow Y \rightarrow W \rightarrow U$, there is also an unblocked path $Z \leftarrow X \rightarrow W \rightarrow U$ which confounds the causal effect. The backdoor criterion would have us condition on $X$, which blocks the spurious path and renders the causal effect of $Z$ on $U$ unconfounded. Note that conditioning on $W$ would also block this spurious path; however, it would also block the causal path $Z \rightarrow Y \rightarrow W \rightarrow U$.</p>

<p>Before moving on, let’s catch a quick breath. We have already discussed a number of very important concepts. At the lowest level of the causal hierarchy — association — we have discovered DAGs and $d$-separation as a powerful tool to reason about conditional (in)dependencies between variables. Moving to intervention, the second level of the causal hierarchy, we have satisfied our need to interpret the arrows in a DAG causally. Doing so required strong assumptions, but it allowed us to go beyond seeing and model the outcome of interventions. This hopefully clarified the notion of confounding. In particular, collider bias is a type of confounding, which has important practical implications: we should not blindly enter all variables into a regression in order to “control” for them, but think carefully about what the underlying causal DAG could look like. Otherwise, we might induce spurious associations.</p>

<p>The concepts from causal inference can help us understand methodological phenomena that have been discussed for decades. In the next section, we apply the concepts we have seen so far to make sense of one such phenomenon: Simpson’s Paradox.</p>
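<p>This backdoor example can be verified numerically. The sketch below (my own linear-Gaussian parameterization of the DAG with edges $X \rightarrow Z$, $X \rightarrow W$, $Z \rightarrow Y$, $Y \rightarrow W$, $W \rightarrow U$; all coefficients are invented) estimates the effect of $Z$ on $U$ three ways: without adjustment, adjusting for $X$, and adjusting for $W$.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
n = 300_000

# linear-Gaussian data for the DAG  X -> Z,  X -> W,  Z -> Y,  Y -> W,  W -> U
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.7 * z + rng.normal(size=n)
w = 0.6 * y + 0.5 * x + rng.normal(size=n)
u = 0.9 * w + rng.normal(size=n)

# the causal effect of Z on U runs along Z -> Y -> W -> U
true_effect = 0.7 * 0.6 * 0.9

def coef(target, *preds):
    """Least-squares coefficient of the first predictor."""
    X = np.column_stack([np.ones(n)] + list(preds))
    return np.linalg.lstsq(X, target, rcond=None)[0][1]

b_naive = coef(u, z)        # confounded via Z <- X -> W -> U
b_backdoor = coef(u, z, x)  # adjusting for X blocks the backdoor path
b_bad = coef(u, z, w)       # adjusting for W also blocks the causal path

assert abs(b_backdoor - true_effect) < 0.02
assert abs(b_naive - true_effect) > 0.10
assert abs(b_bad) < 0.02
```

<p>Only the regression that adjusts for $X$ recovers the true effect; adjusting for $W$ drives the estimate towards zero because it blocks the causal path itself.</p>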
<h2 id="example-application-simpsons-paradox">Example Application: Simpson’s Paradox</h2>

<p>This section follows the example given in Pearl, Glymour, & Jewell (2016, Ch. 1) with slight modifications. Suppose you observe $N = 700$ patients who either choose to take a drug or not; note that this is not a randomized controlled trial. The table below shows the number of recovered patients split across sex.</p>

<div class="footnotes">
  <ol>
    <li><p>The content of this blog post is an extended write-up of a one-hour lecture I gave to $3^{\text{rd}}$ year psychology undergraduate students at the University of Amsterdam. You can view the presentation, which includes exercises at the end, here. ↩</p></li>
    <li><p>Messerli (2012) was the first to look at this relationship. The data I present here are somewhat different. I include Nobel Laureates up to 2019, and I use the 2017 chocolate consumption data as reported here. You can download the data set here. To get the data reported by Messerli (2012) into R, you can follow this blogpost. ↩</p></li>
    <li><p>I can recommend this course on causal diagrams by Miguel Hernán to get more intuition for causal graphical models. ↩</p></li>
    <li><p>If the converse implication, that is, the implication from the distribution to the graph, holds, we say that the graph is <em>faithful</em> to the distribution. This is an important assumption in causal learning because it allows one to estimate causal relations from conditional independencies in the data. ↩</p></li>
  </ol>
</div>
A brief primer on Variational Inference2019-10-30T12:00:00+00:002019-10-30T12:00:00+00:00https://fabiandablander.com/r/Variational-Inference<link rel="stylesheet" href="../highlight/styles/default.css" />
<script src="../highlight/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script>$('pre.stan code').each(function(i, block) {hljs.highlightBlock(block);});</script>
<p>Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let’s start unpacking!</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>Bayes’ rule states that</p>
<script type="math/tex; mode=display">\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,</script>
<p>where $\mathbf{z}$ denotes latent parameters we want to infer and $\mathbf{x}$ denotes data.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> Bayes’ rule is, in general, difficult to apply because it requires dealing with a potentially high-dimensional integral — the marginal likelihood. Optimization, which involves taking derivatives instead of integrating, is much <a href="https://xkcd.com/2117/">easier</a> and generally faster than the latter, and so our goal will be to reframe this integration problem as one of optimization.</p>
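<p>To see why the marginal likelihood is the bottleneck: in one dimension the integral is harmless and can be computed by brute force; the trouble starts when $\mathbf{z}$ is high-dimensional. As a sanity check, here is a sketch (model and numbers are my own) comparing simple quadrature against the closed-form marginal likelihood of a conjugate Beta-Bernoulli model:</p>

```python
import numpy as np
from math import lgamma

# Bernoulli data with a Beta(a, b) prior on the success probability theta
k, n = 7, 20        # 7 successes in 20 trials
a, b = 2.0, 2.0

def log_beta_fn(p, q):
    # log of the Beta function via log-Gamma
    return lgamma(p) + lgamma(q) - lgamma(p + q)

# closed-form log marginal likelihood of the Beta-Bernoulli model
log_ml_exact = log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)

# the same one-dimensional integral, done by brute-force quadrature
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)
prior = np.exp((a - 1) * np.log(theta) + (b - 1) * np.log(1 - theta)
               - log_beta_fn(a, b))
lik = theta ** k * (1 - theta) ** (n - k)
f = lik * prior
d = theta[1] - theta[0]
log_ml_numeric = np.log(np.sum((f[1:] + f[:-1]) / 2) * d)  # trapezoid rule

assert abs(log_ml_exact - log_ml_numeric) < 1e-6
```

<p>With, say, twenty latent parameters instead of one, a comparable grid would need $200{,}000^{20}$ evaluations, which is why we turn to optimization instead.</p>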
<h1 id="variational-objective">Variational objective</h1>
<p>We want to get at the posterior distribution, but instead of sampling we simply try to find a density $q^\star(\mathbf{z})$ from a family of densities $\mathrm{Q}$ that best approximates the posterior distribution:</p>
<script type="math/tex; mode=display">q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,</script>
<p>where $\text{KL}(. \lvert \lvert.)$ denotes the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>:</p>
<script type="math/tex; mode=display">\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \mathrm{d}\mathbf{z} \enspace .</script>
<p>We cannot compute this Kullback-Leibler divergence because it still depends on the nasty integral $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$. To see this dependency, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace ,
\end{aligned} %]]></script>
<p>where we have expanded the expectation to more clearly behold our nemesis. In doing so, we have seen that $\text{log } p(\mathbf{x})$ is actually a constant with respect to $q(\mathbf{z})$; this means that we can ignore it in our optimization problem. Moreover, minimizing a quantity means maximizing its negative, and so we maximize the following quantity:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x}) \right) \\[.5em]
&= -\left(\mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x}) - \text{log } p(\mathbf{x})}_{\text{Nemesis perishes}}\right) \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \enspace .
\end{aligned} %]]></script>
<p>We can expand the joint probability to get more insight into this equation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \underbrace{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z})\right]}_{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]} - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z})}{q(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z})\right) \enspace .
\end{aligned} %]]></script>
<p>This is cool. It says that maximizing the ELBO finds an approximate distribution $q(\mathbf{z})$ for latent quantities $\mathbf{z}$ that allows the data to be predicted well, i.e., leads to a high expected log likelihood, but that a penalty is incurred if $q(\mathbf{z})$ strays far away from the prior $p(\mathbf{z})$. This mirrors the usual balance in Bayesian inference between likelihood and prior (Blei, Kucukelbir, & McAuliffe, 2017).</p>
<p>ELBO stands for <em>evidence lower bound</em>. The marginal likelihood is sometimes called evidence, and we see that ELBO is indeed a lower bound for the evidence:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x})\right) \\[.5em]
\text{log } p(\mathbf{x}) &= \text{ELBO}(q) + \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \\[.5em]
\text{log } p(\mathbf{x}) &\geq \text{ELBO}(q) \enspace ,
\end{aligned} %]]></script>
<p>since the Kullback-Leibler divergence is non-negative. Heuristically, one might then use the ELBO as a way to select between models. For more on predictive model selection, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this</a> and <a href="https://fabiandablander.com/r/Bayes-Potter.html">this</a> blog post.</p>
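<p>We can verify the bound numerically. The sketch below uses a conjugate Gaussian-mean model of my own choosing, because there everything is available in closed form: it computes the log evidence by quadrature and checks that the ELBO of an arbitrary Gaussian $q$ stays below it, with (near) equality when $q$ equals the exact posterior.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=20)   # data from N(z, 1) with true z = 0.5
n = len(x)

zg = np.linspace(-10.0, 10.0, 20_001)   # quadrature grid for the latent z
dz = zg[1] - zg[0]

def trap(f):
    # composite trapezoid rule on the uniform grid
    return np.sum((f[1:] + f[:-1]) / 2) * dz

# log p(x, z) = sum_i log N(x_i | z, 1) + log N(z | 0, 1)
lj = (-0.5 * n * np.log(2 * np.pi)
      - 0.5 * ((x[:, None] - zg[None, :]) ** 2).sum(axis=0)
      - 0.5 * np.log(2 * np.pi) - 0.5 * zg ** 2)

log_evidence = np.log(trap(np.exp(lj)))

def elbo(m, s):
    # ELBO of a Gaussian q(z) = N(m, s^2), computed by quadrature
    log_q = -0.5 * np.log(2 * np.pi * s ** 2) - 0.5 * ((zg - m) / s) ** 2
    return trap(np.exp(log_q) * (lj - log_q))

# exact posterior: N(n * xbar / (n + 1), 1 / (n + 1))
m_post = n * x.mean() / (n + 1)
s_post = np.sqrt(1.0 / (n + 1))

assert elbo(0.0, 1.0) < log_evidence                    # strict lower bound
assert abs(elbo(m_post, s_post) - log_evidence) < 1e-3  # tight at the posterior
```

<p>The gap between the two is exactly the Kullback-Leibler divergence from $q$ to the posterior, which vanishes when the two coincide.</p>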
<h1 id="why-variational">Why variational?</h1>
<p>Our optimization problem is about finding $q^\star(\mathbf{z})$ that best approximates the posterior distribution. This is in contrast to more familiar optimization problems such as maximum likelihood estimation where one wants to find, for example, the <em>single best value</em> that maximizes the log likelihood. For such a problem, one can use standard calculus (see for example <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">this</a> blog post). In our setting, we do not want to find a single best value but rather a <em>single best function</em>. To do this, we can use <em>variational calculus</em> from which variational inference derives its name (Bishop, 2006, p. 462).</p>
<p>A function takes an input value and returns an output value. We can define a <em>functional</em> which takes a whole function and returns an output value. The <em>entropy</em> of a probability distribution is a widely used functional:</p>
<script type="math/tex; mode=display">\text{H}[p] = -\int p(x) \, \text{log } p(x) \mathrm{d} x \enspace ,</script>
<p>which takes as input the probability distribution $p(x)$ and returns a single value, its entropy. In variational inference, we want to find the function that maximizes the ELBO, which is a functional.</p>
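<p>As a quick numerical illustration of a functional (using the convention $\text{H}[p] = -\int p \, \text{log } p \, \mathrm{d}x$), the sketch below evaluates the entropy of a Gaussian by quadrature and compares it to the closed form $\frac{1}{2}\text{log}(2\pi e \sigma^2)$; the choice of $\sigma$ is arbitrary.</p>

```python
import numpy as np

sigma = 2.0
xs = np.linspace(-40.0, 40.0, 400_001)   # +- 20 standard deviations
dx = xs[1] - xs[0]

# Gaussian density and its pointwise contribution -p log p to the entropy
p = np.exp(-0.5 * (xs / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
g = -p * np.log(p)

H_numeric = np.sum((g[1:] + g[:-1]) / 2) * dx      # trapezoid rule
H_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)

assert abs(H_numeric - H_closed) < 1e-5
```

<p>Feeding a different density into the same functional returns a different number; the functional maps whole functions to scalars.</p>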
<p>In order to make this optimization problem more manageable, we need to constrain the functions in some way. One could, for example, assume that $q(\mathbf{z})$ is a Gaussian distribution with parameter vector $\omega$. The ELBO then becomes a function of $\omega$, and we employ standard optimization methods to solve this problem. Instead of restricting the parametric form of the variational distribution $q(\mathbf{z})$, in the next section we use an independence assumption to manage the inference problem.</p>
<h1 id="mean-field-variational-family">Mean-field variational family</h1>
<p>A frequently used approximation is to assume that the latent variables $z_j$ for $j \in \{1, \ldots, m\}$ are mutually independent, each governed by their own variational density:</p>
<script type="math/tex; mode=display">q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j) \enspace .</script>
<p>Note that this <em>mean-field variational family</em> cannot model correlations in the posterior distribution; by construction, the latent parameters are mutually independent. Observe that we do not make any parametric assumption about the individual $q_j(z_j)$. Instead, their parametric form is derived for every particular inference problem.</p>
<p>We start from our definition of the ELBO and apply the mean-field assumption:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z}\enspace .
\end{aligned} %]]></script>
<p>In the following, we optimize the ELBO with respect to a single variational density $q_j(z_j)$ and assume that all others are fixed:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q_j) &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j - \underbrace{\int \prod_{i\neq j}^m q_i(z_i) \, \text{log} \prod_{i\neq j}^m q_i(z_i) \, \mathrm{d}\mathbf{z}_{-j}}_{\text{Constant with respect to } q_j(z_j)} \\[.5em]
&\propto \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \left(\int \prod_{i\neq j}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z}_{-j}\right)\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \enspace .
\end{aligned} %]]></script>
<p>One could use variational calculus to derive the optimal variational density $q_j^\star(z_j)$; instead, we follow Bishop (2006, p. 465) and define the distribution</p>
<script type="math/tex; mode=display">\text{log } \tilde{p}{(\mathbf{x}, z_j)} = \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathcal{Z} \enspace ,</script>
<p>where we need to make sure that it integrates to one by subtracting the (log) normalizing constant $\mathcal{Z}$. With this in mind, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q_j) &\propto \int q_j(z_j) \, \text{log } \tilde{p}{(\mathbf{x}, z_j)} \, \mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \text{log } \frac{\tilde{p}{(\mathbf{x}, z_j)}}{q_j(z_j)} \, \mathrm{d}z_j \\[.5em]
&= -\int q_j(z_j) \, \text{log } \frac{q_j(z_j)}{\tilde{p}{(\mathbf{x}, z_j)}} \, \mathrm{d}z_j \\[.5em]
&= -\text{KL}\left(q_j(z_j) \, \lvert\lvert \, \tilde{p}(\mathbf{x}, z_j)\right) \enspace .
\end{aligned} %]]></script>
<p>Thus, maximizing the ELBO with respect to $q_j(z_j)$ is minimizing the Kullback-Leibler divergence between $q_j(z_j)$ and $\tilde{p}(\mathbf{x}, z_j)$; it is zero when the two distributions are equal. Therefore, under the mean-field assumption, the optimal variational density $q_j^\star(z_j)$ is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q_j^\star(z_j) &= \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right] - \mathcal{Z}\right) \\[.5em]
&= \frac{\text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right)}{\int \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right) \mathrm{d}z_j} \enspace ,
\end{aligned} %]]></script>
<p>see also Bishop (2006, p. 466). This is not an explicit solution, however, since each optimal variational density depends on all others. This calls for an iterative solution in which we first initialize all factors $q_j(z_j)$ and then cycle through them, updating each conditional on the updates of the others. Such a procedure is known as <em>Coordinate Ascent Variational Inference</em> (CAVI). Further, note that</p>
<script type="math/tex; mode=display">p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) = \frac{p(z_j, \mathbf{z}_{-j}, \mathbf{x})}{p(\mathbf{z}_{-j}, \mathbf{x})} \propto p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \enspace ,</script>
<p>which allows us to write the updates in terms of the conditional posterior distribution of $z_j$ given all other factors $\mathbf{z}_{-j}$. This looks <em>a lot</em> like Gibbs sampling, which we discussed in detail in a <a href="https://fabiandablander.com/r/Spike-and-Slab.html">previous</a> blog post. In the next section, we implement CAVI for a simple linear regression problem.</p>
<h1 id="application-linear-regression">Application: Linear regression</h1>
<p>In a <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">previous</a> blog post, we traced the history of least squares and applied it to the most basic problem: fitting a straight line to a number of points. Here, we study the same problem but swap the optimization procedure: instead of least squares or maximum likelihood, we use variational inference. Our linear regression setup is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
y &\sim \mathcal{N}(\beta x, \sigma^2) \\[.5em]
\beta &\sim \mathcal{N}(0, \sigma^2 \tau^2) \\[.5em]
p(\sigma^2) &\propto \frac{1}{\sigma^2} \enspace ,
\end{aligned} %]]></script>
<p>where we assume that there is no intercept (i.e., $\beta_0 = 0$); we assign the error variance $\sigma^2$ an improper Jeffreys’ prior and $\beta$ a Gaussian prior with variance $\sigma^2\tau^2$. We scale the prior of $\beta$ by the error variance so that we can reason in terms of a standardized effect size $\beta / \sigma$, since with this specification:</p>
<script type="math/tex; mode=display">\text{Var}\left[\frac{\beta}{\sigma}\right] = \frac{1}{\sigma^2} \text{Var}[\beta] = \frac{\sigma^2 \tau^2}{\sigma^2} = \tau^2 \enspace .</script>
<p>As a heads up, we have to do a surprising amount of calculations to implement variational inference even for this simple problem. In the next section, we start our journey by deriving the variational density for $\sigma^2$.</p>
<h2 id="variational-density-for-sigma2">Variational density for $\sigma^2$</h2>
<p>Our optimal variational density $q^\star(\sigma^2)$ is given by:</p>
<script type="math/tex; mode=display">q^\star(\sigma^2) \propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \enspace .</script>
<p>To get started, we need to derive the conditional posterior distribution $p(\sigma^2 \mid \mathbf{y}, \beta)$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\sigma^2 \mid \mathbf{y}, \beta) &\propto p(\mathbf{y} \mid \sigma^2, \beta) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= \prod_{i=1}^n (2\pi)^{-\frac{1}{2}} \left(\sigma^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2} \left(y_i - \beta x_i\right)^2\right) \underbrace{(2\pi)^{-\frac{1}{2}} \left(\sigma^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}_{p(\beta)} \underbrace{\left(\sigma^2\right)^{-1}}_{p(\sigma^2)} \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{2\sigma^2} \underbrace{\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)}_{A}\right) \enspace ,
\end{aligned} %]]></script>
<p>which is proportional to an inverse Gamma distribution. Moving on, we exploit the linearity of the expectation and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q^\star(\sigma^2) &\propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1}\right] - \mathbb{E}_{q(\beta)}\left[\frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \\[.5em]
&= \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \enspace .
\end{aligned} %]]></script>
<p>This, too, looks like an inverse Gamma distribution! Plugging in the normalizing constant, we arrive at:</p>
<script type="math/tex; mode=display">q^\star(\sigma^2)= \frac{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \underbrace{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]}_{\nu}\right) \enspace .</script>
<p>Note that this quantity depends on $\beta$. In the next section, we derive the variational density for $\beta$.</p>
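<p>As a quick numerical check of the density just derived (with made-up values for the shape $\frac{n+1}{2}$ and the rate $\nu$): it should integrate to one, and its expectation of $1/\sigma^2$ should equal shape over rate, a fact we will need shortly.</p>

```python
import numpy as np
from math import lgamma

a, nu = 10.5, 4.2   # shape (n + 1) / 2 with n = 20, and an arbitrary rate nu

s2 = np.linspace(1e-4, 50.0, 2_000_001)
d = s2[1] - s2[0]

# log of the inverse Gamma density with shape a and rate nu
log_q = a * np.log(nu) - lgamma(a) - (a + 1) * np.log(s2) - nu / s2
q = np.exp(log_q)

def trap(f):
    # composite trapezoid rule on the uniform grid
    return np.sum((f[1:] + f[:-1]) / 2) * d

total = trap(q)        # the density should integrate to one
E_inv = trap(q / s2)   # E[1 / sigma^2] of an inverse Gamma is shape / rate

assert abs(total - 1.0) < 1e-3
assert abs(E_inv - a / nu) < 1e-3
```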
<h2 id="variational-density-for-beta">Variational density for $\beta$</h2>
<p>Our optimal variational density $q^\star(\beta)$ is given by:</p>
<script type="math/tex; mode=display">q^\star(\beta) \propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \enspace ,</script>
<p>and so we again have to derive the conditional posterior distribution $p(\beta \mid \mathbf{y}, \sigma^2)$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \sigma^2) &\propto p(\mathbf{y} \mid \beta, \sigma^2) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^ny_i^2- 2 \beta \sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{1}{2\sigma^2} \left( \beta^2 \left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) - 2 \beta \sum_{i=1}^n y_i x_i\right)\right) \\[.5em]
&=\text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta^2 - 2 \beta \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned} %]]></script>
<p>where we have “completed the square” (see also <a href="https://fabiandablander.com/statistics/Two-Properties.html">this</a> blog post) and realized that the conditional posterior is Gaussian. We continue by taking expectations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q^\star(\beta) &\propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right]\right) \\[.5em]
&= \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned} %]]></script>
<p>which is again proportional to a Gaussian distribution! Plugging in the normalizing constant yields:</p>
<script type="math/tex; mode=display">q^\star(\beta) = \left(2\pi\underbrace{\frac{\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}}_{\sigma^2_{\beta}}\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left(\beta - \underbrace{\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}}_{\mu_{\beta}}\right)^2\right) \enspace ,</script>
<p>Note that while the variance of this distribution, $\sigma^2_\beta$, depends on $q(\sigma^2)$, its mean $\mu_\beta$ does not.</p>
<p>To recap, instead of assuming a parametric form for the variational densities, we have derived the optimal densities under the mean-field assumption, that is, under the assumption that the parameters are independent: $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$. Assigning $\beta$ a Gaussian prior and $\sigma^2$ a Jeffreys’ prior, we have found that the variational density for $\sigma^2$ is an inverse Gamma distribution and that the variational density for $\beta$ is a Gaussian distribution. We noted that these variational densities depend on each other. However, this is not the end of the manipulation of symbols; both distributions still feature an expectation we need to remove. In the next section, we expand the remaining expectations.</p>
<h2 id="removing-expectations">Removing expectations</h2>
<p>Now that we know the parametric form of both variational densities, we can expand the terms that involve an expectation. In particular, to remove the expectation in the variational density for $\sigma^2$, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta)}\left[A \right] &= \mathbb{E}_{q(\beta)}\left[\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right] \\[.5em]
&= \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mathbb{E}_{q(\beta)}\left[\beta\right] + \sum_{i=1}^n x_i^2 \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] + \frac{1}{\tau^2} \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] \enspace .
\end{aligned} %]]></script>
<p>Noting that $\mathbb{E}_{q(\beta)}[\beta] = \mu_{\beta}$ and using the fact that:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}[\beta^2] = \text{Var}_{q(\beta)}\left[\beta\right] + \mathbb{E}_{q(\beta)}[\beta]^2
= \sigma^2_{\beta} + \mu_{\beta}^2 \enspace ,</script>
<p>the expectation becomes:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}\left[A\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) \enspace .</script>
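<p>The identity $\mathbb{E}_{q(\beta)}[\beta^2] = \sigma^2_{\beta} + \mu_{\beta}^2$ used here is easy to confirm by Monte Carlo (the values of $\mu_\beta$ and $\sigma^2_\beta$ below are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
mu_beta, s2_beta = 0.7, 0.3   # arbitrary variational mean and variance

# E[beta^2] = Var[beta] + E[beta]^2 for any distribution; check for a Gaussian
draws = rng.normal(mu_beta, np.sqrt(s2_beta), size=2_000_000)
assert abs(np.mean(draws ** 2) - (s2_beta + mu_beta ** 2)) < 5e-3
```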
<p>For the expectation which features in the variational distribution for $\beta$, things are slightly less elaborate, although the result also looks unwieldy. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] &= \int \frac{1}{\sigma^2}\frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2\\[0.50em]
&= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \int \left(\sigma^2\right)^{-\left(\frac{n + 1}{2} + 1\right) - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2 \\[0.50em]
&= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \frac{\Gamma\left(\frac{n + 1}{2} + 1\right)}{\nu^{\frac{n + 1}{2} + 1}} \\[0.50em]
&= \frac{n + 1}{2} \left(\frac{1}{2}\mathbb{E}_{q(\beta)}\left[A \right]\right)^{-1} \\[.5em]
&= \frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \enspace .
\end{aligned} %]]></script>
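<p>Putting these update equations together, the CAVI scheme for this model can be sketched as follows. This is a minimal Python sketch with variable names of my own choosing, not the post’s implementation (which uses R): it simulates data, notes that $\mu_\beta$ is fixed from the start, and iterates the two remaining updates until the variational variance $\sigma^2_\beta$ stops changing.</p>

```python
import numpy as np

rng = np.random.default_rng(42)
n, beta_true, tau2 = 1000, 0.6, 1.0   # simulated data with true sigma^2 = 1
x = rng.normal(size=n)
y = beta_true * x + rng.normal(size=n)

sum_x2, sum_xy, sum_y2 = np.sum(x ** 2), np.sum(x * y), np.sum(y ** 2)
prec = sum_x2 + 1.0 / tau2   # sum x_i^2 + 1 / tau^2, appears in both updates

# mu_beta does not depend on q(sigma^2), so it is fixed from the start
mu_beta = sum_xy / prec

s2_beta = 1.0   # initial guess for the variational variance of beta
for _ in range(100):
    # E_q(beta)[A] = sum y^2 - 2 mu sum xy + (s2 + mu^2)(sum x^2 + 1/tau^2)
    E_A = sum_y2 - 2.0 * sum_xy * mu_beta + (s2_beta + mu_beta ** 2) * prec
    # E_q(sigma^2)[1 / sigma^2] = (n + 1) / E_A for the inverse Gamma above
    E_inv_s2 = (n + 1) / E_A
    s2_new = 1.0 / (E_inv_s2 * prec)   # update for the variance of q(beta)
    if abs(s2_new - s2_beta) < 1e-12:
        break
    s2_beta = s2_new

assert abs(mu_beta - beta_true) < 0.15   # close to the true coefficient
assert 0.8 < E_inv_s2 < 1.25             # E[1/sigma^2] near 1 / true sigma^2
assert 0.0 < s2_beta < 0.01
```

<p>Because each update is a contraction here, the loop converges in a handful of iterations; in general one monitors the ELBO instead, which is what the next section derives.</p>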
<h2 id="monitoring-convergence">Monitoring convergence</h2>
<p>The algorithm works by first specifying initial values for the parameters of the variational densities and then iteratively updating them until the ELBO does not change anymore. This requires us to compute the ELBO, which we still need to derive, on each update. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y}, \beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2) \right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2)\right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \underbrace{\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right]}_{-\text{KL}\left(q(\beta, \sigma^2) \, \lvert\lvert \, p(\beta, \sigma^2)\right)}\enspace .
\end{aligned} %]]></script>
<p>Let’s take a deep breath and tackle the second term first:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{p(\beta \mid \sigma^2)}{q(\beta)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\left(2\pi\sigma^2\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}{\left(2\pi\sigma^2_\beta\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2_\beta} (\beta - \mu_\beta)^2\right)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\frac{1}{2}\text{log } \frac{\sigma^2_\beta}{\sigma^2\tau^2} - \frac{\beta^2}{2\sigma^2\tau^2} + \frac{(\beta - \mu_\beta)^2}{2\sigma^2_\beta}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\frac{1}{2}\text{log}\frac{\sigma^2_\beta}{\sigma^2\tau^2} - \frac{\sigma^2_\beta + \mu_\beta^2}{2\sigma^2\tau^2} + \frac{1}{2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \frac{1}{2}\text{log}\frac{\sigma^2_\beta}{\tau^2} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right] - \mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]\enspace .
\end{aligned} %]]></script>
<p>Note that there are three expectations left. However, we really deserve a break, and so instead of deriving them analytically we compute $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$ and $\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right]$ numerically using quadrature. This fails for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$, which we instead compute using Monte Carlo integration:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right] = \int q(\sigma^2) \, \text{log } q(\sigma^2) \, \mathrm{d}\sigma^2 \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\text{log } q(\sigma^2)}_{\sigma^2 \, \sim \, q(\sigma^2)} \enspace ,</script>
<p>We are left with the expected log likelihood. Instead of filling this blog post with more equations, we again resort to numerical methods. However, we refactor the expression so that numerical integration is more efficient:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] &= \int \int q(\beta) \, q(\sigma^2) \, \text{log } p(\mathbf{y} \mid \beta, \sigma^2) \, \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em]
&= \int q(\beta) \int q(\sigma^2) \left(-\frac{n}{2}\text{log}\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i\beta)^2\right) \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em]
&= -\frac{n}{2} \text{log}\left(2\pi\right) - \frac{n}{2}\int q(\sigma^2) \, \text{log}\left(\sigma^2\right) \mathrm{d}\sigma^2 - \frac{1}{2}\int q(\sigma^2) \, \frac{1}{\sigma^2} \, \mathrm{d}\sigma^2 \int q(\beta) \left(\sum_{i=1}^n (y_i - x_i\beta)^2\right) \mathrm{d}\beta \enspace .
\end{aligned} %]]></script>
<p>Since we have solved a similar problem already above, we evaluate the expectation with respect to $q(\beta)$ analytically:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right) \enspace .</script>
<p>In the next section, we implement the algorithm for our linear regression problem in R.</p>
<h1 id="implementation-in-r">Implementation in R</h1>
<p>Now that we have derived the optimal densities, we know how they are parameterized. Therefore, the ELBO is a function of these variational parameters and the parameters of the priors, which in our case is just $\tau^2$. We write a function that computes the ELBO:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'MCMCpack'</span><span class="p">)</span><span class="w">
</span><span class="cd">#' Computes the ELBO for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
</span><span class="cd">#' @param beta_mu mean of the variational density for \beta</span><span class="w">
</span><span class="cd">#' @param beta_sd standard deviation of the variational density for \beta</span><span class="w">
</span><span class="cd">#' @param nu parameter of the variational density for \sigma^2</span><span class="w">
</span><span class="cd">#' @param nr_samples number of samples for the Monte carlo integration</span><span class="w">
</span><span class="cd">#' @returns ELBO</span><span class="w">
</span><span class="n">compute_elbo</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e4</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\beta)</span><span class="w">
</span><span class="n">E_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="o">-</span><span class="kc">Inf</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\sigma^2)</span><span class="w">
</span><span class="n">E_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\sigma^2)</span><span class="w">
</span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\beta \mid \sigma^2)</span><span class="w">
</span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">tau2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># Compute expectations of the log variational densities q(\beta)</span><span class="w">
</span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_beta</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">log</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="c1"># E_log_q_sigma2 <- E_q_sigma2(function(x) log(dinvgamma(x, (n + 1)/2, nu))) # fails</span><span class="w">
</span><span class="c1"># Compute expectations of the log variational densities q(\sigma^2)</span><span class="w">
</span><span class="n">sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rinvgamma</span><span class="p">(</span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)))</span><span class="w">
</span><span class="c1"># Compute the expected log likelihood</span><span class="w">
</span><span class="n">E_log_y_b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="n">sum_x2</span><span class="w">
</span><span class="n">E_log_y_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_y_b</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_y_sigma2</span><span class="w">
</span><span class="c1"># Compute and return the ELBO</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_log_y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_sigma2</span><span class="w">
</span><span class="n">ELBO</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The function below implements coordinate ascent mean-field variational inference for our simple linear regression problem. Recall that the variational parameters are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nu &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right) \\[.5em]
\mu_\beta &= \frac{\sum_{i=1}^N y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\sigma^2_\beta &= \frac{\left(\frac{n + 1}{2}\right) \nu^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \enspace .
\end{aligned} %]]></script>
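<p>Spelled out in code, a single coordinate ascent pass over these updates looks as follows, using simulated data and an arbitrary starting value for $\sigma^2_\beta$. Note that $\mu_\beta$ depends only on the data and $\tau^2$, so it does not change across iterations.</p>

```r
set.seed(1)

# simulated data (illustration only)
n <- 100
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)
tau2 <- 0.25

sum_y2 <- sum(y^2)
sum_x2 <- sum(x^2)
sum_yx <- sum(x * y)

# mu_beta depends only on the data and tau2: it is fixed from the start
mu_beta <- sum_yx / (sum_x2 + 1/tau2)

# one coordinate ascent pass from an arbitrary starting value
sigma2_beta <- 1
nu <- 1/2 * (sum_y2 - 2 * sum_yx * mu_beta +
             (sigma2_beta + mu_beta^2) * (sum_x2 + 1/tau2))
sigma2_beta <- ((n + 1)/2 / nu) / (sum_x2 + 1/tau2)

c(mu_beta = mu_beta, nu = nu, sigma2_beta = sigma2_beta)
```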
<p>The following function implements the iterative updating of these variational parameters until the ELBO has converged.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Implements CAVI for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
</span><span class="cd">#' @param tau2 prior variance for the standardized effect size</span><span class="w">
</span><span class="cd">#' @returns parameters for the variational densities and ELBO</span><span class="w">
</span><span class="n">lmcavi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e5</span><span class="p">,</span><span class="w"> </span><span class="n">epsilon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e-2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># is not being updated through variational inference!</span><span class="w">
</span><span class="n">beta_mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_yx</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_mu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">beta_mu</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">has_converged</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">epsilon</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="c1"># while the ELBO has not converged</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">has_converged</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]][</span><span class="n">j</span><span class="p">],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">nu_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="n">beta_sd_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="c1"># used in the update of beta_sd and nu</span><span class="w">
</span><span class="n">E_qA</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd_prev</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="c1"># update the variational parameters for sigma2 and beta</span><span class="w">
</span><span class="n">nu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_qA</span><span class="w">
</span><span class="n">beta_sd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(((</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">E_qA</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">))</span><span class="w">
</span><span class="c1"># update results object</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]],</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]],</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute new ELBO</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Let’s run this on a simulated data set of size $n = 100$ with a true coefficient of $\beta = 0.30$ and a true error variance of $\sigma^2 = 1$. We assign $\beta$ a Gaussian prior with variance $\tau^2 = 0.25$ so that values for $\lvert \beta \rvert$ smaller than one standard deviation ($0.50$) <a href="https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule">receive about $0.68$</a> prior probability.</p>
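<p>This prior probability is easy to verify numerically. The following quick check (sketched in Python using only the standard library, although the post’s own code is in R) computes the mass a zero-mean Gaussian with variance $\tau^2 = 0.25$ places on $\lvert \beta \rvert &lt; 0.50$:</p>

```python
import math

def normal_cdf(x, mu=0.0, sd=1.0):
    # Gaussian CDF via the error function (stdlib only)
    return 0.5 * (1 + math.erf((x - mu) / (sd * math.sqrt(2))))

tau2 = 0.25                 # prior variance of beta
sd = math.sqrt(tau2)        # prior standard deviation: 0.50

# prior mass within one standard deviation, i.e. |beta| < 0.50
mass = normal_cdf(0.50, 0, sd) - normal_cdf(-0.50, 0, sd)
print(round(mass, 4))  # → 0.6827
```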
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gen_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="o">*</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gen_dat</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmcavi</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## $nu
## [1] 5.00000 88.17995 45.93875 46.20205 46.19892 46.19895
##
## $beta_mu
## [1] 0.2800556
##
## $beta_sd
## [1] 1.00000000 0.08205605 0.11368572 0.11336132 0.11336517 0.11336512
##
## $ELBO
## [1] 0.0000 -297980.0495 493.4807 -281.4578 -265.1289 -265.3197</code></pre></figure>
<p>From the output, we see that the ELBO and the variational parameters have converged. In the next section, we compare these results to results obtained with Stan.</p>
<h2 id="comparison-with-stan">Comparison with Stan</h2>
<p>Whenever one goes down a rabbit hole of calculations, it is good to sanity-check one’s results. Here, we use Stan’s variational inference scheme to check whether our results are comparable. It assumes a Gaussian variational density for each parameter after transforming them to the real line and automates inference in a “black-box” way so that no problem-specific calculations are required (see Kucukelbir, Ranganath, Gelman, & Blei, 2015). Subsequently, we compare our results to the exact posteriors arrived at by Markov chain Monte Carlo. The simple linear regression model in Stan is:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">data {
int<lower=0> n;
vector[n] y;
vector[n] x;
real tau;
}
parameters {
real b;
real<lower=0> sigma;
}
model {
target += -log(sigma);
target += normal_lpdf(b | 0, sigma*tau);
target += normal_lpdf(y | b*x, sigma);
}</code></pre></figure>
<p>We use Stan’s black-box variational inference scheme:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rstan'</span><span class="p">)</span><span class="w">
</span><span class="c1"># save the above model to a file and compile it</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stan_model</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'stan-compiled/variational-regression.stan'</span><span class="p">)</span><span class="w">
</span><span class="n">stan_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">dat</span><span class="p">),</span><span class="w"> </span><span class="s1">'x'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vb</span><span class="p">(</span><span class="w">
</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">output_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20000</span><span class="p">,</span><span class="w"> </span><span class="n">adapt_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w">
</span><span class="n">init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'b'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="s1">'sigma'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>This gives estimates similar to ours:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Inference for Stan model: variational-regression.
## 1 chains, each with iter=20000; warmup=0; thin=1;
## post-warmup draws per chain=20000, total post-warmup draws=20000.
##
## mean sd 2.5% 25% 50% 75% 97.5%
## b 0.28 0.13 0.02 0.19 0.28 0.37 0.54
## sigma 0.99 0.09 0.82 0.92 0.99 1.05 1.18
## lp__ 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## Approximate samples were drawn using VB(meanfield) at Thu Mar 19 10:45:28 2020.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## We recommend genuine 'sampling' from the posterior distribution for final inferences!</code></pre></figure>
<p>Their recommendation is prudent. If you run the code with different seeds, you can get quite different results. For example, the posterior mean of $\beta$ can range from $0.12$ to $0.45$, and the posterior standard deviation can be as low as $0.03$; in all these settings, Stan indicates that the ELBO has converged, but it seems that it has converged to a different local optimum for each run. (For seed = 3, Stan gives completely nonsensical results). Stan warns that the algorithm is experimental and may be unstable, and it is probably wise to not use it in production.</p>
<p><em>Update</em>: As Ben Goodrich points out in the comments, there is some cool work on providing diagnostics for variational inference; see <a href="https://statmodeling.stat.columbia.edu/2018/06/27/yes-work-evaluating-variational-inference/">this</a> blog post and the paper by Yao, Vehtari, Simpson, & Gelman (<a href="https://arxiv.org/abs/1802.02538">2018</a>) as well as the paper by Huggins, Kasprzak, Campbell, & Broderick (<a href="https://arxiv.org/abs/1910.04102">2019</a>).</p>
<p>Although the posterior distribution for $\beta$ and $\sigma^2$ is available in closed form (see the <em>Post Scriptum</em>), we check our results against exact inference using Markov chain Monte Carlo by visual inspection.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<p>The Figure below overlays our closed-form results to the histogram of posterior samples obtained using Stan.</p>
<p><img src="/assets/img/2019-10-30-Variational-Inference.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>Note that the posterior variance of $\beta$ is slightly <em>overestimated</em> when using our variational scheme. This is in contrast to the fact that variational inference generally <em>underestimates</em> variances. Note also that Bayesian inference using Markov chain Monte Carlo is very fast on this simple problem. However, the comparative advantage of variational inference becomes clear by increasing the sample size: for sample sizes as large as $n = 100000$, our variational inference scheme takes less than a tenth of a second!</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen how to turn an integration problem into an optimization problem using variational inference. Assuming that the variational densities are independent, we have derived the optimal variational densities for a simple linear regression problem with one predictor. While using variational inference for this problem is unnecessary since everything is available in closed-form, I have focused on such a simple problem so as to not confound this introduction to variational inference by the complexity of the model. Still, the derivations were quite lengthy. They were also entirely specific to our particular problem, and thus generic “black-box” algorithms which avoid problem-specific calculations hold great promise.</p>
<p>We also implemented coordinate ascent mean-field variational inference (CAVI) in R and compared our results to results obtained via variational and exact inference using Stan. We have found that one probably should not trust Stan’s variational inference implementation, and that our results closely correspond to the exact procedure. For more on variational inference, I recommend the excellent review article by Blei, Kucukelbir, and McAuliffe (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>).</p>
<hr />
<p><em>I would like to thank Don van den Bergh for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="normal-inverse-gamma-distribution">Normal-inverse-gamma Distribution</h3>
<p>The posterior distribution is a <a href="https://en.wikipedia.org/wiki/Normal-inverse-gamma_distribution">Normal-inverse-gamma distribution</a>:</p>
<script type="math/tex; mode=display">p(\beta, \sigma^2 \mid \mathbf{y}) = \sqrt{\frac{\lambda}{2\pi\sigma^2}} \frac{\gamma^{\alpha}}{\Gamma\left(\alpha\right)} \left(\sigma^2\right)^{-\alpha - 1} \text{exp}\left(-\frac{2\gamma + \lambda\left(\beta - \mu\right)^2}{2\sigma^2}\right) \enspace ,</script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mu &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\lambda &= \sum_{i=1}^n x_i^2 + \frac{1}{\tau^2} \\[.5em]
\alpha &= \frac{n + 1}{2} \\[.5em]
\gamma &= \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i x_i\right)^2}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}\right)\right) \enspace .
\end{aligned} %]]></script>
<p>Note that the marginal posterior distribution for $\beta$ is actually a Student-t distribution, contrary to what we assume in our variational inference scheme.</p>
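<p>As a toy sanity check on these closed-form parameters, the following sketch (in Python; the data values are made up and chosen to be hand-computable) computes the posterior mean $\mu$ and precision term $\lambda$, using the same $\sum_i x_i^2$ quantity that appears as <code>sum_x2</code> in the R implementation above:</p>

```python
# Hand-computable toy data where y = 2x exactly, with prior scale tau2 = 1
x = [1.0, 2.0]
y = [2.0, 4.0]
tau2 = 1.0

sum_xy = sum(xi * yi for xi, yi in zip(x, y))  # 10
sum_x2 = sum(xi ** 2 for xi in x)              # 5

mu = sum_xy / (sum_x2 + 1 / tau2)  # posterior mean of beta
lam = sum_x2 + 1 / tau2            # posterior precision term

print(mu, lam)  # → 1.6666666666666667 6.0
```

<p>Note the shrinkage: least squares alone would give $\sum_i y_i x_i / \sum_i x_i^2 = 2$, but the Gaussian prior pulls the posterior mean down to $10/6 \approx 1.67$.</p>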
<h2 id="references">References</h2>
<ul>
<li>Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>). Variational inference: A review for statisticians. <em>Journal of the American Statistical Association, 112</em>(518), 859-877.</li>
<li>Huggins, J. H., Kasprzak, M., Campbell, T., & Broderick, T. (<a href="https://arxiv.org/abs/1910.04102">2019</a>). Practical Posterior Error Bounds from Variational Objectives. <em>arXiv preprint</em> arXiv:1910.04102.</li>
<li>Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (<a href="http://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan">2015</a>). Automatic variational inference in Stan. In <em>Advances in Neural Information Processing Systems</em> (pp. 568-576).</li>
<li>Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (<a href="http://www.jmlr.org/papers/volume18/16-107/16-107.pdf">2017</a>). Automatic differentiation variational inference. <em>The Journal of Machine Learning Research, 18</em>(1), 430-474.</li>
<li>Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (<a href="https://arxiv.org/abs/1802.02538">2018</a>). Yes, but did it work?: Evaluating variational inference. <em>arXiv preprint</em> arXiv:1802.02538.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The first part of this blog post draws heavily on the excellent review article by Blei, Kucukelbir, and McAuliffe (2017), and so I use their (machine learning) notation. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderHarry Potter and the Power of Bayesian Constrained Inference2019-09-28T08:00:00+00:002019-09-28T08:00:00+00:00https://fabiandablander.com/r/Bayes-Potter<p>If you are reading this, you are probably a Ravenclaw. Or a Hufflepuff. Certainly not a Slytherin … but maybe a Gryffindor?</p>
<p>In this blog post, we let three subjective Bayesians predict the outcome of ten coin flips. We will derive prior predictions, evaluate their accuracy, and see how fortune favours the bold. We will also discover a neat trick that allows one to easily compute Bayes factors for models with parameter restrictions compared to models without such restrictions, and use it to answer a question we truly care about: are Slytherins really the bad guys?</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>As in a <a href="https://fabiandablander.com/r/Regularization.html">previous blog post</a>, we start by studying coin flips. Let $\theta \in [0, 1]$ be the bias of the coin and let $y$ denote the number of heads out of $n$ coin flips. We use the Binomial likelihood</p>
<script type="math/tex; mode=display">p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,</script>
<p>and a Beta prior for $\theta$:</p>
<script type="math/tex; mode=display">p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .</script>
<p>This prior is <em>conjugate</em> for this likelihood which means that the posterior is again a Beta distribution. The Figure below shows two examples of this.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In this blog post, we will use a <em>prior predictive</em> perspective on model comparison by means of Bayes factors. For an extensive contrast with a perspective based on <em>posterior prediction</em>, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this blog post</a>. The Bayes factor indicates how much better a model $\mathcal{M}_1$ predicts the data $y$ <em>relative to another model</em> $\mathcal{M}_0$:</p>
<script type="math/tex; mode=display">\text{BF}_{10} = \frac{p(y \mid \mathcal{M}_1)}{p(y \mid \mathcal{M}_0)} \enspace ,</script>
<p>where we can write the <em>marginal likelihood</em> of a generic model $\mathcal{M}$ more complicatedly to see the dependence on the model’s priors:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .</script>
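<p>For a one-parameter model like ours, this integral is straightforward to approximate numerically. The sketch below averages the Binomial likelihood over a $\text{Beta}(a, b)$ prior, the only assumed prior family in this post:</p>

```r
# Marginal likelihood of y heads out of n flips,
# averaged over a Beta(a, b) prior on theta
marginal_likelihood <- function(y, n, a, b) {
  integrate(function(theta) dbinom(y, n, theta) * dbeta(theta, a, b), 0, 1)$value
}

marginal_likelihood(9, 10, 1, 1)  # 1/11 under a uniform prior
```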
<p>After these preliminaries, in the next section, we visit Ron, Harry, and Hermione in Hogwarts.</p>
<h1 id="the-hogwarts-prediction-contest">The Hogwarts prediction contest</h1>
<p>Ron, Harry, and Hermione just came back from an exhausting adventure — Death Eaters and all. They deserve a break, and Hermione suggests a small prediction contest to relax. Ron is initially put off; relaxing by thinking? That’s not his style. Harry does not care either way; both are eventually convinced.</p>
<p>The goal of the contest is to accurately predict the outcome of $n = 10$ coin flips. Luckily, this is not a particularly complicated problem to model, and we can use the Binomial likelihood we have discussed above. In the next section, Ron, Harry, and Hermione — all subjective Bayesians — clearly state their prior beliefs, which is required to make predictions.</p>
<h2 id="prior-beliefs">Prior beliefs</h2>
<p>Ron is not big on thinking, and so trusts his intuition that coins are usually unbiased; he specifies a point mass on $\theta = 0.50$ as his prior. Harry spreads his bets evenly, and believes that all chances governing the coin flip’s outcome are equally likely; he puts a uniform prior on $\theta$. Hermione, on the other hand, believes that the coin <em>cannot</em> be biased towards tails; instead, she believes that all values $\theta \in [0.50, 1]$ are equally likely. She thinks this because Dobby — the house elf — is the one who throws the coin, and she has previously observed him passing time by flipping coins, which strangely almost always landed heads. To sum up, their priors are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{Ron} &: \theta = 0.50 \\[.5em]
\text{Harry} &: \theta \sim \text{Beta}(1, 1) \\[.5em]
\text{Hermione} &: \theta \sim \text{Beta}(1, 1)\mathbb{I}(0.50, 1) \enspace ,
\end{aligned} %]]></script>
<p>which are visualized in the Figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In the next section, the three use their beliefs to make probabilistic predictions.</p>
<h2 id="prior-predictions">Prior predictions</h2>
<p>Ron, Harry, and Hermione are subjective Bayesians and therefore evaluate their performance by their respective predictive accuracy. Each of the trio has a <em>prior predictive distribution</em>. For Ron, true to character, this is the easiest to derive. We associate model $\mathcal{M}_0$ with him and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_0) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_0) \, p(\theta \mid \mathcal{M}_0) \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} 0.50^y (1 - 0.50)^{n - y} \enspace ,
\end{aligned} %]]></script>
<p>where the integral — the sum! — is just over the value $\theta = 0.50$. Ron’s prior predictive distribution is simply a Binomial distribution. He is delighted by this fact, and enjoys a short rest while the others derive their predictions.</p>
<p>It is Harry’s turn, and he is a little put off by his integration problem. However, he realizes that the integrand is an unnormalized Beta distribution, and swiftly writes down its normalizing constant, the Beta function. Associating $\mathcal{M}_1$ with him, his steps are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_1) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \frac{1}{\text{B}(1, 1)} \theta^{1 - 1} (1 - \theta)^{1 - 1} \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} \text{Beta}(y + 1, n - y + 1) \enspace ,
\end{aligned} %]]></script>
<p>which is a <a href="https://en.wikipedia.org/wiki/Beta-binomial_distribution">Beta-Binomial distribution</a> with $\alpha = \beta = 1$.</p>
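<p>With $\alpha = \beta = 1$, the Beta-Binomial distribution reduces to a discrete uniform: every outcome $y \in \{0, 1, \ldots, n\}$ receives probability $\frac{1}{n + 1}$. A quick check of Harry’s result:</p>

```r
# Harry's prior predictive: Beta-Binomial with alpha = beta = 1,
# which assigns probability 1 / (n + 1) to every outcome
n <- 10
y <- 0:n
harry <- choose(n, y) * beta(y + 1, n - y + 1)

harry       # all entries equal 1/11
sum(harry)  # 1, as a probability distribution must
```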
<p>Hermione’s integral is the most complicated of the three, but she is also the smartest of the bunch. She is a master of the wizardry that is computer programming, which allows her to solve the integral numerically.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> We associate $\mathcal{M}_r$, which stands for <em>restricted</em> model, with her and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_r) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta \\[.5em]
&= \int_{0.50}^1 {n \choose y} \theta^y (1 - \theta)^{n - y} \, 2 \, \mathrm{d}\theta \\[.5em]
&= 2{n \choose y}\int_{0.50}^1 \theta^y (1 - \theta)^{n - y} \mathrm{d}\theta \enspace .
\end{aligned} %]]></script>
<p>We can draw from the prior predictive distributions by simulating from the prior and then making predictions through the likelihood. For Hermione, for example, this yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">nr_draws</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w">
</span><span class="n">theta_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theta_Hermione</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 10 10 10 3 7 10 8 9 6 9 9 6 8 9 8 10 6 10 5 7</code></pre></figure>
<p>Let’s visualize Ron’s, Harry’s, and Hermione’s prior predictive distributions to get a better feeling for what they believe are likely coin flip outcomes. First, we implement their prior predictions in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Ron</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="n">n</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Harry</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">int</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">theta</span><span class="o">^</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta</span><span class="p">)</span><span class="o">^</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">int</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Even though Ron believes that $\theta = 0.50$, this does not mean that his prior prediction puts all mass on $y = 5$; deviations from this value are plausible. Harry’s prior predictive distribution also makes sense: since he believes all values for $\theta$ to be equally likely, he should believe all outcomes are equally likely. Hermione, on the other hand, believes that $\theta \in [0.50, 1]$, so her prior probabilities for outcomes with few heads ($y < 5$) drastically decrease.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>After the three have clearly stated their prior beliefs and derived their prior predictions, Dobby throws a coin ten times. The coin comes up heads nine times. In the next section, we discuss the relative predictive performance of Ron, Harry, and Hermione based on these data.</p>
<h2 id="evaluating-predictions">Evaluating predictions</h2>
<p>To assess the relative predictive performance of Ron, Harry, and Hermione, we need to compute the probability mass of $y = 9$ for their respective prior predictive distributions. Compared to Ron, Hermione did roughly 19 times better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 18.50909</code></pre></figure>
<p>Harry, on the other hand, did about 9 times better than Ron:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 9.309091</code></pre></figure>
<p>With these two comparisons, we also know by how much Hermione outperformed Harry, since by transitivity we have:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_0)} \times \frac{p(y \mid \mathcal{M}_0)}{p(y \mid \mathcal{M}_1)} = \text{BF}_{r0} \times \frac{1}{\text{BF}_{10}} \approx 2 \enspace ,</script>
<p>which is indeed correct:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>Note that this is also immediately apparent from the visualizations above, where Hermione’s allocated probability mass is about twice as large as Harry’s for the case where $y = 9$.</p>
<p>Hermione was bold in her prediction, and was rewarded with being favoured by a factor of two in predictive performance. Note that if her predictions had been even bolder, say restricting her prior to $\theta \in [0.80, 1]$, she would have reaped a higher reward than a Bayes factor of two. Contrast this with Dobby throwing the coin ten times and only one head showing. Then Harry’s marginal likelihood is still ${10 \choose 1}\text{Beta}(2, 10) = \frac{1}{11}$. However, Hermione’s is not twice as much; instead, it is a mere $0.001065$, which would result in a Bayes factor of about $85$ in favour of Harry!</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Hermione</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 85.33333</code></pre></figure>
<p>This means that with bold predictions, one can also lose a lot. However, such a loss is tremendously insightful, since Hermione would immediately realize where she went wrong. For a discussion that also points out the flexibility of Bayesian model comparison, see Etz, Haaf, Rouder, & Vandekerckhove (2018).</p>
<p>In the next section, we will discover a nice trick which simplifies the computation of the Bayes factor; we do not need to derive marginal likelihoods, but can simply look at the prior and the posterior distribution of the parameter of interest in the unrestricted model.</p>
<h1 id="prior--posterior-trick">Prior / Posterior trick</h1>
<p>As it turns out, the relative predictive performance of Hermione compared to Harry is given by the ratio of the purple area to the blue area in the figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>In other words, the Bayes factor in favour of the <em>restricted</em> model (i.e., Hermione) compared to the <em>unrestricted</em> or <em>encompassing</em> model (i.e., Harry) is given by the posterior probability of $\theta$ being in line with the restriction compared to the prior probability of $\theta$ being in line with the restriction. We can check this numerically:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># (1 - pbeta(0.50, 10, 2)) / 0.50 would also work</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">dbeta</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">0.50</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>This is a very cool result which, to my knowledge, was first described in Klugkist, Kato, & Hoijtink (2005). In the next section, we prove it.</p>
<h2 id="proof">Proof</h2>
<p>The proof uses two insights. First, note that we can write the priors in the restricted model, $\mathcal{M}_r$, as priors in the encompassing model, $\mathcal{M}_1$, subject to some constraints. In the Hogwarts prediction context, Hermione’s prior was a restricted version of Harry’s prior:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid \mathcal{M}_r) &= p(\theta \mid \mathcal{M}_1)\mathbb{I}(0.50, 1) \\[1em]
&= \begin{cases} \frac{p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} & \text{if} \hspace{1em} \theta \in [0.50, 1] \\[1em] 0 & \text{otherwise}\end{cases}
\end{aligned} %]]></script>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>We have to divide by the term</p>
<script type="math/tex; mode=display">K = \int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta = 0.50 \enspace ,</script>
<p>so that the restricted prior integrates to 1, as all proper probability distributions must. As a direct consequence, note that the density of a value $\theta = \theta^{\star}$ is given by:</p>
<script type="math/tex; mode=display">p(\theta^{\star} \mid \mathcal{M}_r) = p(\theta^{\star} \mid \mathcal{M}_1) \cdot \frac{1}{K} \enspace ,</script>
<p>where $K$ is the renormalization constant. This means that we can rewrite terms which include the restricted prior in terms of the unrestricted prior from the encompassing model. This also holds for the posterior!</p>
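<p>For Hermione’s uniform prior, this renormalization is easy to check numerically, since $K = 0.50$:</p>

```r
# Hermione's restricted prior density equals Harry's unrestricted
# density rescaled by 1 / K, for any theta within the restriction
K <- 0.50
theta_star <- 0.70

dunif(theta_star, 0.50, 1)   # restricted prior density: 2
dunif(theta_star, 0, 1) / K  # unrestricted density times 1 / K: also 2
```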
<p>To see that we can also write the restricted posterior in terms of the unrestricted posterior from the encompassing model, note that the likelihood is the same under both models and that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid y, \mathcal{M}_r) &= \frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K}}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \\[.5em]
&= \frac{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta}}{\int_{0.50}^1 \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(\theta \mid y, \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta} \enspace ,
\end{aligned} %]]></script>
<p>where we have to renormalize by</p>
<script type="math/tex; mode=display">Z = \int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta \enspace ,</script>
<p>which is</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pbeta</span><span class="p">(</span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 0.9941406</code></pre></figure>
<p>The figure below visualizes Harry’s and Hermione’s posterior. Sensibly, since Hermione excluded all $\theta \in [0, 0.50]$ in her prior, such values receive zero credence in her posterior. However, the difference between Harry’s and Hermione’s posterior distributions is much smaller than the difference between their priors. This is reflected in $Z$ being close to 1.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>Similar to the prior, we can write the density of a value $\theta = \theta^\star$ in terms of the encompassing model:</p>
<script type="math/tex; mode=display">p(\theta^{\star} \mid y, \mathcal{M}_r) = p(\theta^{\star} \mid y, \mathcal{M}_1) \cdot \frac{1}{Z} \enspace .</script>
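<p>We can verify this relation numerically for our data of $y = 9$ heads out of $n = 10$ flips, where Harry’s unrestricted posterior is $\text{Beta}(10, 2)$:</p>

```r
# Hermione's restricted posterior density equals Harry's posterior
# density rescaled by 1 / Z, for any theta within the restriction
Z <- 1 - pbeta(0.50, 10, 2)
theta_star <- 0.70

# restricted posterior computed directly via Bayes' rule on [0.50, 1]
numerator <- dbinom(9, 10, theta_star) * dunif(theta_star, 0.50, 1)
denominator <- integrate(
  function(theta) dbinom(9, 10, theta) * dunif(theta, 0.50, 1), 0.50, 1
)$value

numerator / denominator       # restricted posterior density
dbeta(theta_star, 10, 2) / Z  # rescaled unrestricted posterior: same value
```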
<p>Now that we have established that we can write both the prior and the posterior density of parameters in the restricted model in terms of the parameters in the unrestricted model, as a second step, note that we can swap the posterior and the marginal likelihood terms in Bayes’ rule such that:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}_1) = \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)} \enspace ,</script>
<p>from which it follows that:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_1)} = \frac{\frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{p(\theta \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)}} \enspace .</script>
<p>Now suppose that we have values that are in line with the restriction, i.e., $\theta = \theta^{\star}$. Then:</p>
<script type="math/tex; mode=display">\begin{aligned}
\text{BF}_{r1} = \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star\mid \mathcal{M}_r)}{p(\theta^\star \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star \mid \mathcal{M}_1) \, \frac{1}{K}}{p(\theta^\star \mid y, \mathcal{M}_1) \, \frac{1}{Z}}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, \frac{1}{K}}{\frac{1}{Z}}}{p(y \mid \theta^\star, \mathcal{M}_1)} = \frac{\frac{1}{K}}{\frac{1}{Z}} = \frac{Z}{K} \enspace ,
\end{aligned}</script>
<p>where we have used the previous insights and the fact that the likelihood under $\mathcal{M}_r$ and $\mathcal{M}_1$ is the same. If we expand the constants for our previous problem, we have:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{Z}{K} = \frac{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace ,</script>
<p>which is, as claimed above, the posterior probability of values for $\theta$ that are in line with the restriction divided by the prior probability of values for $\theta$ that are in line with the restriction. Note that this holds for arbitrary restrictions of an arbitrary number of parameters (see Klugkist, Kato, & Hoijtink, 2005). In the limit where we take the restriction to be infinitesimally small, that is, constrain the parameter to be a point value, this results in the Savage-Dickey density ratio (Wetzels, Grasman, & Wagenmakers, 2010).</p>
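<p>The Savage-Dickey special case can be illustrated on our running example. Testing Ron’s point null $\theta = 0.50$ against Harry’s encompassing model, the Bayes factor is the ratio of posterior to prior density at the point value, and it reproduces the marginal likelihood ratio computed earlier:</p>

```r
# Savage-Dickey density ratio: BF_01 is the posterior density of theta
# at the point of interest divided by its prior density
bf01 <- dbeta(0.50, 10, 2) / dbeta(0.50, 1, 1)
1 / bf01  # BF_10 of about 9.309, matching Harry(9) / Ron(9) from above
```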
<p>In the next section, we apply this idea to a data set that relates Hogwarts Houses to personality traits.</p>
<h1 id="hogwarts-houses-and-personality">Hogwarts Houses and personality</h1>
<p>So, are you a Slytherin, Hufflepuff, Ravenclaw, or Gryffindor? And what does this say about your personality?</p>
<p>Inspired by Crysel et al. (2015), Lea Jakob, Eduardo Garcia-Garzon, Hannes Jarke, and I analyzed self-reported personality data from 847 people as well as their self-reported Hogwarts House affiliation.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> We wanted to answer questions such as: do people who report belonging to Slytherin tend to score highest on Narcissism, Machiavellianism, and Psychopathy? Are Hufflepuffs the most agreeable, and Gryffindors the most extraverted? The Figure below visualizes the raw data.</p>
<div style="text-align:center;">
<img src="../assets/img/Potter-Personality.png" align="center" style="margin-top: -10px; padding-bottom: 0px;" width="680" height="540" />
</div>
<p>We used a between-subjects ANOVA as our model and, in the case of, for example, Agreeableness, compared the following hypotheses:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{H}_0&: \mu_H = \mu_G = \mu_R = \mu_S \\[.5em]
\mathcal{H}_r&: \mu_H > (\mu_G , \mu_R , \mu_S) \\[.5em]
\mathcal{H}_1&: \mu_H , \mu_G , \mu_R , \mu_S
\end{aligned} %]]></script>
<p>We used the BayesFactor R package to compute the Bayes factor in favour of $\mathcal{H}_1$ compared to $\mathcal{H}_0$. For the restricted hypothesis $\mathcal{H}_r$, we used the prior/posterior trick outlined above; and indeed, we found strong evidence in favour of the notion that, for example, Hufflepuffs score highest on Agreeableness. Curious about Slytherin and the other Houses? You can read the published paper with all the details <a href="https://www.collabra.org/article/10.1525/collabra.240/">here</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>Participating in a relaxing prediction contest, we saw how three subjective Bayesians named Ron, Harry, and Hermione formalized their beliefs and derived their predictions about the likely outcome of ten coin flips. By restricting her prior beliefs about the bias of the coin to exclude values smaller than $\theta = 0.50$, Hermione was the boldest in her predictions and was ultimately rewarded. However, had the outcome of the coin flips turned out differently, say $y = 2$, Hermione would have immediately realized how wrong her beliefs were. I think we as scientists need to be more like Hermione: we need to make more precise predictions, allowing us to construct more powerful tests and “fail” in insightful ways.</p>
<p>We also saw a neat trick by which one can compute the Bayes factor in favour of a restricted model compared to an unrestricted model by estimating the proportion of prior and posterior values of the parameter that are in line with the restriction — no painstaking computation of marginal likelihoods required! We used this trick to find evidence for what we all knew deep in our hearts already: Hufflepuffs are <em>so</em> agreeable.</p>
<hr />
<p><em>I would like to thank Sophia Crüwell and Lea Jakob for helpful comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Klugkist, I., Kato, B., & Hoijtink, H. (<a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9574.2005.00279.x">2005</a>). Bayesian model selection using encompassing priors. <em>Statistica Neerlandica, 59</em>(1), 57-69.</li>
<li>Wetzels, R., Grasman, R. P., & Wagenmakers, E. J. (<a href="https://www.sciencedirect.com/science/article/pii/S0167947310001180">2010</a>). An encompassing prior generalization of the Savage–Dickey density ratio. <em>Computational Statistics & Data Analysis, 54</em>(9), 2094-2102.</li>
<li>Etz, A., Haaf, J. M., Rouder, J. N., & Vandekerckhove, J. (<a href="https://journals.sagepub.com/doi/full/10.1177/2515245918773087">2018</a>). Bayesian inference and testing any hypothesis you can specify. <em>Advances in Methods and Practices in Psychological Science, 1</em>(2), 281-295.</li>
<li>Crysel, L. C., Cook, C. L., Schember, T. O., & Webster, G. D. (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0191886915002615">2015</a>). Harry Potter and the measures of personality: Extraverted Gryffindors, agreeable Hufflepuffs, clever Ravenclaws, and manipulative Slytherins. <em>Personality and Individual Differences, 83</em>, 174-179.</li>
<li>Jakob, L., Garcia-Garzon, E., Jarke, H., & Dablander, F. (<a href="https://www.collabra.org/article/10.1525/collabra.240/">2019</a>). The Science Behind the Magic? The Relation of the Harry Potter “Sorting Hat Quiz” to Personality and Human Values. <em>Collabra: Psychology, 5</em>(1).</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The analytical solution is <a href="https://www.wolframalpha.com/input/?i=Integral%5Btheta%5Ey+*+%281+-+theta%29%5E%28n+-+y%29%2C+theta%2C+0.50%2C+1%5D">unpleasant</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You can discover your Hogwarts House affiliation at <a href="https://www.pottermore.com/">https://www.pottermore.com/</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderLove affairs and linear differential equations2019-08-29T11:00:00+00:002019-08-29T11:00:00+00:00https://fabiandablander.com/r/Linear-Love<p>Differential equations are a powerful tool for modeling how systems change over time, but they can be a little hard to get into. Love, on the other hand, is humanity’s perennial topic; some even claim it is all you need. In this blog post — inspired by Strogatz (1988, 2015) — I will introduce linear differential equations as a means to study the types of love affairs two people might find themselves in.</p>
<p>Do opposites attract? What happens to a relationship if lovers are out of touch with their own feelings? We will answer these and other questions using two coupled linear differential equations. On our journey, we will use graphical as well as mathematical methods to classify the types of relationships this modeling framework can accommodate. In a follow-up blog post, we will also play around with non-linear terms and add a third wheel to the mix, which can lead to chaos — in the technical sense of the term, of course. Excited? Then let’s get started!</p>
<h1 id="introducing-romeo">Introducing Romeo</h1>
<blockquote>
A lovestruck Romeo sang the streets of serenade <br />
Laying everybody low with a love song that he made <br />
Finds a streetlight, steps out of the shade <br />
Says something like, "You and me, babe, how about it?"
</blockquote>
<p>Romeo is quite the emotional type. Let $R(t)$ denote his feelings at time point $t$. Following common practice, we will usually write $R$ instead of $R(t)$, making the time dependence implicit. The process which describes how Romeo’s feelings change is rather simple: it depends only on Romeo’s current feelings. We write:</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}R}{\mathrm{d}t} = aR \enspace ,</script>
<p>which is a linear differential equation. Note that this <em>implicitly</em> encodes how Romeo’s feelings change over time, since when we know $R$ at time point $t$, we can compute the direction and speed with which $R$ will change — the derivative denotes velocity. Our goal, however, is to find an explicit, closed-form expression for Romeo’s feelings at time point $t$. In this particular case, we can do this analytically:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR \\[.5em]
\frac{1}{aR}\mathrm{d}R &= \mathrm{d}t \\[.5em]
\frac{1}{a}\int \frac{1}{R}\mathrm{d}R &= \int \mathrm{d}t \\[.5em]
\frac{1}{a} \left[\text{log} \, R + C \right] &= t \\[.5em]
\text{log} \, R &= a t - C \\[.5em]
R &= e^{at - C} \enspace .
\end{aligned} %]]></script>
<p>A differential equation describes how something changes; to kickstart the process, we need an initial condition $R_0$. This allows us to find the constant of integration $C$. In particular, assume that $R = R_0$ at $t = 0$, which leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R_0 &= e^{-C} \\[.5em]
\text{log} \, R_0 &= -C \enspace ,
\end{aligned} %]]></script>
<p>such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R &= e^{at + \text{log} \, R_0} \\[.5em]
R &= R_0 e^{at} \enspace .
\end{aligned} %]]></script>
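<p>As a quick sanity check, the closed-form solution $R = R_0 e^{at}$ can be compared against a naive forward-Euler integration of $\mathrm{d}R/\mathrm{d}t = aR$. The sketch below is in Python rather than the R used later in this post, and the parameter values are arbitrary illustrations.</p>

```python
from math import exp

def euler(a, R0, t_end, dt=1e-4):
    """Forward-Euler integration of dR/dt = a * R from t = 0 to t_end."""
    R, t = R0, 0.0
    while t < t_end:
        R += a * R * dt  # step along the current velocity
        t += dt
    return R

a, R0, t_end = -0.5, 100.0, 1.0  # arbitrary illustrative values
closed_form = R0 * exp(a * t_end)
numerical = euler(a, R0, t_end)
print(closed_form, numerical)  # the two agree to within ~0.01
```

<p>Shrinking the step size <code>dt</code> brings the Euler trajectory arbitrarily close to the analytical solution, which is exactly what having a closed form saves us from doing.</p>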
<p>The left two panels of the figure below visualize how Romeo’s feelings change over time for $a > 0$ with initial condition $R_0 = 1$ (top) or $R_0 = -1$ (bottom). The right two panels show how his feelings change for $a < 0$ with $R_0 = 100$ (top) or $R_0 = -100$ (bottom).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>We conclude: Romeo is a simple guy, albeit with an emotion regulation problem. When the object of his affection is such that $a > 0$, his feelings will either grow exponentially towards mad love if he starts out with a positive first impression ($R_0 > 0$), or grow exponentially towards mad hatred if he starts out with a negative first impression ($R_0 < 0$). On the other hand, if $a < 0$, then regardless of his initial feelings, they will decay exponentially towards indifference.</p>
<p>For $R_0 = 0$, Romeo’s feelings never change. For any other initial condition, we have unhindered exponential growth when $a > 0$; it never stops. For any other initial condition and $a < 0$, we crash down to zero very rapidly. Thus $R = 0$ is a <em>fixed point</em> in both cases, which is <em>stable</em> for $a < 0$ but becomes <em>unstable</em> if $a > 0$. We can visualize this in <em>phase space</em> on a line. The phase space is filled with all possible trajectories, because each point can serve as the initial condition.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
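<p>The stability of the fixed point can also be probed numerically: nudge Romeo slightly away from indifference and watch where the closed-form solution takes him. A small Python sketch with illustrative values:</p>

```python
from math import exp

def R(t, R0, a):
    """Closed-form solution R(t) = R0 * exp(a * t)."""
    return R0 * exp(a * t)

eps = 1e-3  # a tiny perturbation away from the fixed point R = 0
# Stable case (a < 0): the perturbation shrinks back toward zero.
print(R(10.0, eps, -1.0))  # ~4.5e-8, effectively back at the fixed point
# Unstable case (a > 0): the same tiny perturbation blows up.
print(R(10.0, eps, +1.0))  # ~22, far away from the fixed point
```
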
<p>In the next section, a wonderful new episode in Romeo’s life begins: he meets Juliet.</p>
<h1 id="introducing-juliet">Introducing Juliet</h1>
<blockquote>
Juliet says, "Hey, it's Romeo, you nearly gave me a heart attack" <br />
He's underneath the window, she's singing, "Hey, la, my boyfriend's back <br />
You shouldn't come around here singing up at people like that <br />
Anyway, what you gonna do about it?"
</blockquote>
<p>Life becomes more complicated for Romeo now that Juliet is in his life. It is their first real relationship, and they have much to learn. We start simple. Let $J$ denote Juliet’s feelings for Romeo, and let $R$ denote Romeo’s feelings for Juliet. We can extend our single linear differential equation from above to a system of two linear differential equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= dJ \enspace .
\end{aligned} %]]></script>
<p>Using the results from above, the solutions to the two differential equations are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R(t) &= R_0 e^{at} \\[.5em]
J(t) &= J_0 e^{dt} \enspace ,
\end{aligned} %]]></script>
<p>where $R(t)$ and $J(t)$ give the trajectories of love for Romeo and Juliet, respectively, and $J_0$ is Juliet’s initial feeling towards Romeo at $t = 0$. In contrast to the one-dimensional phase diagram from above, we now have a two-dimensional picture which is known as a <em>vector field</em>.</p>
<p>Analogously to the case of a single differential equation, $a < 0$ and $d < 0$ imply exponential decay for Romeo and Juliet’s love, and $a > 0$ and $d > 0$ imply exponential growth. The left figure below visualizes decay: whatever the initial state of their love, it will crash into the origin of indifference. The figure on the right visualizes growth: whatever the initial state, except for indifference, their feelings will grow exponentially and eventually consume them.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>This can result in happy, ever increasing love if they start out liking each other (top right quadrant), but can also result in an increasingly violent feud if they start out disliking each other (bottom left quadrant). For asymmetric starts, one of them will be hopelessly in love with the other, while the other’s hate grows unboundedly. The fixed point (0, 0) is <em>stable</em> on the left, as any tiny perturbation will move the system towards it. In contrast, the fixed point on the right is <em>unstable</em>, as any ounce of love or hate, no matter how small, will make the system explode. One unfortunate subtlety arises, however: if Romeo loves Juliet, but Juliet is indifferent, then Juliet will forever stay indifferent even though Romeo’s love grows without bound.</p>
<p>Another interesting case occurs when their affection is asymmetric, i.e., $a \neq d$. The figure below on the left shows one such case for negative parameters: we see that whatever feelings Juliet has for Romeo, they decay faster than the feelings Romeo has for Juliet. Moreover, since $(a, d) < 0$, the origin is stable. The figure on the right shows a more impactful asymmetry: Romeo’s feelings decay ($a < 0$), but Juliet’s increase ($d > 0$). Regardless of what the initial feelings of Romeo are, he will always end up in a state of indifference with respect to Juliet (all arrows point to the y-axis). Juliet, on the other hand, will go increasingly mad with love or hate, depending on her initial feelings — the exception being if she starts out with indifference ($J_0 = 0$): then she will stay indifferent. This type of fixed point is called a <em>saddle point</em>, which occurs if there is one vector along which the system is stable (here the x-axis) and one vector along which the system is unstable (here the y-axis).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>What happens if Romeo’s feelings never change, i.e., $a = 0$? This is visualized in the figure on the left below: Romeo’s feelings will always stay at the initial point. Juliet’s feelings decrease ($d < 0$), so regardless of where she is, the system will end up at a stable fixed point on the x-axis. A similar situation occurs if Juliet’s feelings never change, and Romeo’s feelings decay ($a < 0$), which is visualized in the figure on the right: all points on the y-axis are stable fixed points. If the moving party’s feelings were instead to grow rather than decay, the fixed points would be unstable. The most boring case is $a = 0$ and $d = 0$, because then every point on the plane is a fixed point: however the two lovers start, they will never change.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>Note that in all the love affairs described above, the feelings of Romeo and Juliet are actually <em>independent of each other</em>. They <em>do not communicate</em> with each other, and <em>we all know that communication is key</em>! In the next section, Romeo and Juliet’s relationship has matured, and they start taking each other seriously. Formally, we <em>couple</em> the two love birds and analyze what types of love this can set free.</p>
<h1 id="coupled-differential-equations">Coupled differential equations</h1>
<p>In the previous section, we saw that the behaviour of the system was determined entirely by the values of $a$ and $d$ — depending on whether $a$ or $d$ were positive, negative, or zero, the system would either be stable or unstable along the $R$ or $J$ dimension. Incorporating communication complicates the system, but is ultimately for the better. To model the fact that Romeo and Juliet now respond to each other’s feelings, we simply write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR + bJ\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= cR + dJ \enspace ,
\end{aligned} %]]></script>
<p>or in matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix}
\frac{\mathrm{d}R}{\mathrm{d}t} \\
\frac{\mathrm{d}J}{\mathrm{d}t}
\end{pmatrix} &= \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} R \\ J\end{pmatrix} \\[.5em]
\dot{\mathbf{x}} &= A \mathbf{x} \enspace .
\end{aligned} %]]></script>
<p>The classification of such a system is more difficult. In the next section, we introduce one type of relationship between a matured Romeo and Juliet that will motivate a general solution to coupled differential equations.</p>
<h2 id="the-saddle-of-love">The Saddle of love</h2>
<blockquote>
I might not be the right one <br />
It might not be the right time <br />
But there's something about us I've got to do <br />
Some kind of secret I will share with you
</blockquote>
<p>In a previous life, Juliet and Romeo did not communicate ($b = c = 0$) but listened to their own feelings in opposite ways ($a = -1$ and $d = 1$). Here we include communication:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -2R + J\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= -R + 2J \enspace .
\end{aligned} %]]></script>
<p>Specifically, Romeo dampens his feelings the more strongly he feels ($a = -2$) and listens to Juliet such that whichever way her feelings go, Romeo’s follow suit ($b = 1$). Juliet does the opposite: she increases her feelings of love or hate the more strongly she feels ($d = 2$), and responds to Romeo such that whichever way his feelings go, Juliet’s feelings move the other way ($c = -1$). In a sense, Romeo and Juliet are opposites — can any good come from this?</p>
<p>Before answering this question, we first find a general solution to systems of linear differential equations. This gives us a way to formally classify any (linear) relationships between Romeo and Juliet. The solution will involve eigenvectors and eigenvalues, so let’s roll up our sleeves and get to work!</p>
<h2 id="solving-coupled-differential-equations">Solving coupled differential equations</h2>
<p>In contrast to the first system of linear equations above where Romeo and Juliet did not communicate with each other, the system now is <em>coupled</em>: Romeo’s feelings influence Juliet’s and vice versa. If their feelings were independent, the solution to the differential equations would be easy: just as above, their respective feelings would either grow or decay exponentially. The dependence between their feelings is encoded in the matrix $A$. If $A$ were diagonal, then the equations would be independent.</p>
<p>The solution to our problem thus presents itself: somehow, we must manage to make the matrix $A$ diagonal. We can do this by changing basis, a trick we have also used in deriving a <a href="https://fabiandablander.com/r/Fibonacci.html">closed-form expression of the Fibonacci numbers</a> in a previous blog post. If you are unfamiliar with these ideas, it might pay to read the previous blog post before proceeding.</p>
<p>Assuming that $A$ is <em>diagonalizable</em> (more on that later), we can write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A &= E \Lambda E^{-1} \\[.5em]
\begin{pmatrix} a & b \\ c & d \end{pmatrix} &= \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2\end{pmatrix}\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2\end{pmatrix}^{-1} \enspace ,
\end{aligned} %]]></script>
<p>where $(\lambda_1, \lambda_2)$ are the eigenvalues of $A$ and $\mathbf{v}_1$ and $\mathbf{v}_2$ are the respective eigenvectors. Conceptually, multiplying a vector with $E^{-1}$ changes its basis from the standard basis to the basis of eigenvectors. In this space, the matrix encoding the dependence between our two differential equations is the diagonal matrix of eigenvalues $\Lambda$ — the two differential equations are independent! We know that in this space the solutions to the differential equations are independent exponential functions. However, we have to change back to our standard basis, and we do so by multiplying with $E$.</p>
<p>With this insight, our system of differential equations becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\dot{\mathbf{x}} &= E \Lambda E^{-1} \mathbf{x} \\
E^{-1} \dot{\mathbf{x}} &= \Lambda E^{-1} \mathbf{x} \\
\dot{\mathbf{u}} &= \Lambda \mathbf{u} \enspace ,
\end{aligned} %]]></script>
<p>where we have defined $\mathbf{u} = E^{-1}\mathbf{x}$, which is now with respect to the eigenbasis. Now since:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Lambda = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \enspace , %]]></script>
<p>the solution to the two differential equations is:</p>
<script type="math/tex; mode=display">\mathbf{u} = \begin{pmatrix} C_1e^{\lambda_1 t} \\ C_2e^{\lambda_2 t} \end{pmatrix} \enspace ,</script>
<p>where $C_1$ and $C_2$ are constants of integration, playing the role that $R_0$ and $J_0$ played above. To change back to the standard basis, we multiply with $E$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x} &= E \mathbf{u} \\[.5em]
\mathbf{x} &= \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2 \end{pmatrix} \begin{pmatrix} C_1e^{\lambda_1 t} \\ C_2e^{\lambda_2 t} \end{pmatrix} \\[.5em]
\mathbf{x} &= \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace ,
\end{aligned} %]]></script>
<p>where $\mathbf{v}_1$ and $\mathbf{v}_2$ are eigenvectors and $\lambda_1$ and $\lambda_2$ are the corresponding eigenvalues. Therefore, solving a system of ordinary linear differential equations reduces to finding the eigenvalues and eigenvectors of the matrix $A$.</p>
<h2 id="finding-eigenvalues-and-eigenvectors">Finding eigenvalues and eigenvectors</h2>
<p>An eigenvector of a matrix is a nonzero vector that the matrix merely stretches by a factor of $\lambda$, such that for $\mathbf{v} \neq \mathbf{0}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A\mathbf{v} &= \lambda \mathbf{v} \\[.5em]
(A - I\lambda) \mathbf{v} &= 0 \enspace ,
\end{aligned} %]]></script>
<p>which is true when the determinant of $(A - I\lambda)$ is zero, that is, $\left\vert A - I\lambda\right\vert = 0$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\left\vert\begin{pmatrix} a & b \\ c & d\end{pmatrix} - \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix}\right\vert &= 0 \\[1em]
\left\vert\begin{pmatrix} a - \lambda & b \\ c & d - \lambda \end{pmatrix}\right\vert &= 0 \\[1em]
(a - \lambda)(d - \lambda) - bc &= 0 \\[1em]
\lambda^2 - \lambda(a + d) - ad + bc &= 0 \enspace .
\end{aligned} %]]></script>
<p>We define:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\tau &\equiv \text{trace}(A) = a + d \\[.5em]
\Delta &\equiv \vert A\vert = ad - bc \enspace ,
\end{aligned} %]]></script>
<p>and recall the quadratic formula to find both eigenvalues:</p>
<script type="math/tex; mode=display">\lambda = \frac{\tau \pm \sqrt{\tau^2 - 4\Delta}}{2} \enspace .</script>
<p>In the next section, we apply this to the “saddle of love” differential equation in order to better understand the trajectories Romeo and Juliet’s love could take.</p>
<h2 id="solving-the-saddle-of-love">Solving the saddle of love</h2>
<p>Recall that we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -2R + J\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= -R + 2J \enspace .
\end{aligned} %]]></script>
<p>such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -2 & 1 \\ -1 & 2\end{pmatrix} \enspace . %]]></script>
<p>For our saddle of love, the eigenvalues are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda &= \frac{0 \pm \sqrt{0 - 4\cdot(-3)}}{2} = \frac{\pm \sqrt{4 \cdot 3}}{2} = \pm \sqrt{3} \enspace .
\end{aligned} %]]></script>
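<p>We can quickly confirm the trace–determinant route to the eigenvalues. The sketch below is in Python (the R equivalent, <code>eigen</code>, appears further down), applied to the saddle-of-love matrix with $a = -2$, $b = 1$, $c = -1$, $d = 2$:</p>

```python
from math import sqrt

def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] via the trace and determinant."""
    tau = a + d            # trace
    delta = a * d - b * c  # determinant
    disc = tau**2 - 4 * delta
    # This sketch assumes real eigenvalues (disc >= 0) for simplicity.
    root = sqrt(disc)
    return (tau + root) / 2, (tau - root) / 2

lam1, lam2 = eigenvalues_2x2(-2, 1, -1, 2)
print(lam1, lam2)  # +sqrt(3) and -sqrt(3)
```
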
<p>To find the first eigenvector, we compute for the first eigenvalue $\lambda_1 = \sqrt{3}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(A - I\lambda_1)\mathbf{v}_1 &= 0 \\[.5em]
\begin{pmatrix} -2 - \sqrt{3} & 1 \\ -1 & 2 - \sqrt{3} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2\end{pmatrix} &= \begin{pmatrix} 0 \\ 0 \end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>which has solution $\mathbf{v}_1 = (1, 2 + \sqrt{3})^T$. For $\lambda_2 = -\sqrt{3}$, the eigenvector is $\mathbf{v}_2 = (1, 2 - \sqrt{3})^T$. We can verify this in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## eigen() decomposition
## $values
## [1] -1.732051 1.732051
##
## $vectors
## [,1] [,2]
## [1,] -0.9659258 -0.2588190
## [2,] -0.2588190 -0.9659258</code></pre></figure>
<p>Note that R returns the eigenvalues in the opposite order, scales the eigenvectors to unit length by dividing each by its norm, and in this case also multiplies them by $-1$; this does not matter, as eigenvectors are only defined up to a constant factor.</p>
<p>Plugging the eigenvalues and eigenvectors into our general solution form yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x} &= \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \\[.5em]
\mathbf{x} &= \begin{pmatrix} 1 \\ 2 + \sqrt{3} \end{pmatrix} C_1e^{\sqrt{3} t} + \begin{pmatrix} 1 \\ 2 - \sqrt{3} \end{pmatrix} C_2e^{-\sqrt{3}t} \enspace .
\end{aligned} %]]></script>
<p>We still need to solve for the constants $C_1$ and $C_2$. Assume that at $t = 0$, the feelings for Romeo and Juliet are $\mathbf{x} = (1, 1)^T$. Then we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 \\ 1 \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ 2 + \sqrt{3} & 2 - \sqrt{3} \end{pmatrix} \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \\[.5em]
\begin{pmatrix} 1 & 1 \\ 2 + \sqrt{3} & 2 - \sqrt{3} \end{pmatrix}^{-1}\begin{pmatrix} 1 \\ 1 \end{pmatrix} &= \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \\[.5em]
\begin{pmatrix} 0.21 \\ 0.79 \end{pmatrix} &= \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>which yields the following solutions for Romeo and Juliet:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R(t) &= 0.21 \cdot e^{\sqrt{3} t} + 0.79 \cdot e^{-\sqrt{3} t} \\[.5em]
J(t) &= 0.79 \cdot e^{\sqrt{3} t} + 0.21 \cdot e^{-\sqrt{3} t} \enspace .
\end{aligned} %]]></script>
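<p>These solutions can be verified numerically. The Python sketch below uses the exact constants behind the rounded values, $C_1 = \frac{\sqrt{3} - 1}{2\sqrt{3}} \approx 0.21$ and $C_2 = \frac{\sqrt{3} + 1}{2\sqrt{3}} \approx 0.79$, checks that the initial condition $(1, 1)$ is recovered at $t = 0$, and confirms via a central difference that the trajectories satisfy the coupled equations:</p>

```python
from math import exp, sqrt

s = sqrt(3)
C1 = (s - 1) / (2 * s)  # ~0.21
C2 = (s + 1) / (2 * s)  # ~0.79

def R(t):
    return C1 * exp(s * t) + C2 * exp(-s * t)

def J(t):
    return C1 * (2 + s) * exp(s * t) + C2 * (2 - s) * exp(-s * t)

# The initial condition (1, 1) is recovered at t = 0.
print(R(0.0), J(0.0))

# The coupled equations dR/dt = -2R + J and dJ/dt = -R + 2J hold;
# check them at an arbitrary time point via a central difference.
t, h = 0.5, 1e-6
dR = (R(t + h) - R(t - h)) / (2 * h)
dJ = (J(t + h) - J(t - h)) / (2 * h)
print(abs(dR - (-2 * R(t) + J(t))), abs(dJ - (-R(t) + 2 * J(t))))
```

<p>Both residuals are numerically zero, so the linear combination of the two exponentials really does solve the coupled system.</p>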
<p>Note how this result differs from when Romeo and Juliet did not communicate: the solution is a linear combination of two exponentials — the two lovebirds are clearly coupled! Now that we have seen one worked example, the code below computes the trajectory of Romeo and Juliet for an arbitrary matrix $A$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_linear</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">tmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># compute eigenvectors and eigenvalues</span><span class="w">
</span><span class="n">eig</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">vectors</span><span class="w">
</span><span class="n">lambdas</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">values</span><span class="w">
</span><span class="c1"># solve for the initial condition</span><span class="w">
</span><span class="n">C</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">inits</span><span class="w">
</span><span class="c1"># create time steps</span><span class="w">
</span><span class="n">ts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tmax</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ts</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="p">(</span><span class="n">C</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">lambdas</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">t</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Re drops the imaginary part ... more on that later!</span><span class="w">
</span><span class="nf">Re</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The code for visualizing vector fields for two coupled linear differential equations is given below.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'fields'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_vector_field</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">.50</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">.50</span><span class="p">)</span><span class="w">
</span><span class="n">RJ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w">
</span><span class="n">dRJ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">RJ</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">arrow.plot</span><span class="p">(</span><span class="w">
</span><span class="n">RJ</span><span class="p">,</span><span class="w"> </span><span class="n">dRJ</span><span class="p">,</span><span class="w">
</span><span class="n">arrow.ex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.05</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray82'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">3.9</span><span class="p">,</span><span class="w"> </span><span class="m">-.2</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.2</span><span class="p">,</span><span class="w"> </span><span class="m">3.9</span><span class="p">,</span><span class="w"> </span><span class="s1">'J'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">-4</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Before we visualize the vector field, let me again stress that the solution to a system of two coupled linear differential equations is of the form:</p>
<script type="math/tex; mode=display">\mathbf{x} = \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace .</script>
<p>The eigenvectors coincide with the standard basis vectors when the two differential equations are independent, as was the case above when Romeo and Juliet did not communicate. In such cases, exponential growth or decay is along the standard basis vectors, i.e., the x- and y-axes. For the case we are considering now, this is no longer true: the eigenvectors differ from the standard basis vectors. It therefore makes sense to visualize the eigenvectors, as they are in some sense more fundamental to the solution. However, we want to retain the interpretability of the standard basis, as this is our reference frame for the initial condition. In the following visualizations, therefore, we add the eigenvectors, which makes apparent exactly in which directions there is exponential growth or decay.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_eigenvectors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="o">$</span><span class="n">vectors</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">add_line</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">solve_linear</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inits</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'red'</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">plot_vector_field</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'The Saddle of Love'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_eigenvectors</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">.75</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">.75</span><span class="p">))</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.7</span><span class="p">)</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'white'</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>The figure above visualizes the resulting vector field, the standard basis (solid lines), the eigenvectors (dashed lines), and four example trajectories (red lines). The eigenvectors define different quadrants than the standard basis. If Romeo and Juliet start in the top right or top left eigenquadrant, then their love grows exponentially. If they start in the bottom left or bottom right eigenquadrant, their hate grows exponentially. Note that we again have a saddle point, as there is exponential decay along one eigenvector and exponential growth along the other; only if Romeo and Juliet’s initial feelings are exactly on the decaying eigenvector do we end up in a state of indifference.</p>
<!-- An interesting case is if Juliet starts out positive while Romeo has initial feelings of hate, but not too much so that they are in the bottom left eigenquadrant, their love grows eternally. This makes sense: Romeo downweights his own feelings ($b = -2$) and is positively influenced by Juliet's love ($d = 1$). -->
<!-- On the other hand, if Juliet starts out negative and Romeo starts with love, they increasingly hate each other. This is again reasonable, as Romeo downplays his own positive feelings and "takes over" Juliet's negative ones. So, too, is their fate when both start out with hate. -->
<p>In the next section, we go beyond the saddle of love and study what different matrices $A$ imply for the stability landscape of love affairs.</p>
<h1 id="a-classification-of-linear-systems">A classification of linear systems</h1>
<p>All we need to know to classify the relationship between Romeo and Juliet is the trace $\tau = a + d$ and the determinant $\Delta = ad - bc$ of the matrix $A$. We can rewrite these in terms of eigenvalues:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda_1 + \lambda_2 &= \frac{1}{2}\left(\tau + \sqrt{\tau^2 - 4\Delta}\right) + \frac{1}{2}\left(\tau - \sqrt{\tau^2 - 4\Delta}\right) = \tau \\[.5em]
\lambda_1 \lambda_2 &= \frac{1}{2}\left(\tau + \sqrt{\tau^2 - 4\Delta}\right)\frac{1}{2}\left(\tau - \sqrt{\tau^2 - 4\Delta}\right) \\[.5em]
&= \frac{1}{4} \left(\tau^2 - \tau^2 + 4\Delta\right) \\[.5em]
&= \Delta \enspace ,
\end{aligned} %]]></script>
<p>which means that we can characterize a linear system solely by its eigenvalues. If $\lambda_1 < 0$ we have exponential decay and if $\lambda_1 > 0$ we have exponential growth in the direction of the first eigenvector, $\mathbf{v}_1$. The same holds for $\lambda_2$ and $\mathbf{v}_2$.</p>
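We can convince ourselves of these identities numerically (a quick sanity check in R, using an arbitrary example matrix):

```r
# Sanity check: the eigenvalues of a 2x2 matrix sum to the trace
# and multiply to the determinant (arbitrary example matrix)
A <- cbind(c(-2, -1), c(1, 2))
lambdas <- eigen(A)$values

tau <- sum(diag(A))  # trace: a + d
Delta <- det(A)      # determinant: ad - bc

all.equal(sum(lambdas), tau)    # TRUE
all.equal(prod(lambdas), Delta) # TRUE
```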
<h2 id="keepin-it-real-attracting-and-repelling-nodes">Keepin’ it real: Attracting and repelling nodes</h2>
<blockquote>
No it ain't no use in callin' out my name, gal <br />
Like you never done before <br />
And it ain't no use in callin' out my name, gal <br />
I can't hear ya any more.
</blockquote>
<p>If $\tau^2 - 4\Delta > 0$, both eigenvalues are real. If both are negative, then the origin is an attracting fixed point; if they are positive, the origin is a repelling fixed point. As an example, take this matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
A_{1} = \begin{pmatrix} -1 & 0.50 \\ 1 & -1\end{pmatrix} \enspace , %]]></script>
<p>which means that Romeo downplays his feelings as strongly as Juliet does, but is influenced only half as strongly by Juliet’s feelings as Juliet is by his.</p>
<p>The matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
A_{2} = \begin{pmatrix} 1 & 0.50 \\ 0.25 & 0.50 \end{pmatrix} \enspace , %]]></script>
<p>shows that both Romeo and Juliet reinforce each other’s feelings ($b = 0.50$ and $c = 0.25$) as well as their own ($a = 1$ and $d = 0.50$). We know from above that this cannot be mathematically stable!</p>
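A quick numerical check (re-entering the two matrices so the snippet is self-contained) confirms the signs of the eigenvalues:

```r
# Signs of the eigenvalues determine the stability of the node
A1 <- cbind(c(-1, 1), c(.5, -1))   # attracting node
A2 <- cbind(c(1, .25), c(.5, .5))  # repelling node

eigen(A1)$values  # both negative (about -1.71 and -0.29)
eigen(A2)$values  # both positive (about 1.18 and 0.32)
```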
<p>The figure on the left below shows that indifference is the result of the relationship governed by $A_1$, regardless of the starting point. Nodes generally have a slow and a fast eigendirection; the larger the eigenvalue in magnitude, the stronger the pull in the direction of the corresponding eigenvector. For the stable node on the left, the fast eigendirection is clearly given by the negative eigenvector — all trajectories are strongly pulled in its direction; only gradually are they pulled in the other eigendirection until they end up at the origin.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>The figure on the right shows the relationship governed by $A_2$, which yields a more tumultuous love affair. In particular, Romeo and Juliet always have opposite feelings toward each other that also grow exponentially: Romeo becomes madder and madder in love with Juliet while Juliet becomes more and more hateful towards him, or the reverse — it doesn’t matter how loud one of them calls the other, there will be no positive response. The fast eigendirection is now given by the positive eigenvector; all trajectories initially go up (or down) a bit, before they get pulled heavily in the eigenvector’s direction, moving almost parallel to it.</p>
<p>In both the above cases, the eigenvalues are distinct. This allows one eigendirection to be slow and the other fast. In the next section, we look at what happens when both eigenvalues are equal.</p>
<h2 id="one-dimensional-love">One-dimensional love</h2>
<blockquote>
Ah, now I don't hardly know her <br />
But I think I could love her <br />
Crimson and clover
</blockquote>
<p>If $\tau^2 - 4\Delta = 0$, the matrix $A$ does not have distinct eigenvalues. We can distinguish two cases. First, as in our very first example, we could have:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix} \enspace , %]]></script>
<p>which yields a <em>star node</em>: all directions point either to the origin ($\lambda < 0$) or away from it ($\lambda > 0$). We have visualized this vector field for $\lambda = -1$ and $\lambda = 1$ when Romeo met Juliet, so we do not visualize it here. In this case, $A$ is <em>diagonalizable</em>, that is, we can find matrices $\Lambda$ and $E$ such that:</p>
<script type="math/tex; mode=display">A = E \Lambda E^{-1} \enspace .</script>
<p>To see this in R, assume that $\lambda = -1$. The following code should give us $A$ back.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">check_decomposition</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">eig</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">vectors</span><span class="w">
</span><span class="n">Lambda</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="n">eig</span><span class="o">$</span><span class="n">values</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">Lambda</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="o">-</span><span class="n">diag</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">check_decomposition</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1] [,2]
## [1,] -1 0
## [2,] 0 -1</code></pre></figure>
<p>For $A$ to be diagonalizable, we require that $E$, the matrix of eigenvectors, is invertible. A matrix is invertible if it is <em>full rank</em>, which requires that the eigenvectors be independent, that is, they must span the plane. This brings us to the second case. Assume that:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -1 & -1 \\ 0 & -1 \end{pmatrix} \enspace . %]]></script>
<p>Then the two eigenvalues are again equal, but <em>the eigenvectors are not independent</em>. We can still compute the eigendecomposition in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">A</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">-1</span><span class="w">
</span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## eigen() decomposition
## $values
## [1] -1 -1
##
## $vectors
## [,1] [,2]
## [1,] 1 1.000000e+00
## [2,] 0 2.220446e-16</code></pre></figure>
<p>The only eigenvector is $\mathbf{v}_1 = (1, 0)^T$, even though R tells us that there are two distinct ones due to numerical imprecision. If we were to diagonalize the matrix, we would get an error, since $E$ is <em>singular</em>, that is, not invertible:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">check_decomposition</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Error in solve.default(E): system is computationally singular: reciprocal condition number = 1.11022e-16</code></pre></figure>
<p>We can, however, still visualize the vector field. We now have a <em>degenerate node</em> in which all trajectories are parallel to the eigenvector (which in this case is the x-axis, since $\mathbf{v}_1 = (1, 0)^T$).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-16-1.png" title="plot of chunk unnamed-chunk-16" alt="plot of chunk unnamed-chunk-16" style="display: block; margin: auto;" /></p>
<p>While we can plot the vector field, we cannot use our diagonalization trick to compute a closed-form solution, since we cannot invert $E$. We could use numerical methods to compute trajectories; I will discuss this in more detail in a follow-up post on nonlinear differential equations for which we generally cannot get a closed-form expression. However, we can get such an expression for linear systems even if $A$ is not diagonalizable by using <em>matrix exponentials</em>. Since this would take us a little too far here, I defer this treatment to the <em>Post Scriptum</em>.</p>
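As a small teaser for the numerical approach (a minimal forward Euler sketch, not the matrix exponential solution; the step size <code class="language-plaintext highlighter-rouge">dt</code> is chosen arbitrarily), we can approximate a trajectory for this degenerate node directly from the vector field:

```r
# Forward Euler approximation of dx/dt = A x, where A is the
# non-diagonalizable matrix of the degenerate node from above
A <- cbind(c(-1, 0), c(-1, -1))

euler_trajectory <- function(A, x0, dt = .01, tmax = 10) {
  n <- ceiling(tmax / dt)
  x <- matrix(0, nrow = n, ncol = 2)
  x[1, ] <- x0
  for (i in seq(2, n)) {
    # take a small step in the direction the vector field prescribes
    x[i, ] <- x[i - 1, ] + dt * as.vector(A %*% x[i - 1, ])
  }
  x
}

traj <- euler_trajectory(A, x0 = c(3, 3))
tail(traj, 1)  # very close to the stable fixed point (0, 0)
```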
<p>In the next two sections, we complete our classification of linear systems by allowing Romeo and Juliet’s love to oscillate.</p>
<h2 id="spiralling-love">Spiralling love</h2>
<blockquote>
Sometimes I feel so happy <br />
Sometimes I feel so sad <br />
Sometimes I feel so happy <br />
But mostly you just make me mad <br />
Baby, you just make me mad
</blockquote>
<p>Observe that:</p>
<script type="math/tex; mode=display">\lambda = \frac{\tau}{2} \pm \frac{\sqrt{\tau^2 - 4\Delta}}{2} \enspace ,</script>
<p>which will be complex if $\tau^2 - 4\Delta < 0$. We rewrite the eigenvalues slightly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda &= \frac{\tau}{2} \pm \frac{\sqrt{\tau^2 - 4\Delta}}{2} \\[.5em]
&= \frac{\tau}{2} \pm \frac{\sqrt{-1}\sqrt{4\Delta - \tau^2}}{2} \\[.5em]
&= \alpha \pm i\omega \enspace ,
\end{aligned} %]]></script>
<p>where $\alpha = \tau / 2$ and $\omega = \sqrt{4\Delta - \tau^2} / 2$. The solution to the system of differential equation is still of the form:</p>
<script type="math/tex; mode=display">\mathbf{x} = \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace .</script>
<p>However, the $\lambda$’s are now complex which results in:</p>
<script type="math/tex; mode=display">e^{\lambda t} = e^{(\alpha \pm i \omega)t} = e^{\alpha t} e^{\pm i\omega t} = e^{\alpha t} \left[\text{cos}(\omega t) + i \cdot \text{sin}(\omega t) \right] \enspace .</script>
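R handles complex arithmetic natively, so we can verify Euler’s formula numerically (a quick sketch with arbitrary values for $\alpha$, $\omega$, and $t$):

```r
# Numerical check of Euler's formula:
# e^{(alpha + i omega) t} = e^{alpha t} (cos(omega t) + i sin(omega t))
alpha <- -.1
omega <- 2
t <- 1.5

lhs <- exp((alpha + 1i * omega) * t)
rhs <- exp(alpha * t) * (cos(omega * t) + 1i * sin(omega * t))
all.equal(lhs, rhs)  # TRUE
```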
<p>For $\alpha < 0$ and $\omega \neq 0$ we have <em>dampened oscillations</em>: they decay exponentially. For $\alpha > 0$ and $\omega \neq 0$ we have <em>amplifying oscillations</em>: they grow exponentially. To see this visually, let’s take the matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -0.20 & -1 \\ 1 & 0\end{pmatrix} \enspace . %]]></script>
<p>This implies that Romeo dampens his own feelings slightly ($a = -0.20$) and feels more love when Juliet hates him and more hate if Juliet loves him ($b = -1$). On the other hand, Juliet does not listen to her own feelings ($d = 0$) and mimics Romeo’s feelings ($c = 1$). Where does this lead the two love birds?</p>
<p>The figure below on the left visualizes the vector field and one trajectory of love. The figure on the right visualizes Romeo and Juliet’s trajectory separately.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-17-1.png" title="plot of chunk unnamed-chunk-17" alt="plot of chunk unnamed-chunk-17" style="display: block; margin: auto;" /></p>
<p>Although both lovers start at mutual affection, over the course of their relationship, they feel happy, then sad, then happy, then sad, until they don’t feel anymore. If, on the other hand, we change $A$ to</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} 0.10 & -1 \\ 1 & 0\end{pmatrix} \enspace , %]]></script>
<p>we have $\alpha = 0.05$, which is positive. This implies slower growth than the decay we had before ($\alpha = -0.10$). If we allow both lovers only an ounce of mutual affection $(0.1, 0.1)$, they will spiral forever, their feelings always growing, always changing.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-18-1.png" title="plot of chunk unnamed-chunk-18" alt="plot of chunk unnamed-chunk-18" style="display: block; margin: auto;" /></p>
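A quick check (re-entering the two spiral matrices so the snippet is self-contained) confirms the values of $\alpha$ as the real part of the eigenvalues:

```r
# The real part of the complex eigenvalues gives alpha
A_damp <- cbind(c(-.2, 1), c(-1, 0))  # dampened spiral
A_grow <- cbind(c(.1, 1), c(-1, 0))   # amplifying spiral

Re(eigen(A_damp)$values)  # both -0.10
Re(eigen(A_grow)$values)  # both 0.05
```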
<p>I encourage you to play around with the code a bit to get an intuition for these things. In the next section, we look at a special case of this linear system before we wrap up.</p>
<h2 id="the-circle-of-love">The circle of love</h2>
<blockquote>
Oh, so long, Marianne <br />
It's time that we began to laugh <br />
And cry and cry and laugh about it all again. <br />
</blockquote>
<p>An interesting special case of the spiral of love occurs when $\alpha = 0$ such that all eigenvalues are imaginary. As an example, let $a = 0$, $b = -1$, $c = 1$, and $d = 0$ such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -J \\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= R \enspace .
\end{aligned} %]]></script>
<p>Romeo and Juliet do not listen to their own feelings anymore, but only to their partner’s feelings. However, they do so in opposite ways. For Romeo, this model implies that when Juliet’s feelings for him are high ($J > 0$), Romeo’s feelings for Juliet <em>decrease</em>. If they are low ($J < 0$), then his feelings <em>increase</em>. For Juliet, it is exactly the opposite: when Romeo’s feelings are strong ($R > 0$), her feelings <em>increase</em>, while when his feelings wane ($R < 0$), her feelings <em>decrease</em>. Is this a (mathematically) stable relationship? To find out, we visualize the vector field below as well as three love trajectories.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-19-1.png" title="plot of chunk unnamed-chunk-19" alt="plot of chunk unnamed-chunk-19" style="display: block; margin: auto;" /></p>
<p>Romeo and Juliet are stuck in a never-ending circle! Regardless of the starting point, they will be prisoners to the Sisyphean circle of love which will make them laugh and cry and cry and laugh about it all again. Except, of course, when they start at the origin $(0, 0)$: if they start with indifference, they will forever stay indifferent. Note that the fixed point is now called a <em>center</em>, which is <em>neutrally stable</em>, since nearby trajectories are neither attracted to nor repelled from the fixed point.</p>
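<p>This neutral stability can be checked numerically. As a quick sketch (not from the original post; the helper <code>love_circle</code> is mine), the system above is solved by a rotation, and the squared distance from the origin, $R^2 + J^2$, stays constant along every trajectory:</p>

```r
# A rotation solves dR/dt = -J, dJ/dt = R (check by differentiating)
love_circle <- function(R0, J0, t) {
  c(R = R0 * cos(t) - J0 * sin(t),
    J = R0 * sin(t) + J0 * cos(t))
}

# R^2 + J^2 is conserved: every trajectory is a circle around the origin
ts <- seq(0, 2 * pi, length.out = 100)
radii <- sapply(ts, function(t) sum(love_circle(1, 0, t)^2))
all(abs(radii - 1) < 1e-12)  # TRUE
```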
<p>We have started and ended our journey of relationships with two extremes: ignoring the other’s feelings and ignoring one’s own. Both are unhealthy. <em>Communication is key</em>. In the next section, we recap the types of linear systems we have seen in this blog post.</p>
<h1 id="classification-recap">Classification recap</h1>
<blockquote>
She took off a silver locket <br />
She said remember me by this <br />
She put her hand in my pocket <br />
I got a keepsake and a kiss
</blockquote>
<p>The figure below summarizes the classification of linear systems we have, step by step, developed in this blog post (see also Strogatz, 2015, p. 140).</p>
<!-- <div style="text-align:center;"> -->
<!-- <img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" /> -->
<!-- </div> -->
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto auto auto 0;" /></p>
<p>If both $\tau = 0$ and $\Delta = 0$, the eigenvalues are zero and the solution is a constant: Romeo and Juliet’s feelings will forever stay wherever they started — we have a plane of fixed points. If $\Delta = 0$ but $\tau \neq 0$, one eigenvalue is zero and the other equals $\tau$: feelings stay constant along one eigendirection and exponentially grow or decay along the other — we have a line of fixed points.</p>
<p>Saddle points occur when $\Delta < 0$, which implies that one eigenvalue is positive and the other is negative; that is, we have exponential growth in one eigendirection and exponential decay in the other. The fixed point $(0, 0)$ is generally unstable, except when the initial condition lies exactly on the eigendirection along which there is exponential decay.</p>
<p>If $\tau = 0$ and $\Delta > 0$, the eigenvalues are purely imaginary, resulting in a <em>center</em> — the circle of love. Centers become <em>spirals</em> if $\tau \neq 0$, since the eigenvalues then have a nonzero real part, which results in amplifying oscillations ($\tau > 0$) or dampened oscillations ($\tau < 0$).</p>
<p>On the parabola described by $\tau^2 - 4\Delta = 0$ we have repeated eigenvalues. If the resulting eigenvectors are independent, we have a <em>star node</em> in which all directions either point towards the origin ($\lambda < 0$) or away from it ($\lambda > 0$).</p>
<p>If the resulting eigenvectors are not independent, we have a <em>degenerate node</em>; we cannot invert the matrix of eigenvectors anymore and thus need to use other methods. One such method is provided by matrix exponentials — see the <em>Post Scriptum</em>.</p>
<p>Above the parabola, we have <em>stable nodes</em> for $\tau < 0$ and <em>unstable nodes</em> for $\tau > 0$.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
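<p>To make the recap concrete, we can translate it into code. The following is a sketch of my own (the helper <code>classify_system</code> is not from the post): it classifies the fixed point of a two-dimensional linear system from the trace $\tau$ and determinant $\Delta$ of its matrix $A$.</p>

```r
# Classify the fixed point of dx/dt = Ax from the trace and determinant
classify_system <- function(A) {
  tau   <- sum(diag(A))       # trace
  Delta <- det(A)             # determinant
  disc  <- tau^2 - 4 * Delta  # discriminant: repeated eigenvalues if zero
  if (Delta < 0)  return('saddle point')
  if (Delta == 0) return('line (or plane) of fixed points')
  if (tau == 0)   return('center')
  if (disc < 0)   return(ifelse(tau < 0, 'stable spiral', 'unstable spiral'))
  if (disc == 0)  return('star or degenerate node')
  ifelse(tau < 0, 'stable node', 'unstable node')
}

classify_system(matrix(c(0, 1, -1, 0), 2))    # the circle of love: 'center'
classify_system(matrix(c(-1, 0, -1, -1), 2))  # 'star or degenerate node'
```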
<h1 id="conclusion">Conclusion</h1>
<blockquote>
When you can fall for chains of silver, you can fall for chains of gold <br />
You can fall for pretty strangers and the promises they hold <br />
You promised me everything, you promised me thick and thin, yeah <br />
Now you just say "Oh, Romeo, yeah, you know I used to have a scene with him"
</blockquote>
<p>In this blog post, we have seen that linear differential equations are a powerful tool to model how systems change over time in general, and how the love affair between two lovebirds can evolve in particular. We have started out with an isolated Romeo whose feelings either exponentially grow or decay. Romeo then met Juliet, and we have extended the single differential equation to a system of two equations to accommodate this life event.</p>
<p>Love affairs can take many shapes and forms. We have classified those depending on their stability landscape, and seen that linear differential equations can be solved in closed-form by using eigenvectors and eigenvalues or matrix exponentials. In a follow-up blog post, Romeo and Juliet’s love will overcome the shackles of linearity, and we end up with nonlinear differential equations. This will make for more intriguing relationships. We will also add a third lover and study how the dynamics change — it might get chaotic!</p>
<hr />
<p>I would like to thank <a href="https://ryanoisin.github.io/">Oisín Ryan</a> for discussion as well as extensive and very helpful comments on this blog post.</p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="solving-differential-equations-using-matrix-exponentials">Solving differential equations using matrix exponentials</h3>
<!-- <blockquote> -->
<!-- There must be some kind of way outta here <br> -->
<!-- Said the joker to the thief <br> -->
<!-- There's too much confusion <br> -->
<!-- I can't get no relief -->
<!-- </blockquote> -->
<p>Recall that the solution to the single linear differential equation $\frac{\mathrm{d}x}{\mathrm{d}t} = ax$ is:</p>
<script type="math/tex; mode=display">x(t) = x_0 e^{at} \enspace .</script>
<p>The series expansion of $e^{at}$ is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{at} &= 1 + at + \frac{(at)^2}{2!} + \frac{(at)^3}{3!} + \ldots \\[.5em]
&= \sum_{k=0}^{\infty} \frac{(at)^k}{k!} \enspace .
\end{aligned} %]]></script>
<p>The idea is to generalize this to allow for a matrix in the exponent. In particular, analogously to the one-dimensional case, we want the system</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = A\mathbf{x} \enspace ,</script>
<p>to have solutions of the form:</p>
<script type="math/tex; mode=display">\mathbf{x}(t) = \mathbf{x}_0e^{At} \enspace .</script>
<p>First, we generalize this series expansion to matrices:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{At} &= I + At + \frac{(At)^2}{2!} + \frac{(At)^3}{3!} + \ldots \\[.5em]
&= \sum_{k=0}^{\infty} \frac{t^k}{k!} A^k \enspace ,
\end{aligned} %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A^0 &= I \\[.5em]
A^k &= \underbrace{A \cdot A \cdot \ldots \cdot A}_{\text{k times}} \enspace .
\end{aligned} %]]></script>
<p>With this definition, we assume that $\mathbf{x} = \mathbf{x}_0 e^{At}$ and check whether it is true that:</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = \frac{\mathrm{d}\mathbf{x}_0e^{At}}{\mathrm{d}t} = A \mathbf{x} = A \mathbf{x}_0 e^{At} \enspace .</script>
<p>Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}\mathbf{x}_0e^{At}}{\mathrm{d}t} &= \mathbf{x}_0 \left(0 + A + \frac{2A^2 t}{2!} + \frac{3A^3t^2}{3!} + \ldots\right) \\[.5em]
&= \mathbf{x}_0 \left(A + A^2t + \frac{A^3t^2}{2!} + \ldots\right) \\[.5em]
&= \mathbf{x}_0 A\left(I + At + \frac{A^2t^2}{2!} + \ldots \right) \\[.5em]
&= \mathbf{x}_0 A e^{At} \\[.5em]
&= A \mathbf{x}_0 e^{At} \\[.5em]
&= A \mathbf{x} \enspace ,
\end{aligned} %]]></script>
<p>which shows that, indeed, the matrix exponential of $A$ is a solution to a system of linear differential equations!</p>
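<p>The series definition can also be implemented directly. The following is a naive sketch of my own that simply truncates the series after a fixed number of terms; this is fine for illustration, though not how one should compute matrix exponentials in practice (see Moler &amp; Van Loan, 2003). The helper name <code>expm_series</code> and the truncation point $K$ are my choices:</p>

```r
# Truncate e^{At} = I + At + (At)^2/2! + ... after K terms
expm_series <- function(A, t = 1, K = 50) {
  term <- diag(nrow(A))  # k = 0 term: the identity matrix
  res  <- term
  for (k in seq_len(K)) {
    term <- term %*% (A * t) / k  # builds (At)^k / k! incrementally
    res  <- res + term
  }
  res
}

# For the circle-of-love matrix from above, e^{At} is a rotation by angle t
A  <- matrix(c(0, 1, -1, 0), 2)
t0 <- pi / 3
R  <- matrix(c(cos(t0), sin(t0), -sin(t0), cos(t0)), 2)
max(abs(expm_series(A, t0) - R))  # agrees up to rounding error
```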
<!-- Why do we care? Our motivating example was that we cannot use the eigen decomposition to solve a system of linear differential equations when the eigenvectors are not independent, since the resulting matrix is not invertible. Using the matrix exponential, however, there is no mention of eigenvectors. -->
<p>The matrix exponential solution <em>generalizes</em> the solution using the eigendecomposition to non-diagonalizable matrices $A$. For a diagonalizable matrix $A$, we can connect the approach of using the <a href="https://en.wikipedia.org/wiki/Matrix_exponential">matrix exponential</a> to solve a system of linear differential equations to the eigendecomposition approach we have discussed above. Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{A} &= E e^{\Lambda} E^{-1} \\[.5em]
&= E \begin{pmatrix} e^{\lambda_1} & 0 \\ 0 & e^{\lambda_2}\end{pmatrix} E^{-1} \enspace ,
\end{aligned} %]]></script>
<p>that is, the matrix exponential of a diagonal matrix is given by simply exponentiating each diagonal element. This is then the solution in the eigenbasis, which we transform back by multiplying with $E$, as we have done earlier. For diagonalizable matrices, this is a very convenient way of computing the matrix exponential. For general matrices, this is not possible and one needs to rely on other ways of computing the matrix exponential (see Moler &amp; Van Loan, 2003).</p>
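<p>For a diagonalizable $A$, this identity is straightforward to check in R. Below is a sketch of my own (the helper <code>expm_eigen</code> is not from the post), verified on a symmetric matrix whose matrix exponential is known in closed form:</p>

```r
# e^A = E e^Lambda E^{-1} for a diagonalizable matrix A
expm_eigen <- function(A) {
  eig <- eigen(A)
  E   <- eig$vectors
  E %*% diag(exp(eig$values)) %*% solve(E)
}

# For A = [[0, 1], [1, 0]], e^A = [[cosh(1), sinh(1)], [sinh(1), cosh(1)]]
A <- matrix(c(0, 1, 1, 0), 2)
max(abs(expm_eigen(A) - matrix(c(cosh(1), sinh(1), sinh(1), cosh(1)), 2)))
```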
<p>To return to our initial problem: we want an expression for the solution of the system described by:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -1 & -1 \\ 0 & -1 \end{pmatrix} \enspace , %]]></script>
<p>in order to easily compute the trajectory of Romeo and Juliet’s feelings. Assuming that $\mathbf{x}_0 = (1, 1)$, the solution to the system is:</p>
<script type="math/tex; mode=display">\mathbf{x}(t) = \begin{pmatrix} 1 \\ 1\end{pmatrix} e^{At} \enspace ,</script>
<p>which we can implement straightforwardly in R.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'expm'</span><span class="p">)</span><span class="w">
</span><span class="n">solve_linear2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">tmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># create time steps</span><span class="w">
</span><span class="n">ts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tmax</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ts</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expm</span><span class="p">(</span><span class="n">A</span><span class="o">*</span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">inits</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">x</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below visualizes a few trajectories of this system that were hitherto uncomputable using the eigendecomposition.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-22-1.png" title="plot of chunk unnamed-chunk-22" alt="plot of chunk unnamed-chunk-22" style="display: block; margin: auto;" /></p>
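<p>For this particular $A$, we can also write the matrix exponential down by hand: $A = -I + N$ where $N$ is nilpotent ($N^2 = 0$), so the series terminates and $e^{At} = e^{-t}(I + Nt)$. This gives a cheap sanity check in base R (a sketch; the helper <code>traj</code> is mine, not from the post):</p>

```r
# For A = -I + N with N^2 = 0, e^{At} = e^{-t} (I + Nt), which gives
# the trajectory x(t) = e^{-t} * (x0[1] - t * x0[2], x0[2])
traj <- function(t, x0 = c(1, 1)) {
  exp(-t) * c(x0[1] - t * x0[2], x0[2])
}

traj(0)             # starts at the initial condition (1, 1)
max(abs(traj(20)))  # by t = 20 the trajectory has decayed towards the origin
```

<p>Since the repeated eigenvalue $\lambda = -1$ is negative, every trajectory ends up at the origin, in line with the figure above.</p>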
<!-- [Strogatz mentions](https://youtu.be/QrHRaA93Nrg?list=PLbN57C5Zdl6j_qJA-pARJnKsmROzPnO9V&t=4404) that such degenerate nodes are rather unlikely in the real world, and that fits with our story since $d = 0$ implies that Juliet does not listen to her heart, which contradicts our assumption that she has matured as a lover. -->
<hr />
<h2 id="references">References</h2>
<ul>
<li>Strogatz, S. H. (<a href="https://www.tandfonline.com/doi/abs/10.1080/0025570X.1988.11977342">1988</a>). Love affairs and differential equations. <em>Mathematics Magazine, 61</em>(1), 35-35.</li>
<li>Strogatz, S. H. (<a href="http://www.stevenstrogatz.com/books/nonlinear-dynamics-and-chaos-with-applications-to-physics-biology-chemistry-and-engineering">2015</a>). Nonlinear Dynamics and Chaos: With applications to Physics, Biology, Chemistry, and Engineering. Colorado, US: Westview Press.</li>
<li>Nonlinear Dynamics and Chaos Lectures by Steven Strogatz, especially <a href="https://www.youtube.com/watch?v=QrHRaA93Nrg&list=PLbN57C5Zdl6j_qJA-pARJnKsmROzPnO9V&index=5">Lecture 5</a>.</li>
<li>Ryan, O., Kuiper, R. M., & Hamaker, E. L. (<a href="https://link.springer.com/chapter/10.1007/978-3-319-77219-6_2">2018</a>). A continuous time approach to intensive longitudinal data: What, Why and How? In K. v. Montfort, J. H. L. Oud, & M. C. Voelkle (Eds.), <em>Continuous time modeling in the behavioral and related sciences</em>. New York: Springer.</li>
<li>Moler, C., & Van Loan, C. (<a href="https://epubs.siam.org/doi/abs/10.1137/S00361445024180?casa_token=ROT7WzzdP14AAAAA:qedJ1cEiWWcPbjq42eSdeKk7LhoAcJYx4eahw3txUDckZS0QCOJhCXaH2nSsuBViH_i8YwBwxQ">2003</a>). Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. <em>SIAM review, 45</em>(1), 3-49.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We have used the love affair between Romeo and Juliet to motivate the classification of a system of two linear differential equations. This was the main goal of the blog post. With this classification in mind, however, one could now study love affairs from a more “substantive” point of view; see Strogatz (1988) and Strogatz (2015, p. 143). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderDifferential equations are a powerful tool for modeling how systems change over time, but they can be a little hard to get into. Love, on the other hand, is humanity’s perennial topic; some even claim it is all you need. In this blog post — inspired by Strogatz (1988, 2015) — I will introduce linear differential equations as a means to study the types of love affairs two people might find themselves in. Do opposites attract? What happens to a relationship if lovers are out of touch with their own feelings? We will answer these and other questions using two coupled linear differential equations. On our journey, we will use graphical as well as mathematical methods to classify the types of relationships this modeling framework can accommodate. In a follow-up blog post, we will also play around with non-linear terms and add a third wheel to the mix, which can lead to chaos — in the technical sense of the term, of course. Excited? Then let’s get started! Introducing Romeo A lovestruck Romeo sang the streets of serenade Laying everybody low with a love song that he made Finds a streetlight, steps out of the shade Says something like, "You and me, babe, how about it?" Romeo is quite the emotional type. Let $R(t)$ denote his feelings at time point $t$. Following common practice, we will usually write $R$ instead of $R(t)$, making the time dependence implicit. The process which describes how Romeo’s feelings change is rather simple: it depends only on Romeo’s current feelings. We write: which is a linear differential equation. Note that this implicitly encodes how Romeo’s feelings change over time, since when we know $R$ at time point $t$, we can compute the direction and speed with which $R$ will change — the derivative denotes velocity. Our goal, however, is to find an explicit, closed-form expression for Romeo’s feelings at time point $t$. 
In this particular case, we can do this analytically: A differential equation describes how something changes; to kickstart the process, we need an initial condition $R_0$. This allows us to find the constant of integration $C$. In particular, assume that $R = R_0$ at $t = 0$, which leads to: such that: The left two panels of the figure below visualize how Romeo’s feelings change over time for $a > 0$ with initial condition $R_0 = 1$ (top) or $R_0 = -1$ (bottom). The right two panels show how his feelings change for $a < 0$ with $R_0 = 100$ (top) or $R_0 = -100$ (bottom). We conclude: Romeo is a simple guy, albeit with an emotion regulation problem. When the object of his affection is such that $a > 0$, his feelings will either grow exponentially towards mad love if he starts out with a positive first impression ($R_0 > 0$), or grow exponentially towards mad hatred if he starts out with a negative first impression ($R_0 < 0$). On the other hand, if $a < 0$, then regardless of his initial feelings, they will decay exponentially towards indifference. For $R_0 = 0$, Romeo’s feelings never change. For any other initial condition, we have unhindered, exponential growth when $a > 0$; it never stops. For any other initial condition and $a < 0$, we crash down to zero very rapidly. Thus $R = 0$ is a fixed point in both cases, which is stable for $a < 0$ but becomes unstable if $a > 0$. We can visualize this in phase space on a line. The phase space is filled with all possible trajectories because each point can serve as the initial condition. In the next section, a wonderful new episode in Romeo’s life begins: he meets Juliet. Introducing Juliet Juliet says, "Hey, it's Romeo, you nearly gave me a heart attack" He's underneath the window, she's singing, "Hey, la, my boyfriend's back You shouldn't come around here singing up at people like that Anyway, what you gonna do about it?" Life becomes more complicated for Romeo now that Juliet is in his life.
It is their first real relationship, and they have much to learn. We start simple. Let $J$ denote Juliet’s feelings for Romeo, and let $R$ denote Romeo’s feelings for Juliet. We can extend our single linear differential equation from above to a system of two linear differential equations: Using the results from above, the solutions to the two differential equations are: where $R(t)$ and $J(t)$ give the trajectories of love for Romeo and Juliet, respectively, and $J_0$ is Juliet’s initial feeling towards Romeo at $t = 0$. In contrast to the one-dimensional phase diagram from above, we now have a two-dimensional picture which is known as a vector field.The Fibonacci sequence and linear algebra2019-07-28T13:30:00+00:002019-07-28T13:30:00+00:00https://fabiandablander.com/r/Fibonacci<p>Leonardo Bonacci, better known as Fibonacci, has influenced our lives profoundly. At the beginning of the $13^{th}$ century, he introduced the Hindu-Arabic numeral system to Europe. Instead of the Roman numbers, where <strong>I</strong> stands for one, <strong>V</strong> for five, <strong>X</strong> for ten, and so on, the Hindu-Arabic numeral system uses position to index magnitude. This leads to much shorter expressions for large numbers.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>While the history of the <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">numerical system</a> is fascinating, this blog post will look at what Fibonacci is arguably most well known for: the <em>Fibonacci sequence</em>. In particular, we will use ideas from linear algebra to come up with a closed-form expression of the $n^{th}$ Fibonacci number<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. On our journey to get there, we will also gain some insights about recursion in R.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h1 id="the-rabbit-puzzle">The rabbit puzzle</h1>
<p>In Liber Abaci, Fibonacci poses the following question (paraphrasing):</p>
<blockquote>
<p>Suppose we have two newly-born rabbits, one female and one male. Suppose these rabbits produce another pair of female and male rabbits after one month. These newly-born rabbits will, in turn, also mate after one month, producing another pair, and so on. Rabbits never die. How many pairs of rabbits exist after one year?</p>
</blockquote>
<p>The Figure below illustrates this process. Every point denotes one rabbit pair over time. To indicate that every newborn rabbit pair needs to wait one month before producing new rabbits, rabbits that are not fertile yet are coloured in grey, while rabbits ready to procreate are coloured in red.</p>
<div style="text-align:center;">
<img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" />
</div>
<p>We can derive a linear recurrence relation that describes the Fibonacci sequence. In particular, note that rabbits never die. Thus, at time point $n$, all rabbits from time point $n - 1$ carry over. Additionally, we know that every fertile rabbit pair will produce a new rabbit pair. However, newly-born pairs have to wait one month, so that the number of fertile rabbit pairs equals the number of rabbit pairs at time point $n - 2$. Consequently, the Fibonacci sequence {$F_n$}$_{n=1}^{\infty}$ is:</p>
<script type="math/tex; mode=display">F_n = F_{n-1} + F_{n-2} \enspace ,</script>
<p>for $n \geq 3$ and $F_1 = F_2 = 1$. Before we derive a closed-form expression that computes the $n^{th}$ Fibonacci number directly, in the next section, we play around with alternative, more straightforward solutions in R.</p>
<h1 id="implementation-in-r">Implementation in R</h1>
<p>We can write a wholly inefficient, but beautiful program to compute the $n^{th}$ Fibonacci number:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">))</span></code></pre></figure>
<p>R takes roughly 5 seconds to compute the $30^{\text{th}}$ Fibonacci number; computing the $40^{\text{th}}$ number exhausts my patience. This recursive solution is not particularly efficient because R executes the function an unnecessary number of times. For example, the call tree for <em>fib(5)</em> is:</p>
<ul>
<li><em>fib(5)</em></li>
<li><em>fib(4)</em> + <em>fib(3)</em></li>
<li>(<em>fib(3)</em> + <em>fib(2)</em>) + (<em>fib(2)</em> + <em>fib(1)</em>)</li>
<li>((<em>fib(2)</em> + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
<li>(((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
</ul>
<p>which shows that <em>fib(2)</em> was called three times. This is not necessary, as we can store the outcome of this function call instead of recomputing it every time. This technique is called <a href="https://en.wikipedia.org/wiki/Memoization">memoization</a> (see also the R package <a href="https://github.com/r-lib/memoise">memoise</a>). Implementing this leads to:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_mem</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">inside</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">cache</span><span class="p">))</span><span class="w">
</span><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">inside</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w"> </span><span class="o"><<-</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>This computes the $1000^{th}$ Fibonacci number in a tenth of a second. We can, of course, write this sequentially, and also store all intermediate Fibonacci numbers. This also avoids memory issues brought about by the recursive implementation. Interestingly, although this algorithm seems like it should be $O(n)$, it is actually $O(n^2)$ since we are adding increasingly large numbers (for more on this, see <a href="https://catonmat.net/linear-time-fibonacci">here</a>).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">num</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-2</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">num</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The first 30 Fibonacci numbers are: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040.</p>
<p>This is a rapid increase, as made apparent by the left Figure below. The Figure on the right shows that there is structure in how the sequence grows.</p>
<p><img src="/assets/img/2019-07-28-Fibonacci.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto;" /></p>
<p>We will return to the structure in growth at the end of the blog post. First, we need to derive a closed-form expression of the $n^{th}$ Fibonacci number. In the next section, we take a step towards that by realizing that diagonal matrices make for easier computations.</p>
<h1 id="diagonal-matrices-are-good">Diagonal matrices are good</h1>
<p>Our goal is to get a closed-form expression of the $n^{th}$ Fibonacci number. The first thing to note is that, because the recurrence is linear, we can view computing Fibonacci numbers as repeatedly applying a linear map. In particular, define $T \in \mathcal{L}(\mathbb{R}^2)$ by:</p>
<script type="math/tex; mode=display">T(x, y) = (y, x + y) \enspace .</script>
<p>We note that:</p>
<script type="math/tex; mode=display">T^n(0, 1) = (F_n, F_{n+1}) \enspace ,</script>
<p>which we will prove by induction. In particular, note that the base case $n = 1$:</p>
<script type="math/tex; mode=display">T^1(0, 1) = (1, 0 + 1) = (1, 1) = (F_1, F_2) \enspace ,</script>
<p>does in fact give the first two Fibonacci numbers. Now for the induction step: we assume that this holds for an arbitrary $n$, and we show that it holds for $n + 1$ using the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T^n(0, 1) &= (F_n, F_{n+1}) \\[1em]
T(T^n(0, 1)) &= T(F_n, F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_n + F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_{n+2}) \enspace .
\end{aligned} %]]></script>
<p>The last equality follows from the definition of the Fibonacci sequence, i.e., the fact that any number is equal to the sum of the previous two numbers. The matrix of this linear map with respect to the standard basis is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
A \equiv \mathcal{M}(T) = \begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \enspace , %]]></script>
<p>since $T(1, 0) = (0, 1)$ and $T(0, 1) = (1, 1)$. Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y \\ x + y \end{pmatrix} \enspace . %]]></script>
<p>In the sequential R code for computing the Fibonacci numbers, we have applied the linear map $n$ times, which gave us the Fibonacci number we were interested in. We can write this in matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
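<p>As a quick sanity check, we can verify this matrix formulation in R by applying $A$ repeatedly (a small sketch; the function name <code>fib_matpow</code> is ours, for illustration):</p>

```r
A <- matrix(c(0, 1,
              1, 1), nrow = 2, byrow = TRUE)

fib_matpow <- function(n) {
  res <- c(0, 1)
  # multiply by A n times: A^n (0, 1)^T = (F_n, F_{n+1})^T
  for (i in seq(n)) {
    res <- A %*% res
  }
  res
}

fib_matpow(10)  # (55, 89), i.e., F_10 and F_11
```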
<p>If you were to compute, say, the $3^{rd}$ Fibonacci number using this matrix equation, you would have to compute $A^3$ by repeated matrix multiplication. Now assume you had something like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Using the above equation, the matrix powers would become trivial:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n = \begin{pmatrix} \lambda_1^n & 0 \\ 0 & \lambda_2^n \end{pmatrix} \enspace . %]]></script>
<p>There would be no need to repeatedly engage in matrix multiplication; instead, we would arrive at the $n^{th}$ Fibonacci number using only scalar multiplication! Our task is thus as follows: find a new matrix for the linear map which is diagonal. To solve this, we will need eigenvalues and eigenvectors.</p>
<h1 id="finding-eigenvalues-and-eigenvectors">Finding eigenvalues and eigenvectors</h1>
<p>An eigenvector-eigenvalue pair $(v, \lambda)$, with $v \neq 0$, satisfies:</p>
<script type="math/tex; mode=display">Tv = \lambda v \enspace ,</script>
<p>which means that for a particular vector $v$, the linear map only stretches the vector by a constant $\lambda$. Here’s the key: using the eigenvectors as a basis, the matrix of the linear map is diagonal. This is because the matrix of our linear map, $A$, is defined by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= A_{11} v_1 + A_{21} v_2 \\
Tv_2 &= A_{12} v_1 + A_{22} v_2 \enspace .
\end{aligned} %]]></script>
<p>Now since the basis consists only of eigenvectors, we know that $Tv_1 = \lambda_1 v_1$ and $Tv_2 = \lambda_2 v_2$, which implies that $A_{11} = \lambda_1$ and $A_{21} = 0$, as well as $A_{12} = 0$ and $A_{22} = \lambda_2$. For a wonderful explanation of eigenvalues and eigenvectors, see <a href="https://www.youtube.com/watch?v=PFDu9oVAE-g">this video</a> by 3Blue1Brown.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
<p>In order to find the eigenvalues and eigenvectors, note that the linear map satisfies the following two equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T(x, y) &= \lambda (x, y) \\[1em]
T(x, y) &= (y, x + y) \enspace .
\end{aligned} %]]></script>
<p>This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= y \\[1em]
\lambda y &= x + y \enspace .
\end{aligned} %]]></script>
<p>We substitute the first expression into the second one, yielding:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 x &= x + y \\[1em]
(\lambda^2 - 1)x &= y \enspace ,
\end{aligned} %]]></script>
<p>which we then substitute back into the first equation, resulting in:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= (\lambda^2 - 1)x\\[1em]
0 &= \lambda^2 - \lambda - 1\enspace .
\end{aligned} %]]></script>
<p>We can now apply the <em>quadratic formula</em> or “Mitternachtsformel”, as it is called in parts of Germany because students should know the formula when they are roused from sleep at midnight. We are neither in Germany, nor is it midnight, nor can I actually remember the formula, so let’s quickly derive it for our problem:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 - \lambda - 1 &= 0 \\[1em]
\lambda^2 - \lambda &= 1 \\[1em]
4\lambda^2 - 4\lambda &= 4 \\[1em]
4\lambda^2 - 4\lambda + 1&= 4 + 1 \\[1em]
(2\lambda - 1)^2&= 4 + 1 \\[1em]
2\lambda - 1 &= \pm \sqrt{4 + 1} \\[1em]
\lambda &= \frac{1 \pm \sqrt{5}}{2} \enspace .
\end{aligned} %]]></script>
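<p>As a quick numerical sanity check, base R’s <code>eigen</code> recovers the same two eigenvalues:</p>

```r
A <- matrix(c(0, 1,
              1, 1), nrow = 2, byrow = TRUE)

eigen(A)$values                          # numerical eigenvalues of A
c((1 + sqrt(5)) / 2, (1 - sqrt(5)) / 2)  # 1.618034 and -0.618034
```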
<p>Now that we have found both eigenvalues, we go hunting for the eigenvectors! We put the eigenvalue into the equations from above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{1 \pm \sqrt{5}}{2} x &= y \\[1em]
\frac{1 \pm \sqrt{5}}{2} y &= x + y \enspace .
\end{aligned} %]]></script>
<p>If we set $x = 1$, then $y = \frac{1 \pm \sqrt{5}}{2}$. Thus, two eigenvectors are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
v_1 &= \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
v_2 &= \left(1, \frac{1 - \sqrt{5}}{2}\right) \enspace .
\end{aligned} %]]></script>
<p>As a sanity check to see whether this is indeed true, we check whether $Tv_1 = \lambda_1 v_1$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2}\right) \\[1em]
\lambda v_1 &= \frac{1 + \sqrt{5}}{2} \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \left(\frac{1 + \sqrt{5}}{2}\right)^2\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{1 + 2\sqrt{5} + 5}{4}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{3}{2} + \frac{\sqrt{5}}{2} \right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2} \right) \enspace ,
\end{aligned} %]]></script>
<p>which shows that the two expressions are equal. Moreover, the dot product of the two eigenvectors is zero, which means that they are orthogonal and hence linearly independent (as they should be). In the next section, we will find that <a href="https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation">the same territory can be described by different maps</a>.</p>
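<p>Before moving on, both claims (that $v_1$ is an eigenvector of $A$ and that $v_1$ and $v_2$ are orthogonal) are easy to confirm numerically:</p>

```r
lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
v1 <- c(1, lambda1)
v2 <- c(1, lambda2)
A <- matrix(c(0, 1,
              1, 1), nrow = 2, byrow = TRUE)

all.equal(as.vector(A %*% v1), lambda1 * v1)  # TRUE
sum(v1 * v2)                                  # zero, up to floating point error
```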
<h1 id="change-of-basis">Change of basis</h1>
<p>Now that we have found the eigenvalues and eigenvectors, we can create the matrix $D$ of the linear map $T$ which is diagonal with respect to the basis of eigenvectors:</p>
<script type="math/tex; mode=display">% <![CDATA[
D = \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace . %]]></script>
<p>We are not done yet, however. Note that $D$ is the matrix of the linear map $T$ with respect to the basis that consists of both eigenvectors $v_1$ and $v_2$, <em>not</em> with respect to the standard basis. We have changed our coordinate system — our map — as indicated by the Figure below; the black coloured vectors are the standard basis vectors while the vectors coloured in red are our new basis vectors.</p>
<div style="text-align:center;">
<img src="../assets/img/change-of-basis.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" />
</div>
<p>To build some intuition, let’s play around with representing a vector $\omega$, drawn in the Figure above, in both the standard basis and our new eigenbasis. Any vector is a linear combination of the basis vectors. Let $a_1$ and $a_2$ be the coefficients for the standard basis such that:</p>
<script type="math/tex; mode=display">\omega = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>Now because I have drawn it earlier, I know that $a_1 = -1$ and $a_2 = 0.3$. This is the representation of $\omega$ in the standard basis. How do we represent it in our eigenbasis? Well, using the eigenbasis the vector $\omega$ is still a linear combination of the basis vectors, but with different coefficients; denote them as $b_1$ and $b_2$. We thus have:</p>
<script type="math/tex; mode=display">\omega = b_1 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} + b_2 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>If we write this in matrix form, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}\\[1em]
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>Thus, we can represent a vector $a$, given in the standard basis $S$, in our new eigenbasis $E$ by computing:</p>
<script type="math/tex; mode=display">b = E^{-1} S \, a \enspace .</script>
<p>In our eigenbasis, the vector $\omega$ has the coordinates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lambda1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">lambda2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">a</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] -0.1422291
## [2,] -0.8577709</code></pre></figure>
<p>This means we have the representation:</p>
<script type="math/tex; mode=display">\omega = -0.14 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} - 0.86 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace ,</script>
<p>which makes intuitive sense when you look at the Figure above. For another beautiful linear algebra video by 3Blue1Brown, this time about changing bases, see <a href="https://www.youtube.com/watch?v=P2LTAUO1TdA&t=598s">here</a>. In the next section, we will use what we have learned above to express the $n^{th}$ Fibonacci number in closed-form.</p>
<h1 id="closed-form-fibonacci">Closed-form Fibonacci</h1>
<p>Recall from above that our solution to finding the $n^{th}$ Fibonacci number in matrix form is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Now, we have swapped the non-diagonal matrix $A$ with the diagonal matrix $D$ by changing the basis from the standard basis to the eigenbasis. However, the vector $(0, 1)^T$ is still in the standard basis! In order to change its representation to the eigenbasis, we multiply it with $E^{-1}$, as discussed above. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Let’s use this to compute, say, the $10^{th}$ Fibonacci number (which is 55) in R. Note that <code>^</code> is element-wise in R; it coincides with the matrix power here only because $D$ is diagonal:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">D</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lambda1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55.003636123
## [2,] -0.003636123</code></pre></figure>
<p>Ha! This didn’t quite work, did it? We got the answer for $F_{10}$ roughly when rounding, but $F_{11}$ is completely off. What did we miss? Well, this is in fact the correct answer — it is just in the wrong basis! We have to convert this from the eigenbasis to the standard basis. To do this, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
b &= E^{-1} S \, a \\
E b &= S \, a \\
E b &= a \enspace ,
\end{aligned} %]]></script>
<p>since $S$ is the identity matrix. Thus, all we have to do is to multiply with $E$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55
## [2,] 89</code></pre></figure>
<p>which is the correct solution. To get the closed-form solution algebraically, we first invert the matrix $E$:</p>
<script type="math/tex; mode=display">% <![CDATA[
E^{-1} = -\frac{1}{\sqrt{5}} \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix} \enspace , %]]></script>
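<p>We can check this expression for $E^{-1}$ against R’s <code>solve</code>:</p>

```r
E <- cbind(c(1, (1 + sqrt(5)) / 2), c(1, (1 - sqrt(5)) / 2))
# the analytic inverse derived above, filled in column by column
Einv <- -1 / sqrt(5) * matrix(c((1 - sqrt(5)) / 2, -(1 + sqrt(5)) / 2, -1, 1), nrow = 2)

all.equal(Einv, solve(E))  # TRUE
```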
<p>and we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n -\frac{1}{\sqrt{5}} \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} -1 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n \\ \left(\frac{1 - \sqrt{5}}{2}\right)^n \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n + \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ -\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} + \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \\[1em]
&= \frac{1}{\sqrt{5}} \begin{pmatrix} \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>The closed-form expression of the $n^{th}$ Fibonacci number is thus given by:</p>
<script type="math/tex; mode=display">F_n = \frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right] \enspace .</script>
<p>We verify this in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_closed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fib_closed</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">30</span><span class="p">)))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1 1 2 3 5 8 13 21 34 55
## [11] 89 144 233 377 610 987 1597 2584 4181 6765
## [21] 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040</code></pre></figure>
<h1 id="the-golden-ratio">The golden ratio</h1>
<p>In the above section, we have derived a closed-form expression of the $n^{th}$ Fibonacci number. In this section, we return to an observation we have made at the beginning: there is structure in how the Fibonacci numbers grow. Johannes Kepler, after whom the university in my home town is named, (re)discovered that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lim_{n \rightarrow \infty} \frac{F_{n+1}}{F_n} &= \lim_{n \rightarrow \infty} \frac{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \right]}{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right]} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n} \\[1em]
&= \frac{1 + \sqrt{5}}{2} \approx 1.618 \enspace ,
\end{aligned} %]]></script>
<p>which is the <a href="https://en.wikipedia.org/wiki/Golden_ratio">golden ratio</a>; the terms involving $\frac{1 - \sqrt{5}}{2}$ vanish in the limit because its absolute value is smaller than one. The golden ratio $\phi$ is defined by the property that the ratio of the larger part to the smaller part equals the ratio of the sum of the parts to the larger part, i.e., for $a > b > 0$:</p>
<script type="math/tex; mode=display">\phi \equiv \frac{a}{b} = \frac{a + b}{a} \enspace .</script>
<p>We have observed this empirically in the first Figure, which visualized the difference between the logs of consecutive Fibonacci numbers, and which already for small $n$ yields:</p>
<script type="math/tex; mode=display">\text{log} \, F_{n+1} - \text{log} \, F_n = \text{log} \, \frac{F_{n + 1}}{F_n} \approx 0.4812 \enspace ,</script>
<p>which exponentiated yields the golden ratio. Observe that $\left(\frac{1 - \sqrt{5}}{2}\right)^n$ goes to zero very quickly as $n$ grows so that we can compute the $n^{th}$ Fibonacci number by:</p>
<script type="math/tex; mode=display">F_n = \left \lfloor \frac{1}{\sqrt{5}} \phi^n \right \rceil \enspace ,</script>
<p>where we simply round to the nearest integer. To finally answer Fibonacci’s puzzle:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_golden</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="nf">round</span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w">
</span><span class="n">fib_golden</span><span class="p">(</span><span class="m">12</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 144</code></pre></figure>
<p>After a mere twelve months of incest, there are 144 rabbit pairs!<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
<p>There are various <a href="https://en.wikipedia.org/wiki/Generalizations_of_Fibonacci_numbers">generalizations</a> of the Fibonacci sequence. One such generalization is to allow higher orders $k$ in the sequence, which for $k = 3$ is known as the <a href="https://www.youtube.com/watch?v=fMJflV_GUpU">Tribonacci sequence</a>. Our approach for $k = 2$ can be straightforwardly generalized to account for any order $k$ (if you want to go down a rabbit hole, see for example <a href="https://math.stackexchange.com/questions/41667/fibonacci-tribonacci-and-other-similar-sequences">this</a>).</p>
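<p>As a small illustration, here is a sketch of a sequential Tribonacci implementation, analogous to <code>fib_seq</code>, using the common convention that the sequence starts with $0, 1, 1$ (the function name <code>trib_seq</code> is ours):</p>

```r
trib_seq <- function(n) {
  num <- c(0, 1, 1, rep(0, max(0, n - 3)))
  if (n >= 4) {
    # each number is the sum of the previous three
    for (i in seq(4, n)) {
      num[i] <- num[i-1] + num[i-2] + num[i-3]
    }
  }
  num[seq_len(n)]
}

trib_seq(10)  # 0 1 1 2 4 7 13 24 44 81
```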
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have taken a detailed look at the Fibonacci sequence. In particular, we saw that it is the answer to a puzzle about procreating rabbits, and how to speed up a recursive algorithm for finding the $n^{th}$ Fibonacci number. We then used ideas from linear algebra to arrive at a closed-form expression of the $n^{th}$ Fibonacci number. Specifically, we have noted that the Fibonacci sequence is a linear recurrence relation — it can be viewed as repeatedly applying a linear map. With this insight, we observed that the matrix of the linear map is non-diagonal, which makes repeated execution tedious; diagonal matrices, on the other hand, are easy to multiply. We arrived at a diagonal matrix by changing the basis from the standard basis to the basis of eigenvectors, which led to a diagonal matrix of eigenvalues for the linear map. With this representation, the $n^{th}$ Fibonacci number is available in closed-form. In order to get it into the standard basis, we had to change basis back from the eigenbasis. We also saw how the Fibonacci numbers relate to the golden ratio $\phi$.</p>
<hr />
<p>I would like to thank Don van den Bergh, Jonas Haslbeck, and Sophia Crüwell for helpful comments on this blog post.</p>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the main reason why the Hindu-Arabic numeral system took over. The belief that it is easier to multiply and divide using Hindu-Arabic numerals is <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">incorrect</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This blog post is inspired by exercise 16 on p. 161 in <a href="http://linear.axler.net/">Linear Algebra Done Right</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I have learned that there is already (very good) ink spilled on this topic, see for example <a href="https://bosker.wordpress.com/2011/04/29/the-worst-algorithm-in-the-world/">here</a> and <a href="https://bosker.wordpress.com/2011/07/27/computing-fibonacci-numbers-using-binet%E2%80%99s-formula/">here</a>. A nice essay is also <a href="https://opinionator.blogs.nytimes.com/2012/09/24/proportion-control/?mtrref=undefined&gwh=C0500419D79A9E5B64F17ABC970C5125&gwt=pay">this</a> piece by Steve Strogatz, who, by the way, wrote a wonderful book called <a href="https://www.goodreads.com/book/show/354421.Sync">Sync</a>. He’s also been on Sean Carroll’s Mindscape podcast, listen <a href="https://www.preposterousuniverse.com/podcast/2019/04/08/episode-41-steven-strogatz-on-synchronization-networks-and-the-emergence-of-complex-behavior/">here</a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>If you forget everything that is written in this blog post, but through it were made aware of the videos by 3Blue1Brown (or <a href="https://www.numberphile.com/podcast/3blue1brown">Grant Sanderson</a>, as he is known in the real world), then I consider this blog post a success. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The downside of the closed-form solution is that it is difficult to calculate the power of the square root with high accuracy. In fact, <em>fib_golden</em> is incorrect for $n > 70$. Our <em>fib_mem</em> implementation is also incorrect, but only for $n > 93$. (I’ve compared it against Fibonacci numbers calculated from <a href="https://www.miniwebtool.com/list-of-fibonacci-numbers/?number=100">here</a>). <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<h1 id="spurious-correlations-and-random-walks">Spurious correlations and random walks</h1>
<p><em>Published 2019-06-29 at <a href="https://fabiandablander.com/r/Spurious-Correlation">fabiandablander.com/r/Spurious-Correlation</a></em></p>
<p>The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence the correlation.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In this blog post, I discuss a more subtle case of spurious correlation, one that is not of causal but of statistical nature: <em>completely independent processes can be correlated substantially</em>.</p>
<h2 id="ar1-processes-and-random-walks">AR(1) processes and random walks</h2>
<p>Moods, stock markets, the weather: everything changes, everything is in flux. The simplest model to describe change is an auto-regressive (AR) process of order one. Let $Y_t$ be a random variable where $t = 1, \ldots, T$ indexes discrete time. We write an AR(1) process as:</p>
<script type="math/tex; mode=display">Y_t = \phi \, Y_{t-1} + \epsilon_t \enspace ,</script>
<p>where $\phi$ gives the correlation with the previous observation, and where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. For $\phi = 1$, the process is called a <em>random walk</em>. We can simulate from such processes using the following code:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">simulate_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">phi</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">y</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The following R code simulates data from three independent random walks and an AR(1) process with $\phi = 0.5$; the Figure below visualizes them.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>As we can see from the plot, the AR(1) process seems pretty well-behaved. This is in contrast to the three random walks: all of them have an initial upward trend, after which the red line keeps on growing, while the blue line makes a downward jump. In contrast to AR(1) processes with $\lvert \phi \rvert < 1$, random walks are <em>not stationary</em>, since their variance is not constant but grows linearly with time. For some very good lecture notes on time-series analysis, see <a href="https://www.economodel.com/time-series-analysis">here</a>.</p>
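<p>A quick way to see the non-stationarity is to simulate many independent random walks and compute the variance <em>across</em> them at different time points: it grows roughly linearly in $t$. The snippet below is a sketch (not from the original post) that builds each walk as a cumulative sum of Gaussian noise, which matches <code>simulate_ar</code> with $\phi = 1$ in distribution.</p>

```r
# simulate 5000 independent random walks of length 100 with sigma = .1;
# each column of 'walks' is one walk (a cumulative sum of Gaussian noise)
set.seed(1)
sigma <- .1
walks <- replicate(5000, cumsum(rnorm(100, 0, sigma)))

# variance across walks at t = 10 and t = 100; theory gives t * sigma^2,
# i.e., about 0.1 and 1 here
round(c(var(walks[10, ]), var(walks[100, ])), 2)
```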
<h2 id="spurious-correlations-of-random-walks">Spurious correlations of random walks</h2>
<p>If we look at the correlations of these three random walks across time points, we find that they are substantial:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="nf">round</span><span class="p">(</span><span class="n">cor</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">red</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw1</span><span class="p">,</span><span class="w"> </span><span class="n">green</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw2</span><span class="p">,</span><span class="w"> </span><span class="n">blue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw3</span><span class="p">)),</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## red green blue
## red 1.00 -0.49 -0.29
## green -0.49 1.00 0.59
## blue -0.29 0.59 1.00</code></pre></figure>
<p>I hope that this is at least a little bit of a shock. Upon reflection, however, it is clear that we are blundering: computing the correlation across time ignores the dependency between data points that is so typical of time-series data. To get a better sense of what is going on, we conduct a small simulation study.</p>
<p>In particular, we want to get an intuition of how this spurious correlation behaves with increasing sample sizes. We therefore simulate two independent random walks for sample sizes $n \in \{50, 100, 200, 500, 1000, 2000\}$ and compute their Pearson correlation, the test statistic, and whether $p < \alpha$, where we set $\alpha$ to an arbitrary value, say $\alpha = 0.05$. We repeat this 100 times and report the averages of these quantities.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">tab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">data.frame</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">tab</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## n avg_abs_corr avg_abs_tstat percent_sig
## 1 50 0.41 3.57 0.71
## 2 100 0.46 6.58 0.85
## 3 200 0.45 8.88 0.85
## 4 500 0.37 10.63 0.86
## 5 1000 0.41 17.05 0.88
## 6 2000 0.39 23.39 0.97</code></pre></figure>
<p>We observe that the average absolute correlation is very similar across $n$, but the test statistic grows with increasing $n$, which naturally results in many more false rejections of the null hypothesis of no correlation between the two random walks.</p>
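<p>The mechanics behind this are easy to see from the relation between a sample correlation $r$ and its test statistic, $t = r \sqrt{n - 2} / \sqrt{1 - r^2}$: if $r$ stays roughly constant, $t$ grows like $\sqrt{n}$. A small sketch (the value $r = .40$ is illustrative, roughly matching the table above):</p>

```r
# t-statistic implied by a fixed correlation r = .40 at increasing n;
# since t = r * sqrt(n - 2) / sqrt(1 - r^2), it grows like sqrt(n)
r <- .40
ns <- c(50, 100, 200, 500, 1000, 2000)
round(r * sqrt(ns - 2) / sqrt(1 - r^2), 2)
```

<p>Every one of these values exceeds the usual critical value of about $1.96$, so a constant correlation alone is enough to produce ever-larger test statistics.</p>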
<p>To my knowledge, Granger and Newbold (1974) were the first to point out this puzzling fact.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> They regress one random walk onto the other instead of computing the Pearson correlation. (Note that the test statistic is the same). In a regression setting, we write:</p>
<script type="math/tex; mode=display">Y = \beta_0 + \beta_1 X + \epsilon \enspace ,</script>
<p>where we assume that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ (see also a <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">previous</a> blog post). This is evidently violated when performing linear regression on two random walks, as demonstrated by the residual plot below.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
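<p>For concreteness, the model object <code>m</code> used in the Durbin-Watson test below can be fit as follows. This is a sketch that re-creates the two random walks from above so the snippet is self-contained; note also that the slope's $t$-statistic equals the one from <code>cor.test</code>.</p>

```r
# re-create two independent random walks (as simulated earlier) and
# regress one onto the other; 'm' is the model used in the
# Durbin-Watson test below
set.seed(1)
simulate_ar <- function(n, phi, sigma = .1) {
  y <- rep(0, n)
  for (t in seq(2, n)) {
    y[t] <- phi*y[t-1] + rnorm(1, 0, sigma)
  }
  y
}

rw1 <- simulate_ar(100, phi = 1)
rw2 <- simulate_ar(100, phi = 1)
m <- lm(rw1 ~ rw2)

# the slope's t-statistic equals the t-statistic of cor.test(rw1, rw2)
coef(summary(m))[2, 3]
```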
<p>Similar to the above, we can posit an AR(1) process on the residuals:</p>
<script type="math/tex; mode=display">\epsilon_t = \delta \epsilon_{t-1} + \eta_t \enspace ,</script>
<p>and test whether $\delta = 0$. We can do so using the <a href="https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic">Durbin-Watson test</a>, which yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">car</span><span class="o">::</span><span class="n">durbinWatsonTest</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## lag Autocorrelation D-W Statistic p-value
## 1 0.9357562 0.08623868 0
## Alternative hypothesis: rho != 0</code></pre></figure>
<p>This indicates substantial autocorrelation, violating our modeling assumption of independent residuals. In the next section, we look at the deeper mathematical reasons for why we get such spurious correlation. In the Post Scriptum, we relax the constraint that $\phi = 1$ and look at how spurious correlation behaves for AR(1) processes.</p>
<!-- In the next section, we will look more formally into the curious fact that two independent random walks are correlated. To understand why even with large $n$ the estimation goes awry, we have to make an excursion into asymptotia. -->
<h2 id="inconsistent-estimation">Inconsistent estimation</h2>
<p>The simulation results from the random walk simulations showed that the average (absolute) correlation stays roughly constant, while the test statistic increases with $n$. This indicates a problem with our estimator for the correlation. Because it is slightly easier to study, we focus on the regression parameter $\beta_1$ instead of the Pearson correlation. <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">Recall</a> that our regression estimate is</p>
<script type="math/tex; mode=display">\hat{\beta}_1 = \frac{\sum_{t=1}^N (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^N (x_t - \bar{x})^2} \enspace ,</script>
<p>where $\bar{x}$ and $\bar{y}$ are the empirical means of the realizations $x_t$ and $y_t$ of the AR(1) processes $X_t$ and $Y_t$, respectively. The test statistic associated with the null hypothesis $\beta_1 = 0$ is</p>
<script type="math/tex; mode=display">t_{\text{statistic}} := \frac{\hat{\beta_1} - 0}{se(\hat{\beta_1})} = \frac{\hat{\beta_1}}{\hat{\sigma} / \sqrt{\sum_{t=1}^N (x_t - \bar{x})^2}} \enspace ,</script>
<p>where $\hat{\sigma}$ is the estimated standard deviation of the error. In simple linear regression, the test statistic follows a t-distribution with $n - 2$ degrees of freedom (it takes two parameters to fit a straight line). In the case of independent random walks, however, the test statistic does not have a limiting distribution; in fact, as $n \rightarrow \infty$, the distribution of $t_{\text{statistic}}$ diverges (Phillips, 1986).</p>
<p>To get an intuition for this, we plot the bootstrapped sampling distributions for $\beta_1$ and $t_{\text{statistic}}$, both for the case of regressing one independent AR(1) process onto another, and for random walk regression.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">regress_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">)))[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bootstrap_limit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">ns</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'b1'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">regress_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">coefs</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ix</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">res_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w">
</span><span class="n">res_rw</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span></code></pre></figure>
<p>The Figure below illustrates how things go wrong when regressing one independent random walk onto the other. In contrast to the estimate for the AR(1) regression, the estimate $\hat{\beta}_1$ does not decrease in the case of a random walk regression. Instead, it stays roughly within $[-0.75, 0.75]$ across all $n$. This sheds further light on the initial simulation result that the average correlation stays roughly the same. Moreover, in contrast to the AR(1) regression, for which the distribution of the test statistic does not change, the distribution of the test statistic for the random walk regression seems to diverge. This explains why the proportion of false positives increases with $n$.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
<p>Rigorous arguments for the above statements can be found in Phillips (1986) and Hamilton (1994, pp. 577).<sup id="fnref:4"><a href="#fn:4" class="footnote">3</a></sup> The explanations feature some nice asymptotic arguments which I would love to go into in detail; however, I’m currently in Santa Fe for a summer school that has a very tightly packed programme. On that note: it is <a href="https://www.santafe.edu/engage/learn/schools/sfi-complex-systems-summer-school">very, very cool</a>. You should definitely apply next year! In addition to the stimulating lectures, wonderful people, and exciting projects, the surroundings are stunning<sup id="fnref:5"><a href="#fn:5" class="footnote">4</a></sup>.</p>
<div style="text-align:center;">
<img src="../assets/img/IAIA.jpeg" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="720" height="620" />
</div>
<!-- ### Brownian Motion -->
<!-- The type of random walk we focused on in this blog post takes place in discrete, equidistant time steps.[^3] If we take the limit of $n \rightarrow \infty$, however, we move from a discrete time random walk to a continuous time Brownian motion. The gist of the argument is to make the difference $\Delta Y_t$ between time points $Y_{t+1}$ and $Y_t$ infinitesimally small. Recall that the Gaussian distribution is [closed under addition](https://fdabl.github.io/statistics/Two-Properties.html), and that -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \sum_{i=1}^t \eta_i \sim \mathcal{N}(0, t \cdot \sigma^2) \enspace \\[1em] -->
<!-- \Delta Y_t &= Y_{t+1} - Y_{t} = \sum_{i=1}^{t+1} \eta_i - \sum_{j=1}^t \eta_j = \eta_t \sim \mathcal{N}(0, \sigma^2) \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- We may cut $\eta_t$ into $n$ pieces -->
<!-- $$ -->
<!-- \eta_t = \eta_{1t} + \eta_{2t} + \ldots + \eta_{nt} \enspace , -->
<!-- $$ -->
<!-- where $\eta_{it} \sim \mathcal{N}(0, \frac{1}{n})$. Therefore, as we increase $n$, the discrete-time process is defined at a finer and finer grid. For $n \rightarrow \infty$, this results into the continuous-time Brownian motion, which we denote as $W(t)$, where $W: t \in [0, 1] \rightarrow \mathbb{R}$. -->
<!-- ## Solutions -->
<!-- Hamilton (1994, p. 562) discusses three solutions. One of them is to *difference* the data before doing the regression, i.e., -->
<!-- $$ -->
<!-- \Delta Y_t = \beta_0 + \beta_1 \Delta X_t + \epsilon_t \enspace , -->
<!-- $$ -->
<!-- where $\Delta Y_t = Y_{t+1} - Y_t$. This does in fact work: -->
<!-- ```{r} -->
<!-- broom::tidy(lm(diff(rw1) ~ diff(rw2))) -->
<!-- ``` -->
<!-- ```{r, echo = FALSE} -->
<!-- n <- 1000 -->
<!-- dat <- matrix(0, nrow = n, ncol = 2) -->
<!-- B <- cbind( -->
<!-- c(.4, .2), -->
<!-- c(-.2, .4) -->
<!-- ) -->
<!-- for (i in seq(2, n)) { -->
<!-- z <- rnorm(1) -->
<!-- # dat[i, ] <- dat[i-1, ] %*% B + rnorm(2) -->
<!-- dat[i, ] <- c(.8, .4) * z + rnorm(2) -->
<!-- } -->
<!-- ``` -->
<!-- Why? Let $\eta_t$ and $\psi_t$ denote the errors of the two processes $Y$ and $X$, respectively, distributed according to zero-mean Gaussian with variances $\sigma_y$ and $\sigma_x$. We write -->
<!-- $$ -->
<!-- \Delta Y_t = \sum_{i=1}^{t+1} \eta_i - \sum_{i=1}^{t} \eta_i = \eta_{t+1} \sim \mathcal{N}(0, \sigma_y^2) \\[1em] -->
<!-- \Delta X_t = \sum_{i=1}^{t+1} \psi_i - \sum_{i=1}^{t} \eta_i = \psi_{t+1} \sim \mathcal{N}(0, \sigma_x^2) \enspace . -->
<!-- $$ -->
<!-- Now, since the respective differences are independent of each other, their correlation will be zero. -->
<!-- However, Hamilton notes that if the time-series are really stationary ($\vert \phi \lvert < 1$), then this can result in misspecified regression. Moreover, if $Y$ and $X$ are non-stationary but *cointegrated processes*, then this also will result in misspecification. -->
<h2 id="conclusion">Conclusion</h2>
<p>“Correlation does not imply causation” is a common response to apparently spurious correlation. The idea is that we observe spurious associations because we do not have the full causal picture, as in the example of storks and human babies. In this blog post, we have seen that spurious correlation can be due to solely statistical reasons. In particular, we have seen that two independent random walks can be highly correlated. This can be diagnosed by looking at the residuals, which will <em>not</em> be independent and identically distributed, but will show a pronounced autocorrelation.</p>
<p>The mathematical explanation for the spurious correlation is not trivial. Using simulations, we found that the estimate of $\beta_1$ does not converge to the true value in the case of regressing one independent random walk onto another. Moreover, the test statistic diverges, meaning that with increasing sample size we are almost certain to reject the null hypothesis of no association. The spurious correlation occurs because our estimate is not consistent, which is a purely statistical explanation that does not invoke causal reasoning.</p>
<hr />
<p><em>I want to thank Toni Pichler and Andrea Bacilieri for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<!-- ### Mean and variance of AR(1) and random walk -->
<!-- To better understand the differences between AR(1) processes and random walks, we look at their respective first two moments. We write out the process for some window of length $j$, and then recursively substitute: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \phi \, Y_{t-1} + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, Y_{t-2} + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, \left(\phi \, Y_{t-3} + \epsilon_{t-2}\right) + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \vdots \\[.5em] -->
<!-- &= \phi^{j + 1} \, Y_{t - (j + 1)} + \sum_{i=0}^{j} \phi^i \epsilon_{t-i} \\[.5em] -->
<!-- &= \sum_{i=0}^{t-1} \phi^i \epsilon_{t-i} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we assume that $Y_0 = 0$ is fixed. Let's compute the first two moments of this process. Exploiting linearity, we write: -->
<!-- $$ -->
<!-- \mathbb{E}[Y_t] = \mathbb{E}\left[\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \mathbb{E}\left[\phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \phi^i \mathbb{E}\left[\epsilon_{t-i}\right] = 0 \enspace . -->
<!-- $$ -->
<!-- This is also true for $\phi = 1$, i.e., a random walk. For the variance, we write: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\left(Y_t - \mathbb{E}[Y_t]\right)^2\right] -->
<!-- = \mathbb{E}\left[Y_t^2\right] -->
<!-- = \mathbb{E}\left[\left(\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right)^2\right] \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we split the quadratic into ["diagonal"](https://math.stackexchange.com/questions/125435/what-is-the-opposite-of-a-cross-term) terms and cross-terms, the latter of which have expectation zero by our assumption that the residuals are independent: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2 + \sum_{i=0}^{t - 1} \sum_{j\neq i}^{t - 1} \left(\phi^i \epsilon_{t-i}\right) \left(\phi^j \epsilon_{t-j}\right)\right] \\[.5em] -->
<!-- &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \mathbb{E}\left[\left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \left(\phi^i\right)^2 \mathbb{E}\left[\epsilon_{t-i}^2\right] \\[.5em] -->
<!-- &= \sigma^2\sum_{i=0}^{t - 1} \left(\phi^2\right)^i \\[.5em] -->
<!-- &= \sigma^2 \frac{1}{1 - \phi^2} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where the last line follows from the geometric series as $t \rightarrow \infty$ for $\vert\phi\vert < 1$. For a random walk ($\phi = 1$), however, the sum is no longer geometric: it equals $t$, so the variance $t\sigma^2$ grows without bound, and the random walk has no stationary variance. -->
<h3 id="spurious-correlation-of-ar1-processes">Spurious correlation of AR(1) processes</h3>
<p>In the main text, we have looked at how the spurious correlation behaves for a random walk. Here, we study how the spurious correlation behaves as a function of $\phi \in [0, 1]$. We focus on a sample size of $n = 200$, and adapt the simulation code from above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">200</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">phis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.02</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phis</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'phi'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">phi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">phi</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>The Figure below shows that the issue of spurious correlation gets progressively worse as the AR(1) process approaches a random walk (i.e., $\phi = 1$). Note, however, that for $\vert\phi\vert < 1$ the regression estimate remains consistent.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<h2 id="references">References</h2>
<ul>
<li>Granger, C. W., & Newbold, P. (<a href="http://wolfweb.unr.edu/~zal/STAT758/Granger_Newbold_1974.pdf">1974</a>). Spurious regressions in econometrics. <em>Journal of Econometrics, 2</em>(2), 111-120.</li>
<li>Hamilton, J. D. (<a href="https://press.princeton.edu/titles/5386.html">1994</a>). Time Series Analysis. Princeton, NJ: Princeton University Press.</li>
<li>Kuiper, R. M., & Ryan, O. (<a href="https://www.tandfonline.com/doi/full/10.1080/10705511.2018.1431046">2018</a>). Drawing conclusions from cross-lagged relationships: Re-considering the role of the time-interval. <em>Structural Equation Modeling: A Multidisciplinary Journal, 25</em>(5), 809-823.</li>
<li>Phillips, P. C. (<a href="http://dido.econ.yale.edu/korora/phillips/pubs/art/a044.pdf">1986</a>). Understanding spurious regressions in econometrics. <em>Journal of Econometrics, 33</em>(3), 311-340.</li>
<li>Matthews, R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9639.00013?casa_token=cWUllTD9P14AAAAA:PRERZz-uS2z9xX3DGt0-Qize94FuZuw-35s-2ECfUDY9Oi3J1m83cZh8EBHGlGh7fwQ2WHShOQuwB-YO">2000</a>). Storks deliver babies (p = 0.008). <em>Teaching Statistics, 22</em>(2), 36–38.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There are, of course, many <a href="https://www.tylervigen.com/spurious-correlations">more</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Thanks to Toni Pichler for drawing my attention to the fact that independent random walks are correlated, and Andrea Bacilieri for providing me with the classic references. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Moreover, one way to avoid the spurious correlation is to <em>difference</em> the time-series. For other approaches, see Hamilton (1994, p. 561). <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>This awesome picture was made by Luther Seet. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderThe number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence the correlation.1 In this blog post, I discuss a more subtle case of spurious correlation, one that is not of causal but of statistical nature: completely independent processes can be correlated substantially. AR(1) processes and random walks Moods, stockmarkets, the weather: everything changes, everything is in flux. The simplest model to describe change is an auto-regressive (AR) process of order one. Let $Y_t$ be a random variable where $t = [1, \ldots T]$ indexes discrete time. We write an AR(1) process as: where $\phi$ gives the correlation with the previous observation, and where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. For $\phi = 1$ the process is called a random walk. We can simulate from these using the following code: There are, of course, many more. ↩Bayesian modeling using Stan: A case study2019-05-30T10:00:00+00:002019-05-30T10:00:00+00:00https://fabiandablander.com/r/Law-of-Practice<link rel="stylesheet" href="../highlight/styles/default.css" />
<script src="../highlight/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script>$('pre.stan code').each(function(i, block) {hljs.highlightBlock(block);});</script>
<p>Practice makes better. And faster. But what exactly is the relation between practice and reaction time? In this blog post, we will focus on two contenders: the <em>power law</em> and <em>exponential</em> function. We will implement these models in Stan and extend them to account for learning plateaus and the fact that, with increased practice, not only the mean reaction time but also its variance decreases. We will contrast two perspectives on predictive model comparison: a <em>(prior) predictive</em> perspective based on marginal likelihoods, and a <em>(posterior) predictive</em> perspective based on leave-one-out cross-validation. So let’s get started!<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<h1 id="two-models">Two models</h1>
<p>We can model the relation between reaction time (in seconds) and the number of practice trials as a power law function. Let $f: \mathbb{N} \rightarrow \mathbb{R}^+$ be a function that maps the number of trials to reaction times. We write</p>
<script type="math/tex; mode=display">f_p(N) = \alpha + \beta N^{-r} \enspace ,</script>
<p>where $\alpha$ is a lower bound (one cannot respond faster than that due to processing and motor control limits); $\beta$ is the learning gain from practice with respect to the first trial ($N = 1$); $N$ indexes the particular trial; and $r$ is the learning rate. Similarly, we can write</p>
<script type="math/tex; mode=display">f_e(N) = \alpha + \beta e^{-rN} \enspace ,</script>
<p>where the parameters have the same interpretation, except that $\beta$ is the learning gain from practice compared to no practice ($N = 0$).<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<p>What is the main difference between those two functions? <em>The exponential model assumes a constant learning rate, while the power model assumes diminishing returns</em>. To see this, let $\alpha = 0$, and ignore for the moment that $N$ is discrete. Taking the derivative for the power law model results in</p>
<script type="math/tex; mode=display">\frac{\partial f_p(N)}{\partial N} = -r\beta N^{-r - 1} = (-r/N) \, \beta N^{-r} = (-r/N) \, f_p(N) \enspace ,</script>
<p>which shows that the <em>local learning rate</em> — the change in reaction time as a function of $N$ — is $-r/N$; it depends on how many trials have been completed previously. The more one has practiced, the smaller the local learning rate $-r / N$. The exponential function, in contrast, shows no such dependency on practice:</p>
<script type="math/tex; mode=display">\frac{\partial f_e(N)}{\partial N} = -r\beta e^{-rN} = -r \, f_e(N) \enspace .</script>
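<p>These two identities are easy to check numerically. The snippet below is a sketch with arbitrary illustrative parameter values: it treats $N$ as continuous, sets $\alpha = 0$, and compares a central-difference derivative against the closed forms.</p>

```r
# Numerical check of the local learning rates (alpha = 0, illustrative values)
beta <- 2
r <- 0.5
f_p <- function(N) beta * N^(-r)       # power law
f_e <- function(N) beta * exp(-r * N)  # exponential

# Central-difference approximation of the derivative
num_deriv <- function(f, N, h = 1e-6) (f(N + h) - f(N - h)) / (2 * h)

# Power law: derivative equals (-r / N) * f_p(N), so learning slows as N grows
num_deriv(f_p, 10) - (-r / 10) * f_p(10)
# Exponential: derivative equals -r * f_e(N), a constant relative rate
num_deriv(f_e, 10) - (-r) * f_e(10)
```

<p>Both differences are numerically zero, confirming that only the power law's local learning rate depends on the amount of practice.</p>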
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>The figure above visualizes two data sets, generated from either a power law (left) or an exponential model (right), as well as the maximum likelihood fit of both models to these data. It is rather difficult to tell which model performs better just by eyeballing the fit. We thus need to engage in a more formal way of comparing models.</p>
<h1 id="two-perspectives-on-prediction">Two perspectives on prediction</h1>
<p>Let’s agree that the best way to compare models is to look at predictive accuracy. Predictive accuracy with respect to what? The figure below illustrates two different answers one might give. The grey shaded surface represents unobserved data; the white island inside is the observed data; the model is denoted by $\mathcal{M}$.</p>
<p><img src="../assets/img/prediction-perspectives.png" align="center" style="padding: 10px 10px 10px 10px;" /></p>
<p>On the left, the model makes predictions <em>before</em> seeing any data by means of its <em>prior predictive distribution</em>. The predictive accuracy is then evaluated on the actually observed data. In contrast, on the right, the model makes predictions <em>after</em> seeing the data by means of its <em>posterior predictive distribution</em>. In principle, its predictive accuracy is evaluated on data one does not observe (visualized as the grey area). One can estimate this expected <em>out-of-sample</em> predictive accuracy by cross-validation procedures which partition the observed data into a training and test set. The model only sees the training set, and makes predictions for the unseen test set.</p>
<p>One key practical distinction between these two perspectives is how predictions are generated. On the left, predictions are generated from the prior. On the right, the prior first gets updated to the posterior using the observed data, and it is through the posterior that predictions are made. In the next two sections, we make this difference more precise.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h2 id="prior-prediction-marginal-likelihoods">Prior prediction: Marginal likelihoods</h2>
<p>From this perspective, we weight the model’s prediction of the observed data given a particular parameter setting by the prior. This is accomplished by integrating the likelihood with respect to the prior, which gives the so-called <em>marginal likelihood</em> of a model $\mathcal{M}$:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .</script>
<p>It is clear that the prior matters a great deal, but that is no surprise — it is part of the model. A bad prior means a bad model. The ratio of two such marginal likelihoods is known as the Bayes factor.</p>
<p>If one is willing to assign priors to models, one can compute posterior model probabilities, i.e.,</p>
<script type="math/tex; mode=display">\begin{equation}
p(\mathcal{M}_k \mid y) = p(\mathcal{M}_k) \times \frac{p(y \mid \mathcal{M}_k)}{\sum_{i=1}^K p(y \mid \mathcal{M}_i) \, p(\mathcal{M}_i)} \enspace ,
\end{equation}</script>
<p>where $K$ is the number of models under consideration (see also a <a href="https://fdabl.github.io/r/Spike-and-Slab.html">previous</a> blog post). Observe that the marginal likelihood features prominently: it is an updating factor from prior to posterior model probability. With this, one can also compute <em>Bayesian model-averaged</em> predictions:</p>
<script type="math/tex; mode=display">p(\tilde{y} \mid y) = \sum_{k=1}^K p(\tilde{y} \mid y, \mathcal{M}_k) \, \underbrace{p(\mathcal{M}_k \mid y)}_{w_k} \enspace ,</script>
<p>where $\tilde{y}$ is unseen data, and where the prediction of each model gets weighted by its posterior probability. We denote a model weight, which in this case is its posterior probability, as $w_k$.</p>
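<p>As a small numerical sketch — the log marginal likelihood values below are made up — the posterior model probabilities $w_k$ are best computed on the log scale:</p>

```r
# Sketch: posterior model probabilities from hypothetical log marginal
# likelihoods of K = 3 models, under uniform prior model probabilities
log_ml <- c(-120.3, -118.7, -125.1)  # hypothetical values
prior  <- rep(1 / 3, 3)

# Subtract the maximum before exponentiating to avoid numerical underflow
w <- exp(log_ml - max(log_ml)) * prior
w <- w / sum(w)
round(w, 3)
```

<p>The weights sum to one, and the model with the highest marginal likelihood receives the largest posterior probability.</p>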
<h2 id="posterior-prediction-leave-one-out-cross-validation">Posterior prediction: Leave-one-out cross-validation</h2>
<p>Another perspective aims to estimate the expected out-of-sample prediction error, or expected log predictive density, i.e.,</p>
<script type="math/tex; mode=display">\text{elpd}^{\mathcal{M}} = \mathbb{E}_{\tilde{y}} \left(\text{log} \, p(\tilde{y} \mid y) \right) \enspace ,</script>
<p>where the expectation is taken with respect to unseen data $\tilde{y}$ (visualized as the grey surface with a question mark inside in the figure above).</p>
<p>Clearly, as we do not have access to unseen data, we cannot evaluate this. However, one can approximate this quantity by computing the leave-one-out prediction error in our sample:</p>
<script type="math/tex; mode=display">\widehat{\text{elpd}}^{\mathcal{M}}_{\text{loo}} = \frac{1}{n} \sum_{i=1}^n \, \text{log} \, p(y_i \mid y_{-i}) \approx \mathbb{E}_{\tilde{y}} \left(\text{log} \, p(\tilde{y} \mid y) \right) \enspace,</script>
<p>where $y_i$ is the $i^{\text{th}}$ data point, and $y_{-i}$ are all data points except $y_i$, where we have suppressed conditioning on $\mathcal{M}$ to not clutter notation (even more), and where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y_i \mid y_{-i}) &= \int_{\Theta} p(y_i, \theta \mid y_{-i}) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} p(y_i \mid y_{-i}, \theta) \, p(\theta \mid y_{-i}) \, \mathrm{d}\theta \enspace .
\end{aligned} %]]></script>
<p>Observe that this requires integrating over the <em>posterior distribution</em> of $\theta$ given all but one data point; this is in contrast to the marginal likelihood perspective, which requires integration with respect to the <em>prior distribution</em>. From this perspective, one can similarly compute model weights $w_k$</p>
<script type="math/tex; mode=display">w_k = \frac{\text{exp}\left(\widehat{\text{elpd}}^{\mathcal{M}_k}_{\text{loo}}\right)}{\sum_{i=1}^K \text{exp}\left(\widehat{\text{elpd}}^{\mathcal{M}_i}_{\text{loo}}\right)} \enspace ,</script>
<p>where $\widehat{\text{elpd}}_{\text{loo}}^{\mathcal{M}_k}$ is the loo estimate for the expected log predictive density for model $\mathcal{M}_k$. For prediction, one averages across models using these <em>Pseudo Bayesian model-averaging</em> weights (Yao et al., 2018, p. 92).</p>
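<p>As a sketch with hypothetical loo estimates for our two models, these weights are a softmax on the elpd scale:</p>

```r
# Sketch: Pseudo-BMA weights from hypothetical loo estimates for two models
elpd_loo <- c(power = -350.2, exponential = -352.8)  # hypothetical values

# Softmax on the elpd scale, stabilized by subtracting the maximum
w <- exp(elpd_loo - max(elpd_loo))
w <- w / sum(w)
round(w, 3)
```

<p>The model with the higher estimated elpd gets the larger weight.</p>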
<p>However, Yao et al. (2018) and Vehtari et al. (2019) recommend against using these Pseudo BMA weights, as they do not take the uncertainty of the loo estimates into account. Instead, they suggest using Pseudo-BMA+ weights or stacking. For details, see Yao et al. (2018).</p>
<p>For an illuminating discussion about model selection based on marginal likelihoods or leave-one-out cross-validation, see Gronau & Wagenmakers (<a href="https://link.springer.com/article/10.1007/s42113-018-0011-7">2019a</a>), Vehtari et al. (<a href="https://link.springer.com/article/10.1007/s42113-018-0020-6">2019</a>), and Gronau & Wagenmakers (<a href="https://link.springer.com/article/10.1007/s42113-018-0022-4">2019b</a>).</p>
<p>Now that we have taken a look at these two perspectives on prediction, in the next section, we will implement the power law and the exponential model in Stan.</p>
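<p>To make the loo estimate concrete before moving to Stan, here is a toy computation — a plug-in sketch rather than a full Bayesian loo — for a simple normal model with known standard deviation:</p>

```r
# Toy leave-one-out estimate of the expected log predictive density for a
# normal model with known sd = 1, plugging in the mean of y_{-i}
set.seed(1)
y <- rnorm(50, mean = 2, sd = 1)

loo_lpd <- sapply(seq_along(y), function(i) {
  mu_hat <- mean(y[-i])  # estimate from all but the i-th observation
  dnorm(y[i], mean = mu_hat, sd = 1, log = TRUE)
})

elpd_loo_hat <- mean(loo_lpd)
elpd_loo_hat
```

<p>In practice, for Stan models this integral over the posterior is estimated from the pointwise log-likelihood draws (e.g., via the loo package), rather than with a plug-in estimate as above.</p>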
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{PSBF}_{10} &= \frac{\prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_1)}{\prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_0)} \\[1em] -->
<!-- &= \text{exp}\left(\text{log} \, \prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_1) - \text{log} \, \prod_{i=1}^n p(y_i \mid y_{-i}, \mathcal{M}_0)\right) \\[.5em] -->
<!-- &= \text{exp}\left(\sum_{i=1}^n \text{log} \, p(y_i \mid y_{-i}, \mathcal{M}_1) - \sum_{i=1}^n \text{log} \, p(y_i \mid y_{-i}, \mathcal{M}_0)\right) \\[.5em] -->
<!-- &= \text{exp}\left(\hat{\text{elpd}}^{\mathcal{M}_1}_{\text{loo}} - \hat{\text{elpd}}^{\mathcal{M}_0}_{\text{loo}} \right) \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<h1 id="implementation-in-stan">Implementation in Stan</h1>
<p>As is common in psychology, people do not deterministically follow a power law or an exponential law. Instead, the law is probabilistic: given the same task, the person will respond faster or slower, never exactly as before. To allow for this, we assume that there is Gaussian noise around the function value. In particular, we assume that</p>
<script type="math/tex; mode=display">\text{RT}_N \sim \mathcal{N}\left(\alpha + \beta N^{-r}, \sigma_e^2\right) \enspace .</script>
<p>Note that reaction times are not actually normally distributed; we address this later. We make the same assumption for the exponential model. The following code implements the power law model in Stan.</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">// In the data block, we specify everything that is relevant for
// specifying the data. Note that PRIOR_ONLY is a dummy variable used later.
data {
  int<lower=1> n;
  real y[n];
  int<lower=0, upper=1> PRIOR_ONLY;
}

// In the parameters block, we specify all parameters we need.
// Although Stan implicitly adds flat priors on the (positive) real line,
// we will specify informative priors below.
parameters {
  real<lower=0> r;
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> sigma_e;
}

// In the model block, we specify our informative priors and
// the likelihood of the model, unless we want to sample only from
// the prior (i.e., if PRIOR_ONLY == 1)
model {
  target += lognormal_lpdf(alpha | 0, .5);
  target += lognormal_lpdf(beta | 1, .5);
  target += gamma_lpdf(r | 1, 3);
  target += gamma_lpdf(sigma_e | 0.5, 5);

  if (PRIOR_ONLY == 0) {
    for (trial in 1:n) {
      target += normal_lpdf(y[trial] | alpha + beta * trial^(-r), sigma_e);
    }
  }
}

// In this block, we make posterior predictions (ypred) and compute
// the log likelihood of each data point (log_lik),
// which is needed for the computation of loo later
generated quantities {
  real ypred[n];
  real log_lik[n];
  for (trial in 1:n) {
    ypred[trial] = normal_rng(alpha + beta * trial^(-r), sigma_e);
    log_lik[trial] = normal_lpdf(y[trial] | alpha + beta * trial^(-r), sigma_e);
  }
}</code></pre></figure>
<p>From a marginal likelihood perspective, the prior is an integral part of the model; this means we have to think very carefully about it. There are several principles that can guide us (see also Lee & Vanpaemel, 2018), but one that is particularly helpful here is to look at the prior predictive distribution. Do draws from the prior predictive distribution look like what we had in mind? Below, I have visualized the mean, the standard deviation around the mean, and several draws from it for (a) flat priors on the positive real line, and (b) informed priors that I chose based on reading Evans et al. (2018). In the Stan code, you can specify flat priors by commenting out the priors we have specified in the model block.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>The Figure on the left shows that flat priors make terrible predictions. The mean of the prior predictive distribution is a constant function at zero, which is not at all what we had in mind when writing down the power law model. Even worse, flat priors allow for negative reaction times, something that is clearly impossible! In contrast, the Figure on the right seems reasonable. Below I have visualized the informed priors.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
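<p>As a sketch, the informed prior predictive draws can be reproduced outside of Stan with a short base R simulation, using the same priors as in the model block above (the number of trials and draws are arbitrary choices for illustration):</p>

```r
# Prior predictive simulation for the power law model, using the informed
# priors from the Stan model above (Gamma priors parameterized by shape, rate)
set.seed(1)
N <- seq(80)  # number of practice trials

prior_pred <- replicate(100, {
  alpha   <- rlnorm(1, meanlog = 0, sdlog = .5)
  beta    <- rlnorm(1, meanlog = 1, sdlog = .5)
  r       <- rgamma(1, shape = 1, rate = 3)
  sigma_e <- rgamma(1, shape = 0.5, rate = 5)
  rnorm(length(N), mean = alpha + beta * N^(-r), sd = sigma_e)
})

dim(prior_pred)  # 80 trials x 100 prior draws
```

<p>Each column is one draw from the prior predictive distribution; plotting a handful of columns against $N$ reproduces the kind of curves shown in the right panel above.</p>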
<p>From a cross-validation perspective, priors do not matter <em>that</em> much; prediction is conditional on the observed data, and so the prior is transformed to a posterior before the model makes predictions. If the prior is not too misspecified, or if we have a sufficient amount of data so that the prior has only a weak influence on the posterior, a model’s posterior predictions will not markedly depend on it.</p>
<h1 id="practical-model-comparison">Practical model comparison</h1>
<p>Let’s check whether we select the correct model for the power law and the exponential data, respectively. I have generated the data above using the following code, which will make sense later in the blog post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sim_power</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sdlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="o">^</span><span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="p">)),</span><span class="w"> </span><span class="n">sdlog</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">sim_exp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sdlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="o">*</span><span class="n">N</span><span class="p">)),</span><span class="w"> </span><span class="n">sdlog</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">30</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sim_power</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="n">xe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sim_exp</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">.2</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span></code></pre></figure>
<p>We use the <em>bridgesampling</em> package to estimate Bayes factors, and the <em>loo</em> package to compute loo scores and stacking weights.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'loo'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'bridgesampling'</span><span class="p">)</span><span class="w">
</span><span class="n">comp_power</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s1">'stan-compiled/compiled-power-model.RDS'</span><span class="p">)</span><span class="w">
</span><span class="n">comp_exp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s1">'stan-compiled/compiled-exponential-model.RDS'</span><span class="p">)</span><span class="w">
</span><span class="c1"># power model data</span><span class="w">
</span><span class="n">fit_pp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_power</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xp</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xp</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># exponential data</span><span class="w">
</span><span class="n">fit_pe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_power</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xe</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ee</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xe</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>But before we do so, let’s visualize the posterior predictions of both models for each simulated data set, respectively.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>In the figure on the left, we see that the posterior predictions of the exponential model have a larger variance than those of the power law model. Conversely, on the right it seems that the exponential model gives the better predictions.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
<p>We first compare the two models on the power law data, using the Bayes factor and loo. With the former, we find overwhelming evidence in favour of the power law model.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ep</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 54950.14619</code></pre></figure>
<p>Note that this estimate can vary with different runs (since we were stingy in sampling from the posterior). The comparison using loo yields the following output:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_pp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo_pp</span><span class="p">,</span><span class="w"> </span><span class="n">loo_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -14.2 5.4</code></pre></figure>
<p>Note that the best model is always listed on top, and that the output already reports differences in elpd rather than the raw scores. Following a two standard error heuristic (but see <a href="https://discourse.mc-stan.org/t/interpreting-output-from-compare-of-loo/3380/4">here</a>), since the difference in elpd is more than twice its standard error, we would choose the power law model as the better model. <strong>But wait</strong> – what do these warnings mean? Let’s look at the output of the loo function for the exponential model (suppressing the warning):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_ep</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Computed from 16000 by 30 log-likelihood matrix
##
## Estimate SE
## elpd_loo 22.9 5.7
## p_loo 7.4 3.9
## looic -45.8 11.5
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 28 93.3% 6801
## (0.5, 0.7] (ok) 1 3.3% 369
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 1 3.3% 9
## See help('pareto-k-diagnostic') for details.</code></pre></figure>
<p>We find that there is one very bad Pareto $k$ value. What does this mean? There are many details about loo that you can read up on in Vehtari et al. (2018). Put briefly, to efficiently compute the loo score of a model, Vehtari et al. (2018) use <em>importance sampling</em> to approximate the leave-one-out predictive density, which requires computing importance weights. These raw weights are known to be unstable, and the authors introduce a particular stabilizing transformation which they call “Pareto smoothed importance sampling” (PSIS). The parameter $k$ is the shape parameter of the (generalized) Pareto distribution fitted to the largest weights. If it is high, such as $k > 0.7$ as we find for one data point here, then this implies unstable estimates — we should probably not trust the loo estimate for this model.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup> Similar pathological behaviour can be diagnosed with p_loo, which gives an estimate of the effective number of parameters. In this case, it is about double the number of actual parameters ($\alpha, \beta, r, \sigma_e^2$).</p>
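<p>A minimal sketch of the underlying problem, not of the PSIS algorithm itself: when the log importance ratios are heavy-tailed, a handful of draws can dominate the normalized weights, which is exactly the situation a large Pareto $k$ flags. The heavy-tailed distribution below is a hypothetical stand-in for such log-ratios.</p>

```r
# Hypothetical heavy-tailed log importance ratios (t with 2 degrees of freedom)
set.seed(1)
log_ratios <- rt(4000, df = 2)

# Exponentiate stably and normalize to get importance weights
w <- exp(log_ratios - max(log_ratios))
w <- w / sum(w)

max(w)        # a single draw carries a large share of the total weight
1 / sum(w^2)  # effective sample size, far below the nominal 4000 draws
```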
<p>The two figures below visualize the $k$ values for each data point. We see that loo has trouble predicting the first data point for both the exponential and power law model.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>We can also compute the stacking weights; but again, one should probably not trust these estimates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fit1</span><span class="p">,</span><span class="w"> </span><span class="n">fit2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_lik_list</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="s1">'1'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit1</span><span class="p">),</span><span class="w">
</span><span class="s1">'2'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">r_eff_list</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="s1">'1'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">relative_eff</span><span class="p">(</span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit1</span><span class="p">,</span><span class="w"> </span><span class="n">merge_chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)),</span><span class="w">
</span><span class="s1">'2'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">relative_eff</span><span class="p">(</span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit2</span><span class="p">,</span><span class="w"> </span><span class="n">merge_chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">loo_model_weights</span><span class="p">(</span><span class="n">log_lik_list</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'stacking'</span><span class="p">,</span><span class="w"> </span><span class="n">r_eff_list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_eff_list</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.995
## 2 0.005</code></pre></figure>
<p>We could compute a “Stacked Pseudo Bayes factor” by taking the ratio of these two weights to see how much more weight one model is given compared to the other.<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> This yields a factor of about 120 in favour of the power law model.</p>
<p>We can make the same comparisons using the exponential data. The Bayes factor again yields overwhelming support for the model that is closer to the true model, i.e., in this case the exponential model.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup> But note that the evidence is an order of magnitude smaller than in the above comparison.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_pe</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 1024.51382</code></pre></figure>
<p>This marked decrease in evidence is also tracked by loo, which now tells us that we cannot reliably distinguish between the two models<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_pe</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -8.3 5.6</code></pre></figure>
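<p>As a quick sanity check, the two standard error heuristic can be applied to both loo_compare tables, with the elpd differences and standard errors copied from the printed output:</p>

```r
# Ratio of |elpd_diff| to se_diff for each comparison
abs(-14.2) / 5.4  # power law data: about 2.6, i.e. more than two standard errors
abs(-8.3) / 5.6   # exponential data: about 1.5, i.e. less than two standard errors
```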
<p>Note that while we still get warnings, this time we only have one data point with $k \in [0.7, 1]$, which is bad, but not very bad. The stacking weights show that there is not a clear winner, with a factor of only about 6 in favour of the exponential model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">fit_pe</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.847
## 2 0.153</code></pre></figure>
<p>Now that we have seen how one might compare these models in practice, in the next two sections, we will see how we can extend the model to be more realistic. To that end, I will focus only on the exponential model. In fact, the exponential model is what many researchers now prefer as the “law of practice”. In a very influential article, Newell & Rosenbloom (1981) found that a power law fit best for data from a wide variety of tasks. While they relied on averaged data, Heathcote et al. (2000) looked at participant-specific data. They found that the decrease in reaction time follows an exponential function, implying that previous results were biased due to averaging; in fact, one can show that the (arithmetic) averaging of many exponential (i.e., non-linear) functions can lead to a group-level power law when individual differences exist (Myung, Kim, & Pitt, 2000).</p>
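<p>The averaging argument can be made concrete with a small simulation. Assuming, purely for illustration, that individual learning rates follow a Gamma distribution $r \sim \text{Gamma}(a, b)$, the group mean of the exponential curves is $\mathbb{E}[e^{-rN}] = \left(b / (b + N)\right)^a$ — a power function of $N$ — by the moment-generating function of the Gamma distribution. The parameter values below are arbitrary.</p>

```r
# Averaging exponential curves across individuals with r ~ Gamma(a, b)
set.seed(1)
a <- 2; b <- 5  # hypothetical individual-difference parameters
r <- rgamma(1e5, shape = a, rate = b)  # one learning rate per simulated participant

N <- c(1, 5, 10, 25)
group_avg  <- sapply(N, function(n) mean(exp(-r * n)))  # simulated group-level curve
power_form <- (b / (b + N))^a                           # analytic power function

round(group_avg - power_form, 3)  # agree up to Monte Carlo error
```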
<h1 id="extension-i-modeling-plateaus">Extension I: Modeling plateaus</h1>
<p>The above two models assume that participants <em>get it</em> from the first trial onwards, and become better immediately. However, real data often exhibits a <em>plateau</em>. We simulate such data using the following lines of code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="n">rlnorm</span><span class="p">(</span><span class="m">15</span><span class="p">,</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">),</span><span class="w">
</span><span class="n">sim_exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-17-1.png" title="plot of chunk unnamed-chunk-17" alt="plot of chunk unnamed-chunk-17" style="display: block; margin: auto;" /></p>
<p>How can we model this? Evans et al. (2018) suggest introducing a single parameter, $\tau$, and adjusting the model as follows:</p>
<script type="math/tex; mode=display">f_e(N) = \alpha + \beta \, \frac{\tau + 1}{\tau + e^{rN}} \enspace .</script>
<p>Observe that for $\tau = 0$, we recover the original exponential model. For $\tau \rightarrow \infty$, the function becomes a constant function $\alpha + \beta$. Thus, large values for $\tau$ (together with large values for $r$, so as to model the steep drop in reaction time) allow us to capture the initial plateau we find in real data.</p>
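<p>As a quick check of these limiting cases, we can evaluate the mean function directly; the parameter values are arbitrary.</p>

```r
# Delayed exponential mean function from the equation above
f_delayed <- function(N, alpha, beta, r, tau) {
  alpha + beta * (tau + 1) / (tau + exp(r * N))
}

N <- 1:10

# tau = 0 recovers the plain exponential model alpha + beta * exp(-r * N)
all.equal(f_delayed(N, 1, 2, r = .5, tau = 0), 1 + 2 * exp(-.5 * N))

# a very large tau keeps the first trials close to alpha + beta, i.e. a plateau
round(f_delayed(1:4, 1, 2, r = 2, tau = 1e4), 3)
```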
<p>We can adjust the model in Stan easily:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">data {
int<lower=1> n;
real y[n];
int<lower=0, upper=1> PRIOR_ONLY;
}
parameters {
real<lower=0> r;
real<lower=0> alpha;
real<lower=0> beta;
real<lower=0> sigma_e;
real<lower=0> tau;
}
model {
real mu;
// Renormalize Cauchy prior due to truncation to get correct marginal likelihood
target += cauchy_lpdf(tau | 0, 1) - cauchy_lccdf(0 | 0, 1);
target += lognormal_lpdf(alpha | 0, .5);
target += lognormal_lpdf(beta | 1, .5);
target += gamma_lpdf(r | 1, 3);
target += gamma_lpdf(sigma_e | 0.5, 5);
if (PRIOR_ONLY == 0) {
for (trial in 1:n) {
mu = alpha + beta * (tau + 1) / (tau + exp(r * trial));
target += normal_lpdf(y[trial] | mu, sigma_e);
}
}
}
generated quantities {
real mu;
real ypred[n];
real log_lik[n];
for (trial in 1:n) {
mu = alpha + beta * (tau + 1) / (tau + exp(r * trial));
ypred[trial] = normal_rng(mu, sigma_e);
log_lik[trial] = normal_lpdf(y[trial] | mu, sigma_e);
}
}</code></pre></figure>
<p>We have put a half-Cauchy prior on $\tau$. This is because the Cauchy distribution has very fat tails, compared to for example the Normal or the Laplace distribution; see Figure below. This is desired, because we need large $\tau$ values to accommodate plateaus.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-19-1.png" title="plot of chunk unnamed-chunk-19" alt="plot of chunk unnamed-chunk-19" style="display: block; margin: auto;" /></p>
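<p>The tail behaviour shown in the figure above can also be quantified. For instance, the probability mass beyond 10 — a region we need for large plateaus — is appreciable under a standard half-Cauchy but essentially zero under a standard half-Normal:</p>

```r
# Tail mass beyond 10 for half-Cauchy(0, 1) versus half-Normal(0, 1)
p_half_cauchy <- 2 * pcauchy(10, lower.tail = FALSE)
p_half_normal <- 2 * pnorm(10, lower.tail = FALSE)

c(p_half_cauchy, p_half_normal)  # roughly 0.063 versus about 1.5e-23
```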
<p>The figure below compares the prior predictive distributions of the exponential model to those of the <em>delayed exponential</em> model. As we can see, the additional $\tau$ parameter creates larger uncertainty in the predictions, with some individual draws looking completely different from each other. <a href="https://betanalpha.github.io/assets/case_studies/fitting_the_cauchy.html">Drawing samples from a Cauchy</a>, whose mean and variance are <a href="https://en.wikipedia.org/wiki/Cauchy_distribution#Explanation_of_undefined_moments">undefined</a>, is tricky.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto;" /></p>
<p>However, if we compare the two models to each other on the plateau data set, we see that the extended model predicts the data better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit_ee</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ee_plateau</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp_delayed</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 304.35681</code></pre></figure>
<p>This is a large Bayes factor. However, loo seems to favour the delayed exponential model much more, by about three standard errors:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -10.4 3.3</code></pre></figure>
<p>This is also reflected in an extreme difference in stacking weights, which completely discounts the standard exponential model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ee</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 1.000
## 2 0.000</code></pre></figure>
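<p>As an aside, the simpler Pseudo-BMA weights mentioned earlier can be computed directly from such elpd estimates, with $w_k \propto \text{exp}(\widehat{\text{elpd}}_{\text{loo}}^{\mathcal{M}_k})$; stacking, used above, instead solves an optimization problem over the pointwise predictive densities. Below is a minimal sketch in Python (not the R code used in this post), plugging in the elpd differences from the loo output above:</p>

```python
import numpy as np

def pseudo_bma_weights(elpds):
    """Pseudo-BMA weights: w_k proportional to exp(elpd_k).
    Subtracting the maximum avoids numerical under/overflow."""
    elpds = np.asarray(elpds, dtype=float)
    w = np.exp(elpds - elpds.max())
    return w / w.sum()

# elpd differences from loo_compare: 0.0 (delayed exponential) and -10.4
w = pseudo_bma_weights([0.0, -10.4])
print(w.round(5))  # nearly all weight goes to the delayed exponential model
```

<p>Note that these weights ignore the standard errors of the elpd estimates, which is precisely why Yao et al. (2018) recommend Pseudo-BMA+ weights or stacking instead.</p>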
<p>Earlier, when comparing the exponential and power model on data generated from an exponential model, we found a Bayes factor in favour of the exponential model of about 1000, while the difference in loo was only about 1.5 standard errors. Here, we now find a Bayes factor in favour of the delayed exponential model of only about 300, while loo finds a difference of about three standard errors. This contrast is illuminated by visualizing the posterior predictive distribution; see below.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-24-1.png" title="plot of chunk unnamed-chunk-24" alt="plot of chunk unnamed-chunk-24" style="display: block; margin: auto;" /></p>
<p>We see that the delayed exponential model seems to “fit” the data much better.<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup> Since loo basically uses this posterior predictive distribution (except that it removes one data point) to predict individual data points, it now seems clearer why it should favour the delayed exponential model so much more strongly. In contrast, the prior predictive distributions of the exponential and the delayed exponential (visualized above) are rather similar. Since the Bayes factor evaluates the probability of the data given these prior predictive distributions, the evidence in favour of the delayed exponential model is only modest.</p>
<h1 id="extension-ii-modeling-reaction-times">Extension II: Modeling reaction times</h1>
<h2 id="the-lognormal-distribution">The Lognormal distribution</h2>
<p>Reaction times are well known to be non-normally distributed. One popular distribution for them is the <em>Lognormal distribution</em>. Let $X \sim \mathcal{N}(\mu, \sigma^2)$. Then $Z = e^X$ is lognormally distributed. To arrive at its density function, we do a change of variables. Observe that</p>
<script type="math/tex; mode=display">P_z(Z \leq z) = P_z\left(e^X \leq z\right) = P_x(X \leq \text{log}\,z) = F_x(\text{log}\,z) \enspace ,</script>
<p>where $F_x$ is the cumulative distribution function of $X$. Differentiating with respect to $z$ yields the probability density function for $Z$:</p>
<script type="math/tex; mode=display">p_z(z) = \frac{\mathrm{d} P_z(Z \leq z)}{\mathrm{d} z} = \frac{\mathrm{d} F_x(\text{log}\,z)}{\mathrm{d} z} = p_x(\text{log}\,z) \left|\frac{\mathrm{d}\,\text{log}\,z}{\mathrm{d}z}\right| = p_x(\text{log}\,z) \frac{1}{z} \enspace ,</script>
<p>which spelled out is</p>
<script type="math/tex; mode=display">p_z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2} (\text{log}\,z - \mu)^2\right) \frac{1}{z} \enspace .</script>
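<p>As a quick sanity check of this change-of-variables result, the density above can be compared against a library implementation; a minimal sketch in Python (not the post's R code), noting that scipy parameterizes the Lognormal by <code>s</code> $= \sigma$ and <code>scale</code> $= e^{\mu}$:</p>

```python
import numpy as np
from scipy import stats

def lognormal_pdf(z, mu, sigma):
    """Density derived via change of variables: the Normal density
    evaluated at log(z), times the Jacobian factor 1/z."""
    return (1 / np.sqrt(2 * np.pi * sigma**2)
            * np.exp(-(np.log(z) - mu)**2 / (2 * sigma**2)) / z)

z = np.linspace(0.1, 5, 50)
mu, sigma = 1.0, 0.5
ours = lognormal_pdf(z, mu, sigma)
ref = stats.lognorm.pdf(z, s=sigma, scale=np.exp(mu))
print(np.allclose(ours, ref))  # True
```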
<p>The figure below visualizes various Lognormal distributions with different parameters $\mu$ and $\sigma^2$. The figure on the left shows how a change in $\sigma^2$ affects the distribution while keeping $\mu = 1$. You can see that the “peak” of the distribution changes, indicating that the effect of $\mu$ is not independent of $\sigma^2$. In fact, while the mean and variance of a Normal distribution are given by $\mu$ and $\sigma^2$, respectively, this is not so for a Lognormal distribution. It seems difficult to compute the first two moments of the Lognormal distribution directly (you can try if you want!). However, there is a neat trick to compute <em>all</em> its moments basically instantaneously.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-25-1.png" title="plot of chunk unnamed-chunk-25" alt="plot of chunk unnamed-chunk-25" style="display: block; margin: auto;" /></p>
<p>To do this, observe that the <a href="https://en.wikipedia.org/wiki/Moment-generating_function"><em>moment generating function</em></a> (MGF) of a random variable $X$ is given by</p>
<script type="math/tex; mode=display">M_X(t) := \mathbb{E}\left[e^{tX}\right] \enspace .</script>
<p>Now, we’re not going to use the MGF of the Lognormal — in fact, because the integral diverges, it does not exist. Instead, we’ll use the MGF of a Normal distribution. Since $Z = e^X$, we can write the $t^{\text{th}}$ moment such that</p>
<script type="math/tex; mode=display">\mathbb{E}\left[Z^t\right] = \mathbb{E}\left[e^{tX}\right] = M_X(t) = \text{exp}\left(t\mu + \frac{1}{2}t^2 \sigma^2\right) \enspace ,</script>
<p>where the last term is the MGF of a Normal distribution (see also Blitzstein & Hwang, 2014, pp. 260-261). Thus, the mean and variance of a Lognormal distribution are given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[Z] &= \text{exp}\left(\mu + \frac{1}{2} \sigma^2\right) \\[.5em]
\text{Var}[Z] &= \mathbb{E}[Z^2] - \mathbb{E}[Z]^2 \\[.5em]
&= \text{exp}\left(2\mu + 2\sigma^2\right) - \text{exp}\left(2\mu + \sigma^2\right) \\[.5em]
&= \text{exp}\left(2\mu + \sigma^2\right) \left(\text{exp}\left(\sigma^2\right) - 1 \right) \enspace .
\end{aligned} %]]></script>
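<p>A quick Monte Carlo check of these two closed-form moments; a minimal sketch in Python (not the post's R code, and the parameter values are illustrative):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma = 1.0, 0.5

# Simulate Z = e^X with X ~ N(mu, sigma^2)
z = np.exp(rng.normal(mu, sigma, size=1_000_000))

# Closed-form mean and variance from the MGF derivation above
mean_closed = np.exp(mu + sigma**2 / 2)
var_closed = np.exp(2 * mu + sigma**2) * (np.exp(sigma**2) - 1)

print(z.mean(), mean_closed)  # should agree to about two decimals
print(z.var(), var_closed)
```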
<p>This dependency between mean and variance is desirable. In particular, it is well established that changes in mean reaction times are accompanied by proportional changes in the standard deviation (Wagenmakers & Brown, 2007).</p>
<p><a href="https://fdabl.github.io/statistics/Two-Properties.html">In contrast</a> to the Normal distribution, the Lognormal distribution is not closed under addition. This means that if $Z$ has a Lognormal distribution, $Z + \delta$ does not necessarily have a Lognormal distribution anymore. However, we are interested in modeling <em>shifts</em> in reaction times. For example, there is a reaction time $\alpha$ faster than which participants cannot meaningfully respond. To allow for such shifts, we expand the Lognormal distribution by a parameter $\delta$ such that $Z = \delta + e^{X}$.<sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup> This leads to the following density function:</p>
<script type="math/tex; mode=display">p_z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2} (\text{log}\,(z - \delta) - \mu)^2\right) \frac{1}{z - \delta} \enspace .</script>
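<p>Since $Z - \delta$ is an ordinary Lognormal random variable, this shifted density can again be checked against a library implementation; a minimal sketch in Python (not the post's R code), where scipy's <code>loc</code> parameter plays the role of $\delta$:</p>

```python
import numpy as np
from scipy import stats

def shifted_lognormal_pdf(z, delta, mu, sigma):
    """Density of Z = delta + e^X with X ~ N(mu, sigma^2)."""
    zs = z - delta  # only defined for z > delta
    return (1 / np.sqrt(2 * np.pi * sigma**2)
            * np.exp(-(np.log(zs) - mu)**2 / (2 * sigma**2)) / zs)

z = np.linspace(1.0, 5.0, 50)
delta, mu, sigma = 0.5, 0.0, 0.4
ours = shifted_lognormal_pdf(z, delta, mu, sigma)
ref = stats.lognorm.pdf(z, s=sigma, loc=delta, scale=np.exp(mu))
print(np.allclose(ours, ref))  # True
```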
<h2 id="extending-the-model">Extending the model</h2>
<p>We extend the delayed exponential model such that</p>
<script type="math/tex; mode=display">\text{RT} \sim \text{Shifted-Lognormal}\left(\delta, \, \text{log}\left(\alpha' + \beta \frac{\tau + 1}{\tau + e^{rN}}\right), \, \sigma \right) \enspace.</script>
<p>The median of a Shifted-Lognormal distribution is given by $\delta + e^\mu$, which is why we log the main part of the model above. Note that the previous asymptote $\alpha$ is now $\delta + \alpha’$. To be on the same scale as before, we assign both $\delta$ and $\alpha’$ a Lognormal prior with median $0.50$.</p>
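<p>The claim about the median is easy to verify numerically; a minimal sketch in Python (not the post's R code, and the parameter values are illustrative):</p>

```python
import numpy as np
from scipy import stats

delta, mu, sigma = 0.5, np.log(0.5), 0.4

# Median of a Shifted-Lognormal: delta + exp(mu)
median_closed = delta + np.exp(mu)
median_scipy = stats.lognorm.median(sigma, loc=delta, scale=np.exp(mu))
print(median_closed, median_scipy)  # both equal 1.0 here
```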
<p>We can implement the model in Stan by writing a Shifted-Lognormal probability density function:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">// In the functions block, we define our new probability density function
functions {
  real shiftlognormal_lpdf(real z, real delta, real mu, real sigma) {
    real lprob;
    lprob = (
      -log((z - delta)*sigma*sqrt(2*pi())) -
      (log(z - delta) - mu)^2 / (2*sigma^2)
    );
    return lprob;
  }

  real shiftlognormal_rng(real delta, real mu, real sigma) {
    return delta + lognormal_rng(mu, sigma);
  }
}

// In the data block, we specify everything that is relevant for
// specifying the data. Note that PRIOR_ONLY is a dummy variable used later.
data {
  int<lower=1> n;
  real y[n];
  int<lower=0, upper=1> PRIOR_ONLY;
}

// In the parameters block, we specify all parameters we need.
// Although Stan implicitly adds a flat prior on the (positive) real line,
// we will specify informative priors below.
parameters {
  real<lower=0> r;
  real<lower=0> tau;
  real<lower=0> delta;
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> sigma_e;
}

// In the model block, we specify our informative priors and
// the likelihood of the model, unless we want to sample only from
// the prior (i.e., if PRIOR_ONLY == 1)
model {
  real mu;

  // Renormalize Cauchy prior due to truncation to get correct marginal likelihood
  target += cauchy_lpdf(tau | 0, 1) - cauchy_lccdf(0 | 0, 1);

  target += lognormal_lpdf(delta | log(0.50), .5);
  target += lognormal_lpdf(alpha | log(0.50), .5);
  target += lognormal_lpdf(beta | 1, .5);
  target += gamma_lpdf(r | 1, 3);
  target += gamma_lpdf(sigma_e | 0.5, 5);

  if (PRIOR_ONLY == 0) {
    for (trial in 1:n) {
      mu = log(alpha + beta * (tau + 1) / (tau + exp(r*trial)));
      target += shiftlognormal_lpdf(y[trial] | delta, mu, sigma_e);
    }
  }
}

// In this block, we make posterior predictions (ypred) and compute
// the log likelihood of each data point given all the others (log_lik),
// which is needed for the computation of loo later
generated quantities {
  real mu;
  real ypred[n];
  real log_lik[n];

  for (trial in 1:n) {
    mu = log(alpha + beta * (tau + 1) / (tau + exp(r*trial)));
    ypred[trial] = shiftlognormal_rng(delta, mu, sigma_e);
    log_lik[trial] = shiftlognormal_lpdf(y[trial] | delta, mu, sigma_e);
  }
}</code></pre></figure>
<p>Visualizing the prior predictive distribution offers only limited insight into how these two models differ:</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-27-1.png" title="plot of chunk unnamed-chunk-27" alt="plot of chunk unnamed-chunk-27" style="display: block; margin: auto;" /></p>
<p>The Bayes factor in favour of the lognormal model is quite large:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit_ee_plateau_lognormal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp_delayed_log</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">adapt_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.9</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 198.49106</code></pre></figure>
<p>In contrast, loo shows barely any evidence: we would <em>not</em> choose the lognormal model, but rather remain undecided. (The warning is because we have one $k \in [0.5, 0.7]$.)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -5.1 3.3</code></pre></figure>
<p>The stacking weights likewise strongly favour the lognormal model, assigning the Gaussian model a weight of only about $0.02$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ee_plateau</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.983
## 2 0.017</code></pre></figure>
<p>The contrast between the Bayes factor and loo can again be illuminated by looking at the posterior predictive distributions of the respective models; see below. The lognormal model can account for the decrease in variance with increased practice. Still, the predictions of both models are rather similar, so there is little to distinguish them using loo.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-31-1.png" title="plot of chunk unnamed-chunk-31" alt="plot of chunk unnamed-chunk-31" style="display: block; margin: auto;" /></p>
<h1 id="modeling-recap">Modeling recap</h1>
<p>Focusing on the exponential model, we have successively made our modeling more sophisticated (see also Evans et al., 2018). Barring priors, we have encountered the following models:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(1) \hspace{1em} &\text{RT} \sim \mathcal{N}\left(\alpha + \beta e^{-rN}, \sigma^2 \right) \\[1em]
(2) \hspace{1em} &\text{RT} \sim \mathcal{N}\left(\alpha + \beta \, \frac{\tau + 1}{\tau + e^{rN}}, \sigma^2 \right) \\[.5em]
(3) \hspace{1em} &\text{RT} \sim \text{Shifted-Lognormal}\left(\delta, \, \text{log}\left(\alpha' + \beta \frac{\tau + 1}{\tau + e^{rN}}\right), \, \sigma \right) \enspace .
\end{aligned} %]]></script>
<p>We went from (1) to (2) to account for learning plateaus; sometimes, participants take a while for learning to really kick in. We went from (2) to (3) to account for the fact that reaction times are decidedly nonnormal, and that there is a linear relationship between the mean and the standard deviation of reaction times.</p>
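<p>To build intuition for the plateau extension in (2), one can evaluate its mean function for a few values of $\tau$; a minimal sketch in Python (not the post's R code, and the parameter values are illustrative):</p>

```python
import numpy as np

def delayed_exponential(N, alpha, beta, r, tau):
    """Mean function of model (2); tau controls the initial learning plateau."""
    return alpha + beta * (tau + 1) / (tau + np.exp(r * N))

N = np.arange(1, 51)
for tau in [0, 5, 50]:
    y = delayed_exponential(N, alpha=0.3, beta=1.0, r=0.2, tau=tau)
    print(tau, y[0].round(3), y[-1].round(3))
```

<p>For $\tau = 0$, the fraction reduces to $e^{-rN}$ and we recover model (1); larger values of $\tau$ delay the onset of learning.</p>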
<p>We have also compared the power law to the exponential function, but so far we have looked only at simulated data. Since this blog post is already quite lengthy, we defer the treatment of real data to a future blog post.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we took a closer look at what has been dubbed the <em>law of practice</em>: the empirical fact that reaction time decreases as an exponential function (previously: power law) with practice. We compared two perspectives on prediction: one based on marginal likelihoods, and one based on leave-one-out cross-validation. The latter allows the model to gorge on data, update its parameters, and then make predictions based on the <em>posterior predictive distribution</em>, while the former forces the model to make predictions using the <em>prior predictive distribution</em>. We have implemented the power law and exponential model in Stan, and extended the latter to model an initial learning plateau and account for the empirical observation that not only mean reaction time decreases, but also its variance.</p>
<hr />
<p><em>I would like to thank Nathan Evans, Quentin Gronau, and Don van den Bergh for helpful comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Blitzstein, J. K., & Hwang, J. (<a href="https://projects.iq.harvard.edu/stat110/home">2014</a>). <em>Introduction to Probability</em>. London, UK: Chapman and Hall/CRC.</li>
<li>Evans, N. J., Brown, S. D., Mewhort, D. J., & Heathcote, A. (<a href="https://psycnet.apa.org/record/2018-30695-005">2018</a>). Refining the law of practice. <em>Psychological Review, 125</em>(4), 592-605.</li>
<li>Fong, E., & Holmes, C. (<a href="https://arxiv.org/abs/1905.08737">2019</a>). On the marginal likelihood and cross-validation. arXiv preprint arXiv:1905.08737.</li>
<li>Gelman, A., & Shalizi, C. R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8317.2011.02037.x">2013</a>). Philosophy and the practice of Bayesian statistics. <em>British Journal of Mathematical and Statistical Psychology, 66</em>(1), 8-38.</li>
<li>Gronau, Q. F., & Wagenmakers, E. J. (<a href="https://link.springer.com/article/10.1007/s42113-018-0011-7">2019a</a>). Limitations of Bayesian leave-one-out cross-validation for model selection. <em>Computational Brain & Behavior, 2</em>(1), 1-11.</li>
<li>Gronau, Q. F., & Wagenmakers, E. J. (<a href="https://link.springer.com/article/10.1007/s42113-018-0022-4">2019b</a>). Rejoinder: More limitations of Bayesian leave-one-out cross-validation. <em>Computational Brain & Behavior, 2</em>(1), 35-47.</li>
<li>Heathcote, A., Brown, S., & Mewhort, D. J. K. (<a href="https://link.springer.com/article/10.3758/BF03212979">2000</a>). The power law repealed: The case for an exponential law of practice. <em>Psychonomic bulletin & Review, 7</em>(2), 185-207.</li>
<li>Lee, M. D., & Vanpaemel, W. (<a href="https://link.springer.com/article/10.3758/s13423-017-1238-3">2018</a>). Determining informative priors for cognitive models. <em>Psychonomic Bulletin & Review, 25</em>(1), 114-127.</li>
<li>Lee, M. D. (<a href="https://osf.io/zky2v/">2018</a>). Bayesian methods in cognitive modeling. <em>Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience, 5</em>, 1-48.</li>
<li>Myung, I. J., Kim, C., & Pitt, M. A. (<a href="https://link.springer.com/article/10.3758/BF03198418">2000</a>). Toward an explanation of the power law artifact: Insights from response surface analysis. <em>Memory & Cognition, 28</em>(5), 832-840.</li>
<li>Newell, A. (1973). You can’t play 20 questions with nature and win: Projective comments on the papers of this symposium. In W. G. Chase (Ed.), <em>Visual information processing</em> (pp. 283-308). New York, US: Academic Press.</li>
<li>Newell, A., & Rosenbloom, P. S. (1981). Mechanisms of skill acquisition and the law of practice. In J.R. Anderson (Ed.), <em>Cognitive skills and their acquisition</em> (pp. 1-55). Hillsdale, NJ: Erlbaum.</li>
<li>Vehtari, A., Gelman, A., & Gabry, J. (<a href="https://arxiv.org/abs/1507.02646">2015</a>). Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646.</li>
<li>Vehtari, A., Simpson, D. P., Yao, Y., & Gelman, A. (<a href="https://link.springer.com/article/10.1007/s42113-018-0020-6">2019</a>). Limitations of “Limitations of Bayesian leave-one-out cross-validation for model selection”. <em>Computational Brain & Behavior, 2</em>(1), 22-27.</li>
<li>Vehtari, A., Gelman, A., & Gabry, J. (<a href="https://link.springer.com/article/10.1007/s11222-016-9696-4">2017</a>). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. <em>Statistics and Computing, 27</em>(5), 1413-1432.</li>
<li>Wagenmakers, E. J., & Brown, S. (<a href="http://www.ejwagenmakers.com/2007/WagenmakersBrown2007.pdf">2007</a>). On the linear relation between the mean and the standard deviation of a response time distribution. <em>Psychological Review, 114</em>(3), 830-841.</li>
<li>Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (<a href="https://projecteuclid.org/euclid.ba/1516093227">2018</a>). Using stacking to average Bayesian predictive distributions (with discussion). <em>Bayesian Analysis, 13</em>(3), 917-1003.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This blog post is heavily based on the modeling work of Evans et al. (2018). If you want to know more, I encourage you to check out the paper! <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You may also rewrite the power law as an exponential function, i.e., $f_p(N) = \alpha + \beta e^{-r \, \text{log} N}$, to see their algebraic difference more clearly. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>As it turns out, there is a connection between the two: “[…] the marginal likelihood is formally equivalent to exhaustive leave-$p$-out cross-validation averaged over all values of $p$ and all held-out test sets when using the log posterior predictive probability as a scoring rule” (Fong & Holmes, 2019). <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Note that, if a simpler model is not ruled out by the data, then it might be reasonable to obtain evidence in favour of it, even though a more complex model has generated the data. For $n \rightarrow \infty$, we would probably still want to have consistent model selection; that is, select the model which actually generated the data, assuming that it is in the set of models we are considering. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>For data points for which $k > 0.7$, Vehtari et al. (2017) suggest to compute the predictive density directly, instead of relying on Pareto smoothed importance sampling. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>I am not sure whether Vehtari would like this “Stacked Pseudo Bayes factor”. <a href="https://discourse.mc-stan.org/t/interpreting-elpd-diff-loo-package/1628/2">Here</a>, he seems to suggest that one should choose a model when its stacking weight is 1. Otherwise, I suppose his philosophy is aligned with that of Gelman & Shalizi (2013), i.e., expand the model so that it can account for whatever the other model does better. Update: <a href="https://twitter.com/avehtari/status/1134121009282539521">Here’s</a> Vehtari himself. Three cheers for the web! <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Although there is a temporal order in the data, as trial 21 cannot, for example, come before trial 12, we are not particularly interested in predicting the future using past observations. Therefore, using vanilla loo seems adequate. If one is interested in predicting future observations, one could use approximate <em>leave-future-out</em> cross-validation, see Bürkner, Gabry, & Vehtari (<a href="https://arxiv.org/abs/1902.06281">2019</a>). <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Michael Lee argues that this idea of trying to recover the “true” model is misguided. Bayes’ rule gives the optimal means to select among models. Given a particular data set, if the Bayes factor favours a model that did, in fact, not generate the data, it is still correct to select this model. After all, it predicts the data better than the model that generated it. Lee (2018) distinguishes between <em>inference</em> (saying what follows from a model and data) and <em>inversion</em> (recovering truth). <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>As Lee (2018, p. 27) points out, the word “fit” is unhelpful in a Bayesian context. There are no degrees of freedom; once the model is specified, inference follows directly from probability theory. So “updating a model” is better terminology than “fitting a model”. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>One can also do this without writing a custom function by using the standard lognormal function, and then just subtracting the shift parameter $\delta$ in the function call. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderPractice makes better. And faster. But what exactly is the relation between practice and reaction time? In this blog post, we will focus on two contenders: the power law and exponential function. We will implement these models in Stan and extend them to account for learning plateaus and the fact that, with increased practice, not only the mean reaction time but also its variance decreases. We will contrast two perspectives on predictive model comparison: a (prior) predictive perspective based on marginal likelihoods, and a (posterior) predictive perspective based on leave-one-out cross-validation. So let’s get started!1 Two models We can model the relation between reaction time (in seconds) and the number of practice trials as a power law function. Let $f: \mathbb{N} \rightarrow \mathbb{R}^+$ be a function that maps the number of trials to reaction times. We write where $\alpha$ is a lower bound (one cannot respond faster than that due to processing and motor control limits); $\beta$ is the learning gain from practice with respect to the first trial ($N = 1$); $N$ indexes the particular trial; and $r$ is the learning rate. Similarly, we can write where the parameters have the same interpretation, except that $\beta$ is the learning gain from practice compared to no practice ($N = 0$).2 What is the main difference between those two functions? The exponential model assumes a constant learning rate, while the power model assumes diminishing returns. To see this, let $\alpha = 0$, and ignore for the moment that $N$ is discrete. Taking the derivative for the power law model results in which shows that the local learning rate — the change in reaction time as a function of $N$ — is $-r/N$; it depends on how many trials have been completed previously. The more one has practiced, the smaller the local learning rate $-r / N$. 
The exponential function, in contrast, shows no such dependency on practice: The figure above visualizes two data sets, generated from either a power law (left) or an exponential model (right), as well as the maximum likelihood fit of both models to these data. It is rather difficult to tell which model performs best just by eyeballing the fit. We thus need to engage in a more formal way of comparing models. Two perspectives on prediction Let’s agree that the best way to compare models is to look at predictive accuracy. Predictive accuracy with respect to what? The figure below illustrates two different answers one might give. The grey shaded surface represents unobserved data; the white island inside is the observed data; the model is denoted by $\mathcal{M}$. On the left, the model makes predictions before seeing any data by means of its prior predictive distribution. The predictive accuracy is then evaluated on the actually observed data. In contrast, on the right, the model makes predictions after seeing the data by means of its posterior predictive distribution. In principle, its predictive accuracy is evaluated on data one does not observe (visualized as the grey area). One can estimate this expected out-of-sample predictive accuracy by cross-validation procedures which partition the observed data into a training and test set. The model only sees the training set, and makes predictions for the unseen test set. One key practical distinction between these two perspectives is how predictions are generated. On the left, predictions are generated from the prior. On the right, the prior first gets updated to the posterior using the observed data, and it is through the posterior that predictions are made. In the next two sections, we make this difference more precise.3 Prior prediction: Marginal likelihoods From this perspective, we weight the model’s prediction of the observed data given a particular parameter setting by the prior. 
This is accomplished by integrating the likelihood with respect to the prior, which gives the so-called marginal likelihood of a model $\mathcal{M}$: It is clear that the prior matters a great deal, but that is no surprise — it is part of the model. A bad prior means a bad model. The ratio of two such marginal likelihoods is known as the Bayes factor. If one is willing to assign priors to models, one can compute posterior model probabilities, i.e., where $K$ is the number of models under consideration (see also a previous blogpost). Observe that the marginal likelihood features prominently: it is an updating factor from prior to posterior model probability. With this, one can also compute Bayesian model-averaged predictions: where $\tilde{y}$ is unseen data, and where the prediction of each model gets weighted by its posterior probability. We denote a model weight, which in this case is its posterior probability, as $w_k$. Posterior prediction: Leave-one-out cross-validation Another perspective aims to estimate the expected out-of-sample prediction error, or expected log predictive density, i.e., where the expectation is taken with respect to unseen data $\tilde{y}$ (which is visualized as a grey surface with an $?$ inside in the figure above). Clearly, as we do not have access to unseen data, we cannot evaluate this. However, one can approximate this quantity by computing the leave-one-out prediction error in our sample: where $y_i$ is the $i^{\text{th}}$ data point, and $y_{-i}$ are all data points except $y_i$, where we have suppressed conditioning on $\mathcal{M}$ to not clutter notation (even more), and where Observe that this requires integrating over the posterior distribution of $\theta$ given all but one data point; this is in contrast to the marginal likelihood perspective, which requires integration with respect to the prior distribution. 
From this perspective, one can similarly compute model weights $w_k$ where $\widehat{\text{elpd}}_{\text{loo}}^{\mathcal{M}_k}$ is the loo estimate for the expected log predictive density for model $\mathcal{M}_k$. For prediction, one averages across models using these Pseudo Bayesian model-averaging weights (Yao et al., 2018, p. 92). However, Yao et al. (2018) and Vehtari et al. (2019) recommend against using these Pseudo BMA weights, as they do not take the uncertainty of the loo estimates into account. Instead, they suggest using Pseudo-BMA+ weights or stacking. For details, see Yao et al. (2018). For an illuminating discussion about model selection based on marginal likelihoods or leave-one-out cross-validation, see Gronau & Wagenmakers (2019a), Vehtari et al. (2019), and Gronau & Wagenmakers (2019b). Now that we have taken a look at these two perspectives on prediction, in the next section, we will implement the power law and the exponential model in Stan. Implementation in Stan As is common in psychology, people do not deterministically follow a power law or an exponential law. Instead, the law is probabilistic: given the same task, the person will respond faster or slower, never exactly as before. To allow for this, we assume that there is Gaussian noise around the function value. Note, however, that reaction times are not normally distributed; we address this later. We make the same assumption for the exponential model. The following code implements the power law model in Stan. This blog post is heavily based on the modeling work of Evans et al. (2018). If you want to know more, I encourage you to check out the paper! ↩ You may also rewrite the power law as an exponential function, i.e., $f_p(N) = \alpha + \beta e^{-r \, \text{log} N}$, to see their algebraic difference more clearly.
↩ As it turns out, there is a connection between the two: “[…] the marginal likelihood is formally equivalent to exhaustive leave-$p$-out cross-validation averaged over all values of $p$ and all held-out test sets when using the log posterior predictive probability as a scoring rule.” (Fong & Holmes, 2019). ↩Two perspectives on regularization2019-04-15T12:00:00+00:002019-04-15T12:00:00+00:00https://fabiandablander.com/r/Regularization<p>Regularization is the process of adding information to an estimation problem so as to avoid extreme estimates. Put differently, it safeguards against foolishness. Both Bayesian and frequentist methods can incorporate prior information which leads to regularized estimates, but they do so in different ways. In this blog post, I illustrate these two different perspectives on regularization on the simplest example possible — estimating the bias of a coin.</p>
<!-- When I am too lazy to cook porridge, I usually buy bread from the local bakery and have bread with (vegan) butter for breakfast. Assume I am unusually clumsy, and my freshly spread slice of bread slips out of my hand, onto the floor. Did the butter land on the floor? Yes! How can we model this process? -->
<h2 id="modeling-coin-flips">Modeling coin flips</h2>
<p>Let’s say that we are interested in estimating the bias of a coin, which we take to be the probability of the coin showing heads.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In this section, we will derive the Binomial likelihood — the statistical model that we will use for modeling coin flips. Let $X \in \{0, 1\}$ be a discrete random variable with realization $X = x$. Flipping the coin once, let the outcome $x = 0$ correspond to tails and $x = 1$ to heads. We use the Bernoulli likelihood to connect the data to the latent parameter $\theta$, which we take to be the bias of the coin:</p>
<script type="math/tex; mode=display">p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x} \enspace .</script>
<p>There is no point in estimating the bias by flipping the coin only once. We are therefore interested in a model that can account for $n$ coin flips. If we are willing to assume that the individual coin flips are <em>independent and identically</em> distributed conditional on $\theta$, we can obtain the joint probability of all outcomes by multiplying the probability of the individual outcomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(x_1, \ldots, x_n \mid \theta) &= \prod_{i=1}^n p(x_i \mid \theta) \\[.5em]
&= \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \\[.5em]
&= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{ \sum_{i=1}^n 1 - x_i} \enspace .
\end{aligned} %]]></script>
<p>For the purposes of estimating the coin’s bias, we actually do not care about the order in which the coins come up heads or tails; we only care about how frequently the coin shows heads or tails out of $n$ throws. Thus, we do not model the individual outcomes $X_i$, but instead model their sum $Y = \sum_{i=1}^n X_i$. We write:</p>
<script type="math/tex; mode=display">p(y \mid \theta) = \theta^{y} (1 - \theta)^{n - y} \enspace ,</script>
<p>where we suppress conditioning on $n$ to not clutter notation. Note that our model is not complete — we need to account for the fact that there are several ways to get $y$ heads out of $n$ throws. For example, we can get $y = 2$ with $n = 3$ in three different ways: $(1, 1, 0)$, $(0, 1, 1)$, and $(1, 0, 1)$. If we were to use the model above, we would underestimate the probability of observing two heads out of three coin tosses by a factor of three.</p>
<p>In general, there are $n!$ possible ways in which we can order the outcomes. To see this, think of $n$ containers. The first outcome can go in any container, the second one in any container but the container which houses the first outcome, and so on, which yields:</p>
<script type="math/tex; mode=display">n \times (n - 1) \times (n - 2) \ldots \times 1 = n! \enspace .</script>
<p>However, the order of the $(n - y)$ tails among themselves does not matter, so we divide by the $(n - y)!$ ways of arranging them. Similarly, once we have picked out the $y$ heads, we do not care about <em>their</em> order; thus we divide by another $y!$ permutations. Therefore, for coin flip sequences of length $n$, there are</p>
<script type="math/tex; mode=display">\frac{n!}{y!(n - y)!} = {n \choose y}</script>
<p>ways to get $y$ heads out of $n$ throws. The funny looking symbol on the right is the <em>Binomal coefficient</em>. The probability of the data is therefore given by the Binomial likelihood:</p>
<script type="math/tex; mode=display">p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,</script>
<p>which just adds the term ${n \choose y}$ to the equation we had above after introducing $Y$. For the example of observing $y = 2$ heads out of $n = 3$ coin flips, the Binomial coefficient is ${3 \choose 2} = 3$, which accounts for the fact that there are three possible ways to get two heads out of three throws.</p>
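As a quick sanity check (not part of the original derivation), we can verify these numbers in R; the built-in <code>dbinom</code> implements exactly this Binomial likelihood:

```r
# Probability of y = 2 heads in n = 3 flips of a fair coin,
# first by hand via the Binomial coefficient, then with dbinom()
n <- 3; y <- 2; theta <- 0.5
coef <- factorial(n) / (factorial(y) * factorial(n - y))  # the Binomial coefficient
coef                                   # 3
coef * theta^y * (1 - theta)^(n - y)   # 0.375
dbinom(y, size = n, prob = theta)      # 0.375, the same
```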
<h2 id="the-data">The data</h2>
<p>Assume we flip the coin three times, $n = 3$, and observe three heads, $y = 3$. How can we estimate the bias of the coin? In the next sections, we will use the Binomial likelihood derived above and discuss three different ways of estimating the coin’s bias: maximum likelihood estimation, Bayesian estimation, and penalized maximum likelihood estimation.</p>
<h2 id="classical-estimation">Classical estimation</h2>
<p>Within the frequentist paradigm, the method of maximum likelihood is arguably the most popular method for parameter estimation: choose as an estimate for $\theta$ the value which maximizes the likelihood of the data.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> To get a feeling for how the likelihood of the data differs across values of $\theta$, let’s pick two values, $\theta_1 = .5$ and $\theta_2 = 1$, and compute the likelihood of observing three heads out of three coin flips:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y = 3 \mid \theta = .5) &= {3 \choose 3} .5^3 (1 - .5)^{3 - 3} = 0.125 \\[.5em]
p(y = 3 \mid \theta = 1) &= {3 \choose 3} 1^3 (1 - 1)^{3 - 3} = 1 \enspace .
\end{aligned} %]]></script>
<p>We therefore conclude that the data are more likely for a coin that has bias $\theta_2 = 1$ than for a coin that has bias $\theta_1 = 0.5$. But is it the <em>most</em> likely value? To compare all possible values for $\theta$ visually, we plot the likelihood as a function of $\theta$ below. The left figure shows that, indeed, $\theta = 1$ maximizes the likelihood for the data. The right figure shows the likelihood function for $y = 15$ heads out of $n = 20$ coin flips. Note that, in contrast to probabilities, which need to sum to one, likelihoods do not have a natural scale.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>Do these two examples allow us to derive a general principle for how to estimate the bias of a coin? Let $\hat{\theta}$ denote an estimate of the population parameter $\theta$. The two figures above suggest that $\hat{\theta} = \frac{y}{n}$ is the maximum likelihood estimate for an arbitrary data set $d = (y, n)$ … and it is! To arrive at this mathematically, we can find the maximum of this likelihood function by taking the derivative with respect to $\theta$, and setting it to zero (see also a <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">previous</a> post). In other words, we solve for the value of $\theta$ at which the derivative is zero; and since the Binomial likelihood is unimodal, this maximum will be unique. Note that the value for $\theta$ at which the likelihood function has its maximum does not change when we take logs, but since taking logs greatly simplifies the mathematics, we do so:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= \frac{\partial}{\partial \theta}\text{log}\left({n \choose y} \theta^y (1 - \theta)^{n - y}\right) \\[.5em]
0 &= \frac{\partial}{\partial \theta}\left(\text{log}{n \choose y} + y \, \text{log}\theta + (n - y) \, \text{log}(1 - \theta)\right) \\[.5em]
0 &= \frac{y}{\theta} - \frac{n - y}{1 - \theta}\\[.5em]
\frac{n - y}{1 - \theta} &= \frac{y}{\theta} \\[.5em]
\theta (n - y) &= (1 - \theta) y \\[.5em]
\theta n - \theta y &= y - \theta y \\[.5em]
\theta n &= y \\[.5em]
\theta &= \frac{y}{n} \enspace ,
\end{aligned} %]]></script>
<p>which shows that indeed $\frac{y}{n}$ is the maximum likelihood estimate.</p>
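We can also confirm this numerically (a sketch, reusing the $y = 15$, $n = 20$ example from above): maximizing the Binomial log likelihood over $\theta$ recovers $y / n$.

```r
# Numerically maximize the Binomial log likelihood and compare to y / n
y <- 15; n <- 20
log_lik <- function(theta) dbinom(y, size = n, prob = theta, log = TRUE)
opt <- optimize(log_lik, interval = c(0, 1), maximum = TRUE)
opt$maximum  # approximately 0.75, i.e., y / n
```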
<h2 id="bayesian-estimation">Bayesian estimation</h2>
<p>Bayesians assign priors to parameters in addition to the likelihood, which takes a central role in all statistical paradigms. For this Binomial problem, we assign $\theta$ a Beta prior:</p>
<script type="math/tex; mode=display">p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .</script>
<p>As we will see below, this prior allows easy Bayesian updating while being sufficiently flexible in incorporating prior information. The figure below shows different Beta distributions, formalizing our prior belief about values of $\theta$.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>The figure in the top left corner assigns uniform prior plausibility to all values of $\theta$; the figures to its right incorporate a slight bias towards the extreme values $\theta = 1$ and $\theta = 0$. With increasing $a$ and $b,$ the prior becomes more biased towards $\theta = 0.5$; with decreasing $a$ and $b$, the prior becomes biased against $\theta = 0.5$.</p>
<p>As shown in a <a href="https://fdabl.github.io/r/Spike-and-Slab.html">previous</a> blog post, the Beta distribution is <em>conjugate</em> to the Binomial likelihood, which means that the posterior distribution of $\theta$ is again a Beta distribution:</p>
<script type="math/tex; mode=display">p(\theta \mid y) = \frac{1}{\text{B}(a', b')} \theta^{a' - 1} (1 - \theta)^{b' - 1} \enspace ,</script>
<p>where $a’ = a + y$ and $b’ = b + n - y$. Under this conjugate setup, the parameters of the prior can be understood as prior data; for example, if we choose prior parameters $a = b = 1$, then we assume that we have seen one heads and one tails prior to data collection. The figure below shows two examples of such Bayesian updating processes.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>In both cases, we observe $y = 3$ heads out of $n = 3$ coin flips. On the left, we assign $\theta$ a uniform prior. The resulting posterior distribution is proportional to the likelihood (which we have rescaled to fit nicely in the graph) and thus does not appear as a separate line. After we have seen the data, we can compute the posterior mode as our estimate for the most likely value of $\theta$. Observe that the posterior mode is equivalent to the maximum likelihood estimate:</p>
<script type="math/tex; mode=display">\hat{\theta}_{\text{PM}} = \frac{a' - 1}{a' + b' - 2} = \frac{1 + y - 1}{1 + y + 1 + n - y - 2} = \frac{y}{n} = \hat{\theta}_{\text{MLE}} \enspace .</script>
<p>This is in fact the case for all statistical estimation problems where we assign a uniform prior to the (possibly high-dimensional) parameter vector $\theta$. To prove this, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\hat{\theta}_{\text{PM}} &= \underset{\theta}{\text{argmax}} \, \frac{p(y \mid \theta) \, p(\theta)}{p(y)} \\[.5em]
&= \underset{\theta}{\text{argmax}} \, p(y \mid \theta) \\[.5em]
&= \hat{\theta}_{\text{MLE}} \enspace ,
\end{aligned} %]]></script>
<p>since we can drop the normalizing constant $p(y)$, because it does not depend on $\theta$, and $p(\theta)$, because it is a constant assigning all values of $\theta$ equal probability.</p>
<p>Using a Beta prior with $a = b = 2$, as shown on the right side of the figure above, we see that the posterior is not proportional to the likelihood anymore. This in turn means that the mode of the posterior distribution no longer corresponds to the maximum likelihood estimate. In this case, the posterior mode is:</p>
<script type="math/tex; mode=display">\hat{\theta}_{\text{PM}} = \frac{5 - 1}{5 + 2 - 2} = 0.8 \enspace .</script>
<p>In contrast to earlier, this estimate is <em>shrunk</em> towards $\theta = 0.5$. This came about because we used prior information stating that $\theta = 0.5$ is more likely than other values (see the figure with $a = b = 2$ above). Consequently, we were less swayed by observing three heads out of three throws, an outcome that is somewhat unlikely under no bias ($\theta = 0.5$). It should thus not come as a surprise that Bayesian priors <em>can</em> act as regularizing devices. However, this requires careful application, especially in small sample size settings.</p>
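Both posterior modes can be reproduced with a few lines of R (a hypothetical helper function, not code from the post, using the conjugate updating rule $a' = a + y$, $b' = b + n - y$):

```r
# Mode of the Beta posterior after observing y heads in n flips
# under a Beta(a, b) prior
posterior_mode <- function(a, b, y, n) {
  a_post <- a + y
  b_post <- b + n - y
  (a_post - 1) / (a_post + b_post - 2)
}
posterior_mode(1, 1, 3, 3)  # 1: the uniform prior recovers the MLE y / n
posterior_mode(2, 2, 3, 3)  # 0.8: the Beta(2, 2) prior shrinks towards 0.5
```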
<p>In a <em>Post Scriptum</em> to this blog post, I similarly show how the posterior mean, which is arguably a more natural point estimate as it takes the uncertainty about $\theta$ into account better than the posterior mode does, can be viewed as a regularized estimate, too.</p>
<h2 id="penalized-estimation">Penalized estimation</h2>
<p>Bayesians are not the only ones who can add prior information to an estimation problem. Within the frequentist framework, penalized estimation methods add a penalty term to the log likelihood function, and then find the parameter value which maximizes this <em>penalized log likelihood</em>. We can implement such a method by optimizing an extended log likelihood:</p>
<script type="math/tex; mode=display">y \, \text{log}\,\theta + (n - y) \, \text{log} \, (1 - \theta) - \underbrace{\lambda (\theta - 0.5)^2}_{\text{Penalty Term}} \enspace ,</script>
<p>where we penalize values that are far from the parameter value which indicates no bias, $\theta = 0.5$. The larger $\lambda$, the more strongly values of $\theta \neq 0.5$ get penalized. In addition to picking $\lambda$, the particular form of the penalty term is also important. Similar to assigning $\theta$ a prior distribution, although possibly less straightforward and less flexible, choosing the penalty term means incorporating information about the problem in addition to specifying a likelihood function. Above, we have used the <em>squared distance</em> from $\theta = 0.5$ as a penalty. We call this the $\mathcal{L}_2$-norm penalty<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>, but the $\mathcal{L}_1$-norm, which takes the <em>absolute distance</em>, is an equally interesting choice:</p>
<script type="math/tex; mode=display">y \, \text{log}\,\theta + (n - y) \, \text{log} \, (1 - \theta) - \lambda |\theta - 0.5| \enspace ,</script>
<p>As we will see below, these penalties have very different effects.</p>
<p>The penalized likelihood depends not only on $\theta$, but also on $\lambda$. The code below evaluates the penalized log likelihood function given values for these two parameters. Note that we drop the normalizing constant ${n \choose y}$ as it depends neither on $\theta$ nor on $\lambda$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0.001</span><span class="p">,</span><span class="w"> </span><span class="m">.999</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">theta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">reg</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">get_penalized_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Vectorize</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span></code></pre></figure>
<p>With only three data points it is futile to try to estimate $\lambda$ using, for example, cross-validation; however, this is also not the goal of this blog post. Instead, to get further intuition, we simply try out a number of values for $\lambda$ using the code below and see how it influences our estimate of $\theta$. Because the parameter space has only one dimension, we can easily find the value for $\theta$ which maximizes the penalized likelihood even without wearing our calculus hat. Specifically, given a particular value for $\lambda$, we evaluate the penalized likelihood function for a range of values of $\theta$ between zero and one and pick the value that maximizes it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">estimate_path</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lambda_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">.01</span><span class="p">)</span><span class="w">
</span><span class="n">theta_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">theta_best</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="nf">seq_along</span><span class="p">(</span><span class="n">lambda_seq</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">penalized_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_penalized_likelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta_seq</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_seq</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">reg</span><span class="p">)</span><span class="w">
</span><span class="n">theta_seq</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="n">penalized_likelihood</span><span class="p">)]</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">lambda_seq</span><span class="p">,</span><span class="w"> </span><span class="n">theta_best</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Sticking with the observations of three heads ($y = 3$) out of three throws ($n = 3$), the figure below plots the best-fitting values for $\theta$ given a range of values for $\lambda$. Observe that the $\mathcal{L}_1$-norm penalty shrinks it more quickly, snapping exactly to $\theta = 0.5$ at $\lambda = 6$, while the $\mathcal{L}_2$-norm penalty gradually (and rather slowly) shrinks the parameter to $\theta = 0.5$ with increasing $\lambda$. Why is this so?</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
<p>First, note that because $\theta \in [0, 1]$ the squared distance will always be smaller than the absolute distance, which explains the slower shrinkage. Second, the fact that the $\mathcal{L}_1$-norm penalty can shrink <em>exactly</em> to $\theta = 0.5$ is due to the kink of the absolute value function, whose derivative is discontinuous at zero. The figures below provide some intuition. In particular, the figure on the left shows the $\mathcal{L}_1$-norm penalized likelihood function for a select number of $\lambda$’s. We see that for $\lambda < 3$, the value $\theta = 1$ performs best. With $\lambda \in [3, 6]$, values of $\theta \in [0.5, 1]$ become more likely than the extreme estimate $\theta = 1$. For $\lambda \geq 6$, the ‘no bias’ value $\theta = 0.5$ maximizes the penalized likelihood. Due to this kink in the penalty, the shrinkage is exact. The $\mathcal{L}_2$-norm penalty, on the other hand, shrinks less strongly, and never exactly to $\theta = 0.5$, except of course for $\lambda \rightarrow \infty$. We can see this in the right figure below, where the penalized likelihood function is merely shifted to the left with increasing $\lambda$; this is in contrast to the $\mathcal{L}_1$-norm penalized likelihood on the left, for which the value $\theta = 0.5$ at the kink takes a special place.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
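We can check the $\lambda = 6$ threshold numerically with a minimal sketch (assuming $y = n = 3$ as above, so the $(n - y) \, \text{log}(1 - \theta)$ term vanishes):

```r
# With the L1 penalty and y = n = 3, the unpenalized log likelihood 3 log(theta)
# has slope y / theta = 6 at theta = 0.5; the maximizer snaps to 0.5 once
# lambda exceeds 6
pen_log_lik <- function(theta, lambda) 3 * log(theta) - lambda * abs(theta - 0.5)
theta_grid  <- seq(0.001, 0.999, 0.001)
best_theta  <- function(lambda) theta_grid[which.max(pen_log_lik(theta_grid, lambda))]
best_theta(5)  # 0.6, still above 0.5
best_theta(7)  # exactly 0.5
```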
<p>You can play around with the code below to get an intuition for how different values of $\lambda$ influence the penalized likelihood function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'latex2exp'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Penalized Likelihood'</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">nl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">lambdas</span><span class="p">)</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">.999</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nl</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span><span class="w">
</span><span class="n">normalize</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nl</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_penalized_likelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">reg</span><span class="p">)</span><span class="w">
</span><span class="n">likelihood</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">normalize</span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">log_likelihood</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">likelihood</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylab</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$\\theta$'</span><span class="p">),</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">
</span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nl</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">likelihood</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylab</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.2</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s1">'$\\lambda = %.2f$'</span><span class="p">,</span><span class="w"> </span><span class="n">l</span><span class="p">)))</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s1">'topleft'</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">info</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nl</span><span class="p">),</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">box.lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">lambdas</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">)</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$L_1$ Penalized Likelihood'</span><span class="p">))</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$L_2$ Penalized Likelihood'</span><span class="p">))</span></code></pre></figure>
<p>In practice, one would reparameterize this model as a logistic regression, and use cross-validation to estimate the best value for $\lambda$; see the <em>Post Scriptum</em> for a sketch of this approach.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen two perspectives on regularization illustrated on a very simple example: estimating the bias of a coin. We first derived the Binomial likelihood, connecting the data to a parameter $\theta$ which we took to be the bias of the coin, as well as the maximum likelihood estimate. Observing three heads out of three coin flips, we became slightly uncomfortable with the (extreme) estimate $\hat{\theta} = 1$. We have seen how, from a Bayesian perspective, one can add prior information to this estimation problem, and how this led to an estimate that was <em>shrunk</em> towards $\theta = 0.5$. Within the frequentist framework, one can add information by augmenting the likelihood function with a penalty term; the type of information we want to incorporate determines the particular penalty term. In this blog post, we have focused on the most commonly used penalty terms: the $\mathcal{L}_1$-norm penalty, which shrinks parameters exactly to a particular value; and the $\mathcal{L}_2$-norm penalty, which provides continuous shrinkage. A future blog post might look into linear regression models, where regularization methods abound, and study how, for example, the popular Lasso can be recast in Bayesian terms.</p>
<hr />
<p><em>I would like to thank Jonas Haslbeck, Don van den Bergh, and Sophia Crüwell for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="posterior-mean">Posterior mean</h3>
<p>You may argue that one should use the mean instead of the mode as a posterior summary measure. If one does this, then there is some shrinkage even in the case of a uniform prior. The mean of the posterior distribution is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[\theta] &= \frac{a'}{a' + b'} \\[.5em]
&= \frac{a + y}{a + y + b + n - y} \\[.5em]
&= \frac{a + y}{a + b + n} \enspace .
\end{aligned} %]]></script>
<p>As so often in mathematics, we can rewrite this in a more complicated manner to gain insight into how Bayesian priors shrink estimates:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[\theta] &= \frac{a}{a + b + n} + \frac{y}{a + b + n} \\[.5em]
&= \frac{a}{a + b + n} \left(\frac{a + b}{a + b}\right) + \frac{y}{a + b + n} \left( \frac{n}{n} \right) \\[.5em]
&= \frac{a + b}{a + b + n} \underbrace{\left(\frac{a}{a + b}\right)}_{\text{Prior mean}} + \frac{n}{a + b + n} \underbrace{\left( \frac{y}{n} \right)}_{\text{MLE}} \enspace .
\end{aligned} %]]></script>
<p>This decomposition shows that the posterior mean is a weighted combination of the prior mean and the maximum likelihood estimate. Since we can think of $a + b$ as the prior data, note that $a + b + n$ can be thought of as the <em>total</em> number of data points. The prior mean is thus weighted by the proportion of prior to total data, while the maximum likelihood estimate is weighted by the proportion of sample data to total data. This provides another perspective on how Bayesian priors regularize estimates.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
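<p>A quick numerical sketch in base R makes this decomposition concrete, using the values $a = b = 2$, $y = 3$, $n = 3$ from the main text:</p>

```r
# Posterior mean of the Beta(a + y, b + n - y) posterior
a <- 2; b <- 2   # prior parameters
y <- 3; n <- 3   # three heads out of three flips

posterior_mean <- (a + y) / (a + b + n)

# The same quantity as a weighted combination of prior mean and MLE
prior_mean <- a / (a + b)
mle <- y / n
weighted <- (a + b) / (a + b + n) * prior_mean + n / (a + b + n) * mle

posterior_mean                       # 5/7, shrunk away from the MLE of 1
all.equal(posterior_mean, weighted)  # TRUE
```

<p>The prior carries weight $\frac{a + b}{a + b + n} = \frac{4}{7}$ here, which is why the estimate sits much closer to $0.5$ than the maximum likelihood estimate does.</p>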
<h3 id="penalized-logistic-regression">Penalized logistic regression</h3>
<p>Cross-validation might be a bit awkward when we represent the data using only $y$ and $n$. We can go back to the product of Bernoulli representation, which uses all individual data points $x_i$. This results in a logistic regression problem with likelihood:</p>
<script type="math/tex; mode=display">p(x_1, \ldots, x_n \mid \beta) = \prod_{i=1}^n \left(\frac{1}{1 + e^{-\beta}}\right)^{x_i} \left(1 - \frac{1}{1 + e^{-\beta}}\right)^{1 - x_i}\enspace ,</script>
<p>where we use a sigmoid function as the link function, and $\beta$ is on the log odds scale. The penalized log likelihood function can be written as</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \left[ x_i \, \text{log} \left(\frac{1}{1 + e^{-\beta}}\right) + (1 - x_i) \, \text{log} \left(1 - \frac{1}{1 + e^{-\beta}}\right) \right] - \lambda |\beta| \enspace ,</script>
<p>where, because $\beta = 0$ corresponds to $\theta = 0.5$, we do not need to subtract anything in the penalty term. This parameterization also makes it easier to study which types of priors on $\beta$ result in an $\mathcal{L}_1$ or $\mathcal{L}_2$ norm penalty (spoiler: it’s a Laplace and a Gaussian, respectively). Such models can be estimated using the R package <em>glmnet</em>, although it does not work for the exceedingly small sample we have played with in this blog post. This seems to imply that regularization is more natural in the Bayesian framework, which additionally allows more flexible specification of prior knowledge.</p>
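<p>For the tiny data set from the main text, one can simply maximize this penalized log-likelihood over a grid of $\beta$ values in base R; the sketch below assumes an $\mathcal{L}_1$ penalty with $\lambda = 1$ and arbitrary grid bounds:</p>

```r
# Penalized log-likelihood in the logistic parameterization (L1 penalty)
pen_loglik <- function(beta, y, n, lambda) {
  theta <- 1 / (1 + exp(-beta))  # sigmoid link: beta is on the log-odds scale
  y * log(theta) + (n - y) * log(1 - theta) - lambda * abs(beta)
}

y <- 3; n <- 3
beta_grid <- seq(-5, 5, length.out = 1e4)

# With lambda = 1 the penalty pulls beta back towards 0 (i.e., theta towards 0.5)
beta_hat  <- beta_grid[which.max(pen_loglik(beta_grid, y, n, lambda = 1))]
theta_hat <- 1 / (1 + exp(-beta_hat))  # roughly 2/3 instead of the MLE of 1
```

<p>With $\lambda = 0$ the maximizer runs off to the edge of the grid, since the maximum likelihood estimate $\hat{\theta} = 1$ corresponds to $\beta \rightarrow \infty$; increasing $\lambda$ shrinks the estimate further towards $\theta = 0.5$.</p>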
<hr />
<h2 id="references">References</h2>
<ul>
<li>Gelman, A., & Nolan, D. (<a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">2002</a>). You can load a die, but you can’t bias a coin. <em>The American Statistician, 56</em>(4), 308-311.</li>
<li>Stigler, S. M. (<a href="https://projecteuclid.org/euclid.ss/1207580174">2007</a>). The Epic Story of Maximum Likelihood. <em>Statistical Science, 22</em>(4), 598-620.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I don’t think anybody actually ever is interested in estimating the bias of a coin. In fact, one <em>cannot bias a coin</em> if we are only allowed to flip it in the usual manner (see Gelman & Nolan, <a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">2002</a>). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In a wonderful paper humbly titled <em>The Epic Story of Maximum Likelihood</em>, Stigler (<a href="https://projecteuclid.org/euclid.ss/1207580174">2007</a>) says that maximum likelihood estimation must have been familiar even to hunters and gatherers, although they would not have used such fancy words, as the idea is exceedingly simple. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Strictly speaking, this is incorrect: the only norm that exists for the one-dimensional vector space is the absolute value norm. Thus, in our example with only one parameter $\theta$ there is no notion of an $\mathcal{L}_2$-norm. However, because of the analogy to the regression and, more generally, multidimensional setting, I hope that this inaccuracy is excused. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>It also shows that in the limit of infinite data, the posterior mean converges to the maximum likelihood estimate. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>