<p><em>Fabian Dablander — The Fibonacci sequence and linear algebra (2019-07-28)</em></p>
<p>Leonardo Bonacci, better known as Fibonacci, has influenced our lives profoundly. At the beginning of the $13^{th}$ century, he introduced the Hindu-Arabic numeral system to Europe. Instead of Roman numerals, where <strong>I</strong> stands for one, <strong>V</strong> for five, <strong>X</strong> for ten, and so on, the Hindu-Arabic numeral system uses position to index magnitude. This leads to much shorter expressions for large numbers.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>While the history of the <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">numerical system</a> is fascinating, this blog post will look at what Fibonacci is arguably most well known for: the <em>Fibonacci sequence</em>. In particular, we will use ideas from linear algebra to come up with a closed-form expression of the $n^{th}$ Fibonacci number<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. On our journey to get there, we will also gain some insights about recursion in R.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h1 id="the-rabbit-puzzle">The rabbit puzzle</h1>
<p>In Liber Abaci, Fibonacci poses the following question (paraphrasing):</p>
<blockquote>
<p>Suppose we have two newly-born rabbits, one female and one male. Suppose these rabbits produce another pair of female and male rabbits after one month. These newly-born rabbits will, in turn, also mate after one month, producing another pair, and so on. Rabbits never die. How many pairs of rabbits exist after one year?</p>
</blockquote>
<p>The Figure below illustrates this process. Every point denotes one rabbit pair over time. To indicate that every newborn rabbit pair needs to wait one month before producing new rabbits, rabbits that are not fertile yet are coloured in grey, while rabbits ready to procreate are coloured in red.</p>
<div style="text-align:center;">
<img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" />
</div>
<p>We can derive a linear recurrence relation that describes the Fibonacci sequence. In particular, note that rabbits never die. Thus, at time point $n$, all rabbits from time point $n - 1$ carry over. Additionally, we know that every fertile rabbit pair will produce a new rabbit pair. However, they have to wait one month, so that the number of fertile rabbit pairs equals the number of rabbit pairs at time point $n - 2$. As a result, the Fibonacci sequence {$F_n$}$_{n=1}^{\infty}$ is:</p>
<script type="math/tex; mode=display">F_n = F_{n-1} + F_{n-2} \enspace ,</script>
<p>for $n \geq 3$ and $F_1 = F_2 = 1$. Before we derive a closed-form expression that computes the $n^{th}$ Fibonacci number directly, in the next section, we play around with alternative, more straightforward solutions in R.</p>
<h1 id="implementation-in-r">Implementation in R</h1>
<p>We can write a wholly inefficient, but beautiful program to compute the $n^{th}$ Fibonacci number:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">))</span></code></pre></figure>
<p>R takes roughly 5 seconds to compute the $30^{\text{th}}$ Fibonacci number; computing the $40^{\text{th}}$ number exhausts my patience. This recursive solution is not particularly efficient because R executes the function far more often than necessary. For example, the call tree for <em>fib(5)</em> is:</p>
<ul>
<li><em>fib(5)</em></li>
<li><em>fib(4)</em> + <em>fib(3)</em></li>
<li>(<em>fib(3)</em> + <em>fib(2)</em>) + (<em>fib(2)</em> + <em>fib(1)</em>)</li>
<li>((<em>fib(2)</em> + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
<li>((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
</ul>
<p>which shows that <em>fib(2)</em> was called three times. This is not necessary, as we can store the outcome of this function call instead of recomputing it every time. This technique is called <a href="https://en.wikipedia.org/wiki/Memoization">memoization</a> (see also the R package <a href="https://github.com/r-lib/memoise">memoise</a>). Implementing this leads to:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_mem</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">inside</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">cache</span><span class="p">))</span><span class="w">
</span><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">inside</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w"> </span><span class="o"><<-</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>This computes the $1000^{th}$ Fibonacci number in a tenth of a second. We can, of course, write this sequentially, and also store all intermediate Fibonacci numbers. This also avoids memory issues brought about by the recursive implementation. Interestingly, although this algorithm seems like it should be $O(n)$, it is actually $O(n^2)$, since we are adding increasingly large numbers (for more on this, see <a href="https://catonmat.net/linear-time-fibonacci">here</a>).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">num</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-2</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">num</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The first 30 Fibonacci numbers are: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040.</p>
<p>This is a rapid increase, as made apparent by the left Figure below. The Figure on the right shows that there is structure in how the sequence grows.</p>
<p><img src="/assets/img/2019-07-28-Fibonacci.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>We will return to the structure in growth at the end of the blog post. First, we need to derive a closed-form expression of the $n^{th}$ Fibonacci number. In the next section, we take a step towards that by realizing that diagonal matrices make for easier computations.</p>
<h1 id="diagonal-matrices-are-good">Diagonal matrices are good</h1>
<p>Our goal is to get a closed form expression of the $n^{th}$ Fibonacci number. The first thing to note is that, due to linear recursion, we can view the Fibonacci numbers as applying a linear map. In particular, define $T \in \mathcal{L}(\mathbb{R}^2)$ by:</p>
<script type="math/tex; mode=display">T(x, y) = (y, x + y) \enspace .</script>
<p>We note that:</p>
<script type="math/tex; mode=display">T^n(0, 1) = (F_n, F_{n+1}) \enspace ,</script>
<p>which we will prove by induction. In particular, note that the base case $n = 1$:</p>
<script type="math/tex; mode=display">T^1(0, 1) = (1, 0 + 1) = (1, 1) = (F_1, F_2) \enspace ,</script>
<p>does in fact give the first two Fibonacci numbers. Now for the induction step: we assume that this holds for an arbitrary $n$, and we show that it holds for $n + 1$ using the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T^n(0, 1) &= (F_n, F_{n+1}) \\[1em]
T(T^n(0, 1)) &= T(F_n, F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_n + F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_{n+2}) \enspace .
\end{aligned} %]]></script>
<p>The last equality follows from the definition of the Fibonacci sequence, i.e., the fact that any number is equal to the sum of the previous two numbers. The matrix of this linear map with respect to the standard basis is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
A \equiv \mathcal{M}(T) = \begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \enspace , %]]></script>
<p>since $T(1, 0) = (0, 1)$ and $T(0, 1) = (1, 1)$. Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y \\ x + y \end{pmatrix} \enspace . %]]></script>
<p>In the sequential R code for computing the Fibonacci numbers, we have applied the linear map $n$ times, which gave us the Fibonacci number we were interested in. We can write this in matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
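<p>As a quick sanity check of this matrix form (a minimal sketch; the variable names <code>A</code> and <code>x</code> are mine), we can apply the matrix repeatedly in R:</p>

```r
# Fibonacci matrix: its columns are T(1, 0) = (0, 1) and T(0, 1) = (1, 1)
A <- matrix(c(0, 1, 1, 1), nrow = 2)

# Apply the linear map ten times to the starting vector (0, 1)
x <- c(0, 1)
for (i in seq(10)) {
  x <- A %*% x
}

x  # (F_10, F_11) = (55, 89)
```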
<p>If you were to compute, say, the $3^{rd}$ Fibonacci number using this equation, you would have to multiply $A$ with itself three times. Now assume you had something like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Using the above equation, the matrix powers would become trivial:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n = \begin{pmatrix} \lambda_1^n & 0 \\ 0 & \lambda_2^n \end{pmatrix} \enspace . %]]></script>
<p>There would be no need to repeatedly engage in matrix multiplication; instead, we would arrive at the $n^{th}$ Fibonacci number using only scalar multiplication! Our task is thus as follows: find a new matrix for the linear map which is diagonal. To solve this, we will need eigenvalues and eigenvectors.</p>
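<p>The computational payoff is easy to see in R; a small illustration with made-up diagonal entries (the values 2 and 3 are arbitrary, chosen only for the demonstration):</p>

```r
# Powers of a diagonal matrix reduce to scalar powers of its entries
L <- diag(c(2, 3))

L_fast <- diag(c(2, 3)^5)                 # scalar powers on the diagonal
L_slow <- Reduce(`%*%`, rep(list(L), 5))  # five matrix multiplications

all.equal(L_fast, L_slow)  # TRUE
```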
<h1 id="finding-eigenvalues-and-eigenvectors">Finding eigenvalues and eigenvectors</h1>
<p>An eigenvector-eigenvalue pair $(v, \lambda)$ with $v \neq 0$ satisfies:</p>
<script type="math/tex; mode=display">Tv = \lambda v \enspace ,</script>
<p>which means that for a particular vector $v$, the linear map only stretches the vector by a constant $\lambda$. Here’s the key: using the eigenvectors as a basis, the matrix of the linear map is diagonal. This is because the matrix of a linear map with respect to a basis $(v_1, v_2)$ is defined by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= A_{11} v_1 + A_{21} v_2 \\
Tv_2 &= A_{12} v_1 + A_{22} v_2 \enspace .
\end{aligned} %]]></script>
<p>Now since the basis consists only of eigenvectors, we know that $Tv_1 = \lambda_1 v_1$ and $Tv_2 = \lambda_2 v_2$, which implies that $A_{11} = \lambda_1$ and $A_{21} = 0$, as well as $A_{12} = 0$ and $A_{22} = \lambda_2$. For a wonderful explanation of eigenvalues and eigenvectors, see <a href="https://www.youtube.com/watch?v=PFDu9oVAE-g">this video</a> by 3Blue1Brown.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
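<p>Before deriving the eigenpairs by hand, note that we could also ask R directly; a minimal sketch using the built-in <code>eigen</code> function (which normalizes eigenvectors to unit length, so they come out as scaled versions of the ones derived below):</p>

```r
# Eigendecomposition of the Fibonacci matrix via R's built-in eigen()
A <- matrix(c(0, 1, 1, 1), nrow = 2)
ev <- eigen(A)

ev$values                                # 1.618034 and -0.618034
c((1 + sqrt(5)) / 2, (1 - sqrt(5)) / 2)  # the same values, written out
```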
<p>In order to find the eigenvalues and eigenvectors, note that the linear map satisfies the following two equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T(x, y) &= \lambda (x, y) \\[1em]
T(x, y) &= (y, x + y) \enspace .
\end{aligned} %]]></script>
<p>This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= y \\[1em]
\lambda y &= x + y \enspace .
\end{aligned} %]]></script>
<p>We substitute the first expression into the second one, yielding:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 x &= x + y \\[1em]
(\lambda^2 - 1)x &= y \enspace ,
\end{aligned} %]]></script>
<p>Substituting this back into the first equation results in:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= (\lambda^2 - 1)x\\[1em]
0 &= \lambda^2 - \lambda - 1\enspace .
\end{aligned} %]]></script>
<p>We can now apply the <em>quadratic formula</em> or “Mitternachtsformel”, as it is called in parts of Germany because students should know the formula when they are roused from sleep at midnight. We are neither in Germany, nor is it midnight, nor can I actually remember the formula, so let’s quickly derive it for our problem:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 - \lambda - 1 &= 0 \\[1em]
\lambda^2 - \lambda &= 1 \\[1em]
4\lambda^2 - 4\lambda &= 4 \\[1em]
4\lambda^2 - 4\lambda + 1&= 4 + 1 \\[1em]
(2\lambda - 1)^2&= 4 + 1 \\[1em]
2\lambda - 1 &= \pm \sqrt{4 + 1} \\[1em]
\lambda &= \frac{1 \pm \sqrt{5}}{2} \enspace .
\end{aligned} %]]></script>
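<p>A quick numerical check (a sketch; <code>lambda1</code> and <code>lambda2</code> match the variable names used in the R code further below) confirms that both roots satisfy $\lambda^2 - \lambda - 1 = 0$:</p>

```r
lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2

# Both should be (numerically) zero
lambda1^2 - lambda1 - 1
lambda2^2 - lambda2 - 1
```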
<p>Now that we have found both eigenvalues, we go hunting for the eigenvectors! We plug the eigenvalues into the equations from above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{1 \pm \sqrt{5}}{2} x &= y \\[1em]
\frac{1 \pm \sqrt{5}}{2} y &= x + y \enspace .
\end{aligned} %]]></script>
<p>If we set $x = 1$, then $y = \frac{1 \pm \sqrt{5}}{2}$. Thus, two eigenvectors are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
v_1 &= \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
v_2 &= \left(1, \frac{1 - \sqrt{5}}{2}\right) \enspace .
\end{aligned} %]]></script>
<p>As a sanity check to see whether this is indeed true, we check whether $Tv_1 = \lambda_1 v_1$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2}\right) \\[1em]
\lambda v_1 &= \frac{1 + \sqrt{5}}{2} \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \left(\frac{1 + \sqrt{5}}{2}\right)^2\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{1 + 2\sqrt{5} + 5}{4}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{3}{2} + \frac{\sqrt{5}}{2} \right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2} \right) \enspace ,
\end{aligned} %]]></script>
<p>which shows that the two expressions are equal. Moreover, the dot product of the two eigenvectors is zero, which means that the two eigenvectors are orthogonal and hence linearly independent (as they should be). In the next section, we will find that <a href="https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation">the same territory can be described by different maps</a>.</p>
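<p>The same sanity check can be run numerically in R (a minimal sketch; the variable names are mine):</p>

```r
A <- matrix(c(0, 1, 1, 1), nrow = 2)  # matrix of the linear map T
lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
v1 <- c(1, lambda1)
v2 <- c(1, lambda2)

all.equal(as.vector(A %*% v1), lambda1 * v1)  # TRUE: A v1 = lambda1 v1
all.equal(as.vector(A %*% v2), lambda2 * v2)  # TRUE: A v2 = lambda2 v2
sum(v1 * v2)                                  # dot product is (numerically) zero
```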
<h1 id="change-of-basis">Change of basis</h1>
<p>Now that we have found the eigenvalues and eigenvectors, we can create the matrix $D$ of the linear map $T$ which is diagonal with respect to the basis of eigenvectors:</p>
<script type="math/tex; mode=display">% <![CDATA[
D = \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace . %]]></script>
<p>We are not done yet, however. Note that $D$ is the matrix of the linear map $T$ with respect to the basis that consists of both eigenvectors $v_1$ and $v_2$, <em>not</em> with respect to the standard basis. We have changed our coordinate system — our map — as indicated by the Figure below; the black coloured vectors are the standard basis vectors while the vectors coloured in red are our new basis vectors.</p>
<!-- <div style = "float: left; padding: 10px 10px 10px 0px;"> -->
<!-- ![](/assets/img/change-of-basis.png) -->
<div style="text-align:center;">
<img src="../assets/img/change-of-basis.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" />
</div>
<!-- </div> -->
<p>To build some intuition, let’s play around with representing the vector $\omega$ from the Figure above in both the standard basis and our new eigenbasis. Any vector is a linear combination of the basis vectors. Let $a_1$ and $a_2$ be the coefficients for the standard basis such that:</p>
<script type="math/tex; mode=display">\omega = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>Now because I have drawn it earlier, I know that $a_1 = -1$ and $a_2 = 0.3$. This is the representation of $\omega$ in the standard basis. How do we represent it in our eigenbasis? Well, using the eigenbasis the vector $\omega$ is still a linear combination of the basis vectors, but with different coefficients; denote them as $b_1$ and $b_2$. We thus have:</p>
<script type="math/tex; mode=display">\omega = b_1 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} + b_2 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>If we write this in matrix form, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}\\[1em]
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>Thus, we can represent a vector $a$ with basis $S$ in our new basis $E$ by computing:</p>
<script type="math/tex; mode=display">b = E^{-1} S \, a \enspace .</script>
<p>In our eigenbasis, the vector $\omega$ has the coordinates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lambda1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">lambda2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">a</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] -0.1422291
## [2,] -0.8577709</code></pre></figure>
<p>This means we have the representation:</p>
<script type="math/tex; mode=display">\omega = -0.14 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} - 0.86 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace ,</script>
<p>which makes intuitive sense when you look at the Figure above. For another beautiful linear algebra video by 3Blue1Brown, this time about changing bases, see <a href="https://www.youtube.com/watch?v=P2LTAUO1TdA&t=598s">here</a>. In the next section, we will use what we have learned above to express the $n^{th}$ Fibonacci number in closed-form.</p>
<h1 id="closed-form-fibonacci">Closed-form Fibonacci</h1>
<p>Recall from above that our solution to finding the $n^{th}$ Fibonacci number in matrix form is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Now, we have swapped the non-diagonal matrix $A$ with the diagonal matrix $D$ by changing the basis from the standard basis to the eigenbasis. However, the vector $(0, 1)^T$ is still in the standard basis! In order to change its representation to the eigenbasis, we multiply it with $E^{-1}$, as discussed above. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Let’s use this to compute, say, the $10^{th}$ Fibonacci number (which is 55) in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">D</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lambda1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55.003636123
## [2,] -0.003636123</code></pre></figure>
<p>Ha! This didn’t quite work, did it? The first entry is roughly $F_{10}$ after rounding, but the second entry is nowhere near $F_{11}$. What did we miss? Well, this is in fact the correct answer — it is just in the wrong basis! We have to convert it from the eigenbasis back to the standard basis. To do this, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
b &= E^{-1} S \, a \\
E b &= S \, a \\
E b &= a \enspace ,
\end{aligned} %]]></script>
<p>since $S$ is the identity matrix. Thus, all we have to do is to multiply with $E$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55
## [2,] 89</code></pre></figure>
<p>which is the correct solution. To get the closed-form solution algebraically, we first invert the matrix $E$:</p>
<script type="math/tex; mode=display">% <![CDATA[
E^{-1} = -\frac{1}{\sqrt{5}} \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix} \enspace , %]]></script>
<p>and we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n -\frac{1}{\sqrt{5}} \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} -1 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n \\ \left(\frac{1 - \sqrt{5}}{2}\right)^n \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n + \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ -\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} + \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \\[1em]
&= \frac{1}{\sqrt{5}} \begin{pmatrix} \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>The closed-form expression of the $n^{th}$ Fibonacci number is thus given by:</p>
<script type="math/tex; mode=display">F_n = \frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right] \enspace .</script>
<p>We verify this in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_closed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fib_closed</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">30</span><span class="p">)))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1 1 2 3 5 8 13 21 34 55
## [11] 89 144 233 377 610 987 1597 2584 4181 6765
## [21] 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040</code></pre></figure>
<h1 id="the-golden-ratio">The golden ratio</h1>
<p>In the above section, we have derived a closed-form expression of the $n^{th}$ Fibonacci number. In this section, we return to an observation we have made at the beginning: there is structure in how the Fibonacci numbers grow. Johannes Kepler, after whom the university in my home town is named, (re)discovered that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lim_{n \rightarrow \infty} \frac{F_{n+1}}{F_n} &= \lim_{n \rightarrow \infty} \frac{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \right]}{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right]} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n} \\[1em]
&= \frac{1 + \sqrt{5}}{2} \approx 1.618 \enspace ,
\end{aligned} %]]></script>
<p>which is the <a href="https://en.wikipedia.org/wiki/Golden_ratio">golden ratio</a>. Two quantities are in the golden ratio $\phi$ if the ratio of the larger part to the smaller part equals the ratio of the sum of the parts to the larger part, i.e., for $a > b > 0$:</p>
<script type="math/tex; mode=display">\phi \equiv \frac{a}{b} = \frac{a + b}{a} \enspace .</script>
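<p>Setting $a = \phi$ and $b = 1$, this definition reduces to $\phi^2 = \phi + 1$, whose positive root is $\frac{1 + \sqrt{5}}{2}$. As a quick numerical sanity check of this, and of Kepler's limit:</p>

```r
phi <- (1 + sqrt(5)) / 2

# phi is the positive root of x^2 - x - 1 = 0
all.equal(phi^2, phi + 1)  # TRUE

# ratios of consecutive Fibonacci numbers approach phi
fib <- c(1, 1)
for (i in seq(3, 20)) fib[i] <- fib[i - 1] + fib[i - 2]
round(fib[-1] / fib[-20], 4)  # settles at 1.618 after a handful of terms
```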
<p>We have observed this empirically in the first Figure, which visualized the difference in the logs of two consecutive Fibonacci numbers; already for small $n$, this yields:</p>
<script type="math/tex; mode=display">\text{log} \, F_{n+1} - \text{log} \, F_n = \text{log} \, \frac{F_{n + 1}}{F_n} \approx 0.4812 \enspace ,</script>
<p>which exponentiated yields the golden ratio. Observe that $\left(\frac{1 - \sqrt{5}}{2}\right)^n$ goes to zero very quickly as $n$ grows so that we can compute the $n^{th}$ Fibonacci number by:</p>
<script type="math/tex; mode=display">F_n = \left \lfloor \frac{1}{\sqrt{5}} \phi^n \right \rceil \enspace ,</script>
<p>where we simply round to the nearest integer. To finally answer Fibonacci’s puzzle:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_golden</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="nf">round</span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w">
</span><span class="n">fib_golden</span><span class="p">(</span><span class="m">12</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 144</code></pre></figure>
<p>After a mere twelve months of incest, there are 144 rabbit pairs!<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
<p>There are various <a href="https://en.wikipedia.org/wiki/Generalizations_of_Fibonacci_numbers">generalizations</a> of the Fibonacci sequence. One such generalization is to allow higher orders $k$ in the sequence, which for $k = 3$ is known as the <a href="https://www.youtube.com/watch?v=fMJflV_GUpU">Tribonacci sequence</a>. Our approach for $k = 2$ can be straightforwardly generalized to account for any order $k$ (if you want to go down a rabbit hole, see for example <a href="https://math.stackexchange.com/questions/41667/fibonacci-tribonacci-and-other-similar-sequences">this</a>).</p>
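<p>As a sketch of how the recursion itself generalizes to $k = 3$ (the starting values below are one convention among several; a closed form would require the eigendecomposition of the corresponding $3 \times 3$ matrix):</p>

```r
# Tribonacci: every term is the sum of the previous three (k = 3);
# the starting values are a convention, here T_1 = T_2 = 1, T_3 = 2
tribonacci <- function(n) {
  tri <- c(1, 1, 2)
  if (n > 3) {
    for (i in seq(4, n)) tri[i] <- tri[i - 1] + tri[i - 2] + tri[i - 3]
  }
  tri[n]
}

sapply(seq(10), tribonacci)  # 1 1 2 4 7 13 24 44 81 149
```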
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have taken a detailed look at the Fibonacci sequence. In particular, we saw that it is the answer to a puzzle about procreating rabbits, and how to speed up a recursive algorithm for finding the $n^{th}$ Fibonacci number. We then used ideas from linear algebra to arrive at a closed-form expression of the $n^{th}$ Fibonacci number. Specifically, we have noted that the Fibonacci sequence is a linear recurrence relation — it can be viewed as repeatedly applying a linear map. With this insight, we observed that the matrix of the linear map is non-diagonal, which makes repeated execution tedious; diagonal matrices, on the other hand, are easy to multiply. We arrived at a diagonal matrix by changing the basis from the standard basis to the basis of eigenvectors, which led to a diagonal matrix of eigenvalues for the linear map. With this representation, the $n^{th}$ Fibonacci number is available in closed-form. In order to get it into the standard basis, we had to change basis back from the eigenbasis. We also saw how the Fibonacci numbers relate to the golden ratio $\phi$.</p>
<hr />
<p>I would like to thank Don van den Bergh, Jonas Haslbeck, and Sophia Crüwell for helpful comments on this blog post.</p>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the main reason why the Hindu-Arabic numeral system took over. The belief that it is easier to multiply and divide using Hindu-Arabic numerals is <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">incorrect</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This blog post is inspired by exercise 16 on p. 161 in <a href="http://linear.axler.net/">Linear Algebra Done Right</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I have learned that there is already (very good) ink spilled on this topic, see for example <a href="https://bosker.wordpress.com/2011/04/29/the-worst-algorithm-in-the-world/">here</a> and <a href="https://bosker.wordpress.com/2011/07/27/computing-fibonacci-numbers-using-binet%E2%80%99s-formula/">here</a>. A nice essay is also <a href="https://opinionator.blogs.nytimes.com/2012/09/24/proportion-control/?mtrref=undefined&gwh=C0500419D79A9E5B64F17ABC970C5125&gwt=pay">this</a> piece by Steve Strogatz, who, by the way, wrote a wonderful book called <a href="https://www.goodreads.com/book/show/354421.Sync">Sync</a>. He’s also been on Sean Carroll’s Mindscape podcast, listen <a href="https://www.preposterousuniverse.com/podcast/2019/04/08/episode-41-steven-strogatz-on-synchronization-networks-and-the-emergence-of-complex-behavior/">here</a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>If you forget everything that is written in this blog post, but through it were made aware of the videos by 3Blue1Brown (or <a href="https://www.numberphile.com/podcast/3blue1brown">Grant Sanderson</a>, as he is known in the real world), then I consider this blog post a success. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The downside of the closed-form solution is that it is difficult to calculate the power of the square root with high accuracy. In fact, <em>fib_golden</em> is incorrect for $n > 70$. Our <em>fib_mem</em> implementation is also incorrect, but only for $n > 93$. (I’ve compared it against Fibonacci numbers calculated from <a href="https://www.miniwebtool.com/list-of-fibonacci-numbers/?number=100">here</a>). <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderSpurious correlations and random walks2019-06-29T10:00:00+00:002019-06-29T10:00:00+00:00https://fabiandablander.com/r/Spurious-Correlation<p>The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence the correlation.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In this blog post, I discuss a more subtle case of spurious correlation, one that is not of causal but of statistical nature: <em>completely independent processes can be correlated substantially</em>.</p>
<h2 id="ar1-processes-and-random-walks">AR(1) processes and random walks</h2>
<p>Moods, stock markets, the weather: everything changes, everything is in flux. The simplest model to describe change is an autoregressive (AR) process of order one. Let $Y_t$ be a random variable where $t = 1, \ldots, T$ indexes discrete time. We write an AR(1) process as:</p>
<script type="math/tex; mode=display">Y_t = \phi \, Y_{t-1} + \epsilon_t \enspace ,</script>
<p>where $\phi$ determines the dependence on the previous observation (for $|\phi| < 1$, it equals the lag-one autocorrelation), and where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. For $\phi = 1$ the process is called a <em>random walk</em>. We can simulate from these using the following code:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">simulate_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">phi</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">y</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The following R code simulates data from three independent random walks and an AR(1) process with $\phi = 0.5$; the Figure below visualizes them.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>As we can see from the plot, the AR(1) process seems pretty well-behaved. This is in contrast to the three random walks: all of them have an initial upwards trend, after which the red line keeps on growing, while the blue line makes a downward jump. In contrast to AR(1) processes, random walks are <em>not stationary</em> since their variance is not constant across time. For some very good lecture notes on time-series analysis, see <a href="https://www.economodel.com/time-series-analysis">here</a>.</p>
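<p>The non-constant variance is easy to verify by simulation: a random walk started at zero is just the cumulative sum of its increments, so its variance at time $t$ equals $t\sigma^2$. A minimal check, with an arbitrary number of replications:</p>

```r
# a random walk is a cumulative sum of independent increments,
# so Var(Y_t) = t * sigma^2 grows linearly in t
set.seed(1)
sigma <- 0.1
walks <- replicate(5000, cumsum(rnorm(100, 0, sigma)))  # 100 x 5000 matrix
emp_var <- apply(walks, 1, var)
round(emp_var[c(10, 50, 100)], 2)  # close to 0.01 * c(10, 50, 100)
```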
<h2 id="spurious-correlations-of-random-walks">Spurious correlations of random walks</h2>
<p>If we look at the correlations of these three random walks across time points, we find that they are substantial:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="nf">round</span><span class="p">(</span><span class="n">cor</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">red</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw1</span><span class="p">,</span><span class="w"> </span><span class="n">green</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw2</span><span class="p">,</span><span class="w"> </span><span class="n">blue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw3</span><span class="p">)),</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## red green blue
## red 1.00 -0.49 -0.29
## green -0.49 1.00 0.59
## blue -0.29 0.59 1.00</code></pre></figure>
<p>I hope that this is at least a little bit of a shock. Upon reflection, however, it is clear that we are blundering: computing the correlation across time ignores the dependency between data points that is so typical of time-series data. To get a better sense of what is going on, we conduct a small simulation study.</p>
<p>In particular, we want to get an intuition of how this spurious correlation behaves with increasing sample sizes. We therefore simulate two independent random walks for sample sizes $n \in [50, 100, 200, 500, 1000, 2000]$ and compute their Pearson correlation, the test statistic, and whether $p < \alpha$, where we set $\alpha$ to an arbitrary value, say $\alpha = 0.05$. We repeat this 100 times and report the average of these quantities.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">tab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">data.frame</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">tab</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## n avg_abs_corr avg_abs_tstat percent_sig
## 1 50 0.41 3.57 0.71
## 2 100 0.46 6.58 0.85
## 3 200 0.45 8.88 0.85
## 4 500 0.37 10.63 0.86
## 5 1000 0.41 17.05 0.88
## 6 2000 0.39 23.39 0.97</code></pre></figure>
<p>We observe that the average absolute correlation is very similar across $n$, but the test statistic grows with increased $n$, which naturally results in many more false rejections of the null hypothesis of no correlation between the two random walks.</p>
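<p>This is exactly what the formula for the $t$-statistic of a correlation predicts: for a fixed correlation $r$, $t = r \sqrt{n - 2} / \sqrt{1 - r^2}$, which grows with $\sqrt{n}$. Plugging in $r = 0.40$, roughly the average absolute correlation in the table, gives values of the same order as the simulated averages:</p>

```r
# t-statistic of a Pearson correlation r based on n observations:
# t = r * sqrt(n - 2) / sqrt(1 - r^2), so t grows like sqrt(n) for fixed r
r <- 0.40
n <- c(50, 100, 200, 500, 1000, 2000)
round(r * sqrt(n - 2) / sqrt(1 - r^2), 2)  # 3.02 4.32 6.14 9.74 13.79 19.51
```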
<p>To my knowledge, Granger and Newbold (1974) were the first to point out this puzzling fact.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> They regress one random walk onto the other instead of computing the Pearson correlation. (Note that the test statistic is the same). In a regression setting, we write:</p>
<script type="math/tex; mode=display">Y = \beta_0 + \beta_1 X + \epsilon \enspace ,</script>
<p>where we assume that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ (see also a <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">previous</a> blog post). This is evidently violated when performing linear regression on two random walks, as demonstrated by the residual plot below.</p>
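<p>The code for this fit is not shown; a minimal version, assuming we regress one of the random walks from above on another (which pair was used for the plot is an assumption), would be:</p>

```r
# regress one independent random walk on another; `m` is the model
# object used in the Durbin-Watson test further below
m <- lm(rw2 ~ rw1)
plot(fitted(m), residuals(m))  # the residuals show clear structure
```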
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>Similar to the above, we can posit an AR(1) process on the residuals:</p>
<script type="math/tex; mode=display">\epsilon_t = \delta \epsilon_{t-1} + \eta_t \enspace ,</script>
<p>and test whether $\delta = 0$. We can do so using the <a href="https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic">Durbin-Watson test</a>, which yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">car</span><span class="o">::</span><span class="n">durbinWatsonTest</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## lag Autocorrelation D-W Statistic p-value
## 1 0.9357562 0.08623868 0
## Alternative hypothesis: rho != 0</code></pre></figure>
<p>This indicates substantial autocorrelation, violating our modeling assumption of independent residuals. In the next section, we look at the deeper mathematical reasons for why we get such spurious correlation. In the Post Scriptum, we relax the constraint that $\phi = 1$ and look at how spurious correlation behaves for AR(1) processes.</p>
<!-- In the next section, we will look more formally into the curious fact that two independent random walks are correlated. To understand why even with large $n$ the estimation goes awry, we have to make an excursion into asymptotia. -->
<h2 id="inconsistent-estimation">Inconsistent estimation</h2>
<p>The random walk simulations above showed that the average (absolute) correlation stays roughly constant, while the test statistic increases with $n$. This indicates a problem with our estimator for the correlation. Because it is slightly easier to study, we focus on the regression parameter $\beta_1$ instead of the Pearson correlation. <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">Recall</a> that our regression estimate is</p>
<script type="math/tex; mode=display">\hat{\beta}_1 = \frac{\sum_{t=1}^N (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^N (x_t - \bar{x})^2} \enspace ,</script>
<p>where $\bar{x}$ and $\bar{y}$ are the empirical means of the realizations $x_t$ and $y_t$ of the AR(1) processes $X_t$ and $Y_t$, respectively. The test statistic associated with the null hypothesis $\beta_1 = 0$ is</p>
<script type="math/tex; mode=display">t_{\text{statistic}} := \frac{\hat{\beta_1} - 0}{se(\hat{\beta_1})} = \frac{\hat{\beta_1}}{\hat{\sigma} / \sqrt{\sum_{t=1}^N (x_t - \bar{x})^2}} \enspace ,</script>
<p>where $\hat{\sigma}$ is the estimated standard deviation of the error. In simple linear regression, the test statistic follows a t-distribution with $n - 2$ degrees of freedom (it takes two parameters to fit a straight line). In the case of independent random walks, however, the test statistic does not have a limiting distribution; in fact, as $n \rightarrow \infty$, the distribution of $t_{\text{statistic}}$ diverges (Phillips, 1986).</p>
<p>To get an intuition for this, we plot the bootstrapped sampling distributions for $\beta_1$ and $t_{\text{statistic}}$, both for the case of regressing one independent AR(1) process onto another, and for random walk regression.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">regress_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">)))[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bootstrap_limit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">ns</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'b1'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">regress_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">coefs</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ix</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">res_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w">
</span><span class="n">res_rw</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span></code></pre></figure>
<p>The Figure below illustrates how things go wrong when regressing one independent random walk onto the other. In contrast to the estimate for the AR(1) regression, the estimate $\hat{\beta}_1$ does not concentrate around zero in the case of a random walk regression. Instead, it stays roughly within $[-0.75, 0.75]$ across all $n$. This sheds further light on the initial simulation result that the average correlation stays roughly the same. Moreover, in contrast to the AR(1) regression, for which the distribution of the test statistic does not change, the distribution of the test statistic for the random walk regression seems to diverge. This explains why the proportion of false positives increases with $n$.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
<p>Rigorous arguments for the above statements can be found in Phillips (1986) and Hamilton (1994, pp. 577).<sup id="fnref:4"><a href="#fn:4" class="footnote">3</a></sup> The explanations feature some nice asymptotic arguments which I would love to go into in detail; however, I’m currently in Santa Fe for a summer school that has a very tightly packed programme. On that note: it is <a href="https://www.santafe.edu/engage/learn/schools/sfi-complex-systems-summer-school">very, very cool</a>. You should definitely apply next year! In addition to the stimulating lectures, wonderful people, and exciting projects, the surroundings are stunning<sup id="fnref:5"><a href="#fn:5" class="footnote">4</a></sup>.</p>
<div style="text-align:center;">
<img src="../assets/img/IAIA.jpeg" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="720" height="620" />
</div>
<!-- ### Brownian Motion -->
<!-- The type of random walk we focused on in this blog post takes place in discrete, equidistant time steps.[^3] If we take the limit of $n \rightarrow \infty$, however, we move from a discrete time random walk to a continuous time Brownian motion. The gist of the argument is to make the difference $\Delta Y_t$ between time points $Y_{t+1}$ and $Y_t$ infinitesimally small. Recall that the Gaussian distribution is [closed under addition](https://fdabl.github.io/statistics/Two-Properties.html), and that -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \sum_{i=1}^t \eta_i \sim \mathcal{N}(0, t \cdot \sigma^2) \enspace \\[1em] -->
<!-- \Delta Y_t &= Y_{t+1} - Y_{t} = \sum_{i=1}^{t+1} \eta_i - \sum_{j=1}^t \eta_j = \eta_t \sim \mathcal{N}(0, \sigma^2) \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- We may cut $\eta_t$ into $n$ pieces -->
<!-- $$ -->
<!-- \eta_t = \eta_{1t} + \eta_{2t} + \ldots + \eta_{nt} \enspace , -->
<!-- $$ -->
<!-- where $\eta_{it} \sim \mathcal{N}(0, \frac{1}{n})$. Therefore, as we increase $n$, the discrete-time process is defined at a finer and finer grid. For $n \rightarrow \infty$, this results into the continuous-time Brownian motion, which we denote as $W(t)$, where $W: t \in [0, 1] \rightarrow \mathbb{R}$. -->
<!-- ## Solutions -->
<!-- Hamilton (1994, p. 562) discusses three solutions. One of them is to *difference* the data before doing the regression, i.e., -->
<!-- $$ -->
<!-- \Delta Y_t = \beta_0 + \beta_1 \Delta X_t + \epsilon_t \enspace , -->
<!-- $$ -->
<!-- where $\Delta Y_t = Y_{t+1} - Y_t$. This does in fact work: -->
<!-- ```{r} -->
<!-- broom::tidy(lm(diff(rw1) ~ diff(rw2))) -->
<!-- ``` -->
<!-- ```{r, echo = FALSE} -->
<!-- n <- 1000 -->
<!-- dat <- matrix(0, nrow = n, ncol = 2) -->
<!-- B <- cbind( -->
<!-- c(.4, .2), -->
<!-- c(-.2, .4) -->
<!-- ) -->
<!-- for (i in seq(2, n)) { -->
<!-- z <- rnorm(1) -->
<!-- # dat[i, ] <- dat[i-1, ] %*% B + rnorm(2) -->
<!-- dat[i, ] <- c(.8, .4) * z + rnorm(2) -->
<!-- } -->
<!-- ``` -->
<!-- Why? Let $\eta_t$ and $\psi_t$ denote the errors of the two processes $Y$ and $X$, respectively, distributed according to zero-mean Gaussian with variances $\sigma_y$ and $\sigma_x$. We write -->
<!-- $$ -->
<!-- \Delta Y_t = \sum_{i=1}^{t+1} \eta_i - \sum_{i=1}^{t} \eta_i = \eta_{t+1} \sim \mathcal{N}(0, \sigma_y^2) \\[1em] -->
<!-- \Delta X_t = \sum_{i=1}^{t+1} \psi_i - \sum_{i=1}^{t} \psi_i = \psi_{t+1} \sim \mathcal{N}(0, \sigma_x^2) \enspace . -->
<!-- $$ -->
<!-- Now, since the respective differences are independent of each other, their correlation will be zero. -->
<!-- However, Hamilton notes that if the time-series are really stationary ($\vert \phi \lvert < 1$), then this can result in misspecified regression. Moreover, if $Y$ and $X$ are non-stationary but *cointegrated processes*, then this also will result in misspecification. -->
<h2 id="conclusion">Conclusion</h2>
<p>“Correlation does not imply causation” is a common response to apparently spurious correlation. The idea is that we observe spurious associations because we do not have the full causal picture, as in the example of storks and human babies. In this blog post, we have seen that spurious correlation can be due to solely statistical reasons. In particular, we have seen that two independent random walks can be highly correlated. This can be diagnosed by looking at the residuals, which will <em>not</em> be independent and identically distributed, but will show a pronounced autocorrelation.</p>
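<p>The residual diagnosis mentioned above takes only a few lines. The sketch below uses <code>cumsum(rnorm(n))</code> to generate the random walks directly; the simulation function from earlier with $\phi = 1$ would work just as well:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Regress one independent random walk onto another and inspect the residuals.
set.seed(1)
y <- cumsum(rnorm(200))
x <- cumsum(rnorm(200))

res <- residuals(lm(y ~ x))
cor(res[-1], res[-length(res)])  # lag-one autocorrelation, typically near 1
acf(res)                         # plot the full autocorrelation function</code></pre></figure>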
<p>The mathematical explanation for the spurious correlation is not trivial. Using simulations, we found that the estimate of $\beta_1$ does not converge to its true value of zero when regressing one independent random walk onto another. Moreover, the test statistic diverges, meaning that with increasing sample size we are almost certain to reject the null hypothesis of no association. The spurious correlation occurs because our estimate is not consistent, which is a purely statistical explanation that does not invoke causal reasoning.</p>
<hr />
<p><em>I want to thank Toni Pichler and Andrea Bacilieri for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<!-- ### Mean and variance of AR(1) and random walk -->
<!-- To better understand the differences between AR(1) processes and random walks, we look at their respective first two moments. We write out the process for some window of length $j$, and then recursively substitute: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \phi \, Y_{t-1} + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, Y_{t-2} + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, \left(\phi \, Y_{t-3} + \epsilon_{t-2}\right) + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \vdots \\[.5em] -->
<!-- &= \phi^{j + 1} \, Y_{t - (j + 1)} + \sum_{i=t}^{t - (j + 1)} \phi^i \epsilon_{t-i} \\[.5em] -->
<!-- &= \sum_{i=0}^{t-1} \phi^i \epsilon_{t-i} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we assume that $Y_0 = 0$ is fixed. Let's compute the first two moments of this process. Exploiting linearity, we write: -->
<!-- $$ -->
<!-- \mathbb{E}[Y_t] = \mathbb{E}\left[\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \mathbb{E}\left[\phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \phi^i \mathbb{E}\left[\epsilon_{t-i}\right] = 0 \enspace . -->
<!-- $$ -->
<!-- This is also true for $\phi = 1$, i.e., a random walk. For the variance, we write: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\left(Y_t - \mathbb{E}[Y_t]\right)^2\right] -->
<!-- = \mathbb{E}\left[Y_t^2\right] -->
<!-- = \mathbb{E}\left[\left(\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right)^2\right] \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we split the quadratic into ["diagonal"](https://math.stackexchange.com/questions/125435/what-is-the-opposite-of-a-cross-term) terms and cross-terms, the latter of which have expectation zero by our assumption that the residuals are independent: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2 + \sum_{i=0}^{t - 1} \sum_{j\neq i}^{t - 1} \left(\phi^i \epsilon_{t-i}\right) \left(\phi^j \epsilon_{t-j}\right)\right] \\[.5em] -->
<!-- &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \mathbb{E}\left[\left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \left(\phi^i\right)^2 \mathbb{E}\left[\epsilon_{t-i}^2\right] \\[.5em] -->
<!-- &= \sigma^2\sum_{i=0}^{t - 1} \left(\phi^2\right)^i \\[.5em] -->
<!-- &= \sigma^2 \frac{1}{1 - \phi^2} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where the last line follows when $N \rightarrow \infty$ for $\vert\phi\vert < 1$ from a geometric series. For a random walk, however, this is not a geometric series anymore; it therefore does not converge, and the variance of a random walk does not exist. -->
<h3 id="spurious-correlation-of-ar1-processes">Spurious correlation of AR(1) processes</h3>
<p>In the main text, we have looked at how the spurious correlation behaves for a random walk. Here, we study how the spurious correlation behaves as a function of $\phi \in [0, 1]$. We focus on a sample size of $n = 200$, and adapt the simulation code from above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">200</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">phis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.02</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phis</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'phi'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">phi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">phi</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>The Figure below shows that the issue of spurious correlation gets progressively worse as the AR(1) process approaches a random walk (i.e., $\phi = 1$). Note, however, that the regression estimate remains consistent for any $\lvert \phi \rvert < 1$.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<h2 id="references">References</h2>
<ul>
<li>Granger, C. W., & Newbold, P. (<a href="http://wolfweb.unr.edu/~zal/STAT758/Granger_Newbold_1974.pdf">1974</a>). Spurious regressions in econometrics. <em>Journal of Econometrics, 2</em>(2), 111-120.</li>
<li>Hamilton, J. D. (<a href="https://press.princeton.edu/titles/5386.html">1994</a>). <em>Time Series Analysis</em>. Princeton, NJ: Princeton University Press.</li>
<li>Kuiper, R. M., & Ryan, O. (<a href="https://www.tandfonline.com/doi/full/10.1080/10705511.2018.1431046">2018</a>). Drawing conclusions from cross-lagged relationships: Re-considering the role of the time-interval. <em>Structural Equation Modeling: A Multidisciplinary Journal, 25</em>(5), 809-823.</li>
<li>Phillips, P. C. (<a href="http://dido.econ.yale.edu/korora/phillips/pubs/art/a044.pdf">1986</a>). Understanding spurious regressions in econometrics. <em>Journal of Econometrics, 33</em>(3), 311-340.</li>
<li>Matthews, R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9639.00013?casa_token=cWUllTD9P14AAAAA:PRERZz-uS2z9xX3DGt0-Qize94FuZuw-35s-2ECfUDY9Oi3J1m83cZh8EBHGlGh7fwQ2WHShOQuwB-YO">2000</a>). Storks deliver babies (p = 0.008). <em>Teaching Statistics, 22</em>(2), 36–38.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There are, of course, many <a href="https://www.tylervigen.com/spurious-correlations">more</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Thanks to Toni Pichler for drawing my attention to the fact that independent random walks are correlated, and Andrea Bacilieri for providing me with the classic references. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Moreover, one way to avoid the spurious correlation is to <em>difference</em> the time-series. For other approaches, see Hamilton (1994, pp. 561). <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>This awesome picture was made by Luther Seet. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderBayesian modeling using Stan: A case study2019-05-30T10:00:00+00:002019-05-30T10:00:00+00:00https://fabiandablander.com/r/Law-of-Practice<link rel="stylesheet" href="../highlight/styles/default.css" />
<script src="../highlight/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script>$('pre.stan code').each(function(i, block) {hljs.highlightBlock(block);});</script>
<p>Practice makes better. And faster. But what exactly is the relation between practice and reaction time? In this blog post, we will focus on two contenders: the <em>power law</em> and <em>exponential</em> function. We will implement these models in Stan and extend them to account for learning plateaus and the fact that, with increased practice, not only the mean reaction time but also its variance decreases. We will contrast two perspectives on predictive model comparison: a <em>(prior) predictive</em> perspective based on marginal likelihoods, and a <em>(posterior) predictive</em> perspective based on leave-one-out cross-validation. So let’s get started!<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<h1 id="two-models">Two models</h1>
<p>We can model the relation between reaction time (in seconds) and the number of practice trials as a power law function. Let $f: \mathbb{N} \rightarrow \mathbb{R}^+$ be a function that maps the number of trials to reaction times. We write</p>
<script type="math/tex; mode=display">f_p(N) = \alpha + \beta N^{-r} \enspace ,</script>
<p>where $\alpha$ is a lower bound (one cannot respond faster than that due to processing and motor control limits); $\beta$ is the learning gain from practice with respect to the first trial ($N = 1$); $N$ indexes the particular trial; and $r$ is the learning rate. Similarly, we can write</p>
<script type="math/tex; mode=display">f_e(N) = \alpha + \beta e^{-rN} \enspace ,</script>
<p>where the parameters have the same interpretation, except that $\beta$ is the learning gain from practice compared to no practice ($N = 0$).<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<p>What is the main difference between those two functions? <em>The exponential model assumes a constant learning rate, while the power model assumes diminishing returns</em>. To see this, let $\alpha = 0$, and ignore for the moment that $N$ is discrete. Taking the derivative of the power law model results in</p>
<script type="math/tex; mode=display">\frac{\partial f_p(N)}{\partial N} = -r\beta N^{-r - 1} = (-r/N) \, \beta N^{-r} = (-r/N) \, f_p(N) \enspace ,</script>
<p>which shows that the <em>local learning rate</em> — the change in reaction time as a function of $N$ — is $-r/N$; it depends on how many trials have been completed previously. The more one has practiced, the smaller the local learning rate $-r / N$. The exponential function, in contrast, shows no such dependency on practice:</p>
<script type="math/tex; mode=display">\frac{\partial f_e(N)}{\partial N} = -r\beta e^{-rN} = -r \, f_e(N) \enspace .</script>
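<p>These local learning rates can be verified numerically with finite differences. The parameter values below are arbitrary and serve only the check:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Finite-difference check of the local learning rates (with alpha = 0).
beta <- 2; r <- .5; N <- 10; h <- 1e-6
f_p <- function(N) beta * N^(-r)       # power law
f_e <- function(N) beta * exp(-r * N)  # exponential

(f_p(N + h) - f_p(N)) / h  # approximately (-r / N) * f_p(N)
(f_e(N + h) - f_e(N)) / h  # approximately -r * f_e(N)</code></pre></figure>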
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>The figure above visualizes two data sets, generated from either a power law (left) or an exponential model (right), as well as the maximum likelihood fit of both models to these data. It is rather difficult to tell which model performs best just by eyeballing the fit. We thus need to engage in a more formal way of comparing models.</p>
<h1 id="two-perspectives-on-prediction">Two perspectives on prediction</h1>
<p>Let’s agree that the best way to compare models is to look at predictive accuracy. Predictive accuracy with respect to what? The figure below illustrates two different answers one might give. The grey shaded surface represents unobserved data; the white island inside is the observed data; the model is denoted by $\mathcal{M}$.</p>
<p><img src="../assets/img/prediction-perspectives.png" align="center" style="padding: 10px 10px 10px 10px;" /></p>
<p>On the left, the model makes predictions <em>before</em> seeing any data by means of its <em>prior predictive distribution</em>. The predictive accuracy is then evaluated on the actually observed data. In contrast, on the right, the model makes predictions <em>after</em> seeing the data by means of its <em>posterior predictive distribution</em>. In principle, its predictive accuracy is evaluated on data one does not observe (visualized as the grey area). One can estimate this expected <em>out-of-sample</em> predictive accuracy by cross-validation procedures which partition the observed data into a training and test set. The model only sees the training set, and makes predictions for the unseen test set.</p>
<p>One key practical distinction between these two perspectives is how predictions are generated. On the left, predictions are generated from the prior. On the right, the prior first gets updated to the posterior using the observed data, and it is through the posterior that predictions are made. In the next two sections, we make this difference more precise.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
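<p>To make the distinction concrete, here is a small coin-flip sketch (a beta-binomial model, not the reaction time models of this post): prior predictive draws use the prior as-is, while posterior predictive draws use the prior updated by the observed data.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Prior and posterior predictive draws for a Beta(1, 1) coin-flip model,
# after observing y = 5 heads in n = 10 flips.
set.seed(1)
n <- 10
y <- 5

theta_prior <- rbeta(1e4, 1, 1)
theta_post <- rbeta(1e4, 1 + y, 1 + n - y)   # conjugate updating

y_prior_pred <- rbinom(1e4, n, theta_prior)  # predictions before seeing data
y_post_pred <- rbinom(1e4, n, theta_post)    # predictions after seeing data</code></pre></figure>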
<h2 id="prior-prediction-marginal-likelihoods">Prior prediction: Marginal likelihoods</h2>
<p>From this perspective, we weight the model’s prediction of the observed data given a particular parameter setting by the prior. This is accomplished by integrating the likelihood with respect to the prior, which gives the so-called <em>marginal likelihood</em> of a model $\mathcal{M}$:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .</script>
<p>It is clear that the prior matters a great deal, but that is no surprise — it is part of the model. A bad prior means a bad model. The ratio of two such marginal likelihoods is known as the Bayes factor.</p>
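<p>As a toy numerical illustration (a coin flip example with made-up data, not the models in this post): since a Beta(1, 1) prior is uniform, the marginal likelihood is simply the likelihood averaged over draws from the prior.</p>

```python
from math import comb
import numpy as np

# Toy example (made-up data): y = 5 heads in n = 10 flips.
# M0 fixes theta = 0.5; M1 assigns theta a Beta(1, 1) prior.
y, n = 5, 10

def likelihood(theta):
    # Binomial likelihood of the observed data for a given theta.
    return comb(n, y) * theta**y * (1 - theta) ** (n - y)

# Marginal likelihood of M1: Monte Carlo integration of the likelihood
# against the prior, i.e., the likelihood averaged over prior draws.
rng = np.random.default_rng(1)
theta_draws = rng.uniform(0, 1, 200_000)  # Beta(1, 1) is uniform
marg_lik_m1 = likelihood(theta_draws).mean()

# Analytically, this integral equals 1 / (n + 1) here. For the point
# model M0, the marginal likelihood is just the likelihood at theta = 0.5,
# and the Bayes factor is the ratio of the two marginal likelihoods.
bf_01 = likelihood(0.5) / marg_lik_m1
```

<p>For these data the Monte Carlo estimate closely recovers the analytic answer, $1/(n+1)$, and a Bayes factor mildly favouring the point model.</p>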
<p>If one is willing to assign priors to models, one can compute posterior model probabilities, i.e.,</p>
<script type="math/tex; mode=display">\begin{equation}
p(\mathcal{M}_k \mid y) = p(\mathcal{M}_k) \times \frac{p(y \mid \mathcal{M}_k)}{\sum_{i=1}^K p(y \mid \mathcal{M}_i) \, p(\mathcal{M}_i)} \enspace ,
\end{equation}</script>
<p>where $K$ is the number of models under consideration (see also a <a href="https://fdabl.github.io/r/Spike-and-Slab.html">previous</a> blogpost). Observe that the marginal likelihood features prominently: it is an updating factor from prior to posterior model probability. With this, one can also compute <em>Bayesian model-averaged</em> predictions:</p>
<script type="math/tex; mode=display">p(\tilde{y} \mid y) = \sum_{k=1}^K p(\tilde{y} \mid y, \mathcal{M}_k) \, \underbrace{p(\mathcal{M}_k \mid y)}_{w_k} \enspace ,</script>
<p>where $\tilde{y}$ is unseen data, and where the prediction of each model gets weighted by its posterior probability. We denote a model weight, which in this case is its posterior probability, as $w_k$.</p>
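<p>A minimal numerical sketch of this updating step, with made-up marginal likelihoods and predictive densities (none of these numbers come from the analysis below):</p>

```python
import numpy as np

# Hypothetical marginal likelihoods p(y | M_k) for K = 3 models,
# with a uniform prior over models.
marg_lik = np.array([0.080, 0.020, 0.002])
model_prior = np.array([1 / 3, 1 / 3, 1 / 3])

# Posterior model probabilities: prior times marginal likelihood,
# normalized over all models. These are the BMA weights w_k.
post_prob = model_prior * marg_lik / np.sum(model_prior * marg_lik)

# Bayesian model-averaged prediction for unseen data y_tilde: each
# model's posterior predictive density (also made up) weighted by w_k.
pred_per_model = np.array([0.31, 0.45, 0.12])
bma_pred = np.sum(post_prob * pred_per_model)
```

<p>Note how the model with the largest marginal likelihood dominates the average, even though another model assigns the unseen data a higher predictive density.</p>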
<h2 id="posterior-prediction-leave-one-out-cross-validation">Posterior prediction: Leave-one-out cross-validation</h2>
<p>Another perspective aims to estimate the expected out-of-sample prediction error, or expected log predictive density, i.e.,</p>
<script type="math/tex; mode=display">\text{elpd}^{\mathcal{M}} = \mathbb{E}_{\tilde{y}} \left(\text{log} \, p(\tilde{y} \mid y) \right) \enspace ,</script>
<p>where the expectation is taken with respect to unseen data $\tilde{y}$ (which is visualized as a grey surface with a “?” inside in the figure above).</p>
<p>Clearly, as we do not have access to unseen data, we cannot evaluate this. However, one can approximate this quantity by computing the leave-one-out prediction error in our sample:</p>
<script type="math/tex; mode=display">\widehat{\text{elpd}}^{\mathcal{M}}_{\text{loo}} = \frac{1}{n} \sum_{i=1}^n \, \text{log} \, p(y_i \mid y_{-i}) \approx \mathbb{E}_{\tilde{y}} \left(\text{log} \, p(\tilde{y} \mid y) \right) \enspace,</script>
<p>where $y_i$ is the $i^{\text{th}}$ data point, and $y_{-i}$ are all data points except $y_i$, where we have suppressed conditioning on $\mathcal{M}$ to not clutter notation (even more), and where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y_i \mid y_{-i}) &= \int_{\Theta} p(y_i, \theta \mid y_{-i}) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} p(y_i \mid y_{-i}, \theta) \, p(\theta \mid y_{-i}) \, \mathrm{d}\theta \enspace .
\end{aligned} %]]></script>
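<p>Because a conjugate toy model has $p(y_i \mid y_{-i})$ in closed form, we can sketch this computation exactly. The Python snippet below uses a normal model with known variance (unrelated to the models in this post) and computes the loo estimate by refitting $n$ times:</p>

```python
import numpy as np

# Toy conjugate model: y_i ~ N(mu, 1) with prior mu ~ N(0, 1); only mu
# is unknown. Here p(y_i | y_{-i}) is available in closed form, so
# leave-one-out can be done exactly by "refitting" on y_{-i} for each i.
def normal_logpdf(x, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def log_pred_loo(y, i, prior_mean=0.0, prior_var=1.0, sigma2=1.0):
    """log p(y_i | y_{-i}) under the conjugate normal model."""
    rest = np.delete(y, i)
    post_prec = 1 / prior_var + len(rest) / sigma2
    post_mean = (prior_mean / prior_var + rest.sum() / sigma2) / post_prec
    # Posterior predictive: normal, with the posterior variance of mu
    # added to the observation noise.
    return normal_logpdf(y[i], post_mean, sigma2 + 1 / post_prec)

rng = np.random.default_rng(0)
y = rng.normal(0.5, 1.0, size=20)
elpd_loo_hat = np.mean([log_pred_loo(y, i) for i in range(len(y))])
```

<p>In practice one rarely refits $n$ times; the <em>loo</em> package instead approximates these terms from a single posterior sample, as discussed below.</p>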
<p>Observe that this requires integrating over the <em>posterior distribution</em> of $\theta$ given all but one data point; this is in contrast to the marginal likelihood perspective, which requires integration with respect to the <em>prior distribution</em>. From this perspective, one can similarly compute model weights $w_k$</p>
<script type="math/tex; mode=display">w_k = \frac{\text{exp}\left(\widehat{\text{elpd}}^{\mathcal{M}_k}_{\text{loo}}\right)}{\sum_{i=1}^K \text{exp}\left(\widehat{\text{elpd}}^{\mathcal{M}_i}_{\text{loo}}\right)} \enspace ,</script>
<p>where $\widehat{\text{elpd}}_{\text{loo}}^{\mathcal{M}_k}$ is the loo estimate for the expected log predictive density for model $\mathcal{M}_k$. For prediction, one averages across models using these <em>Pseudo Bayesian model-averaging</em> weights (Yao et al., 2018, p. 92).</p>
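<p>Computationally, these weights are just a softmax over the loo estimates. A small sketch (the numbers are made up, though of a similar magnitude to the estimates reported below):</p>

```python
import numpy as np

# Hypothetical loo estimates of the expected log predictive density
# for K = 2 models (made-up numbers).
elpd_loo = np.array([37.1, 22.9])

# Pseudo-BMA weights: softmax over the elpd_loo estimates.
# Subtracting the maximum before exponentiating avoids overflow.
z = elpd_loo - elpd_loo.max()
pseudo_bma_weights = np.exp(z) / np.exp(z).sum()
```

<p>Because the weights are exponential in the elpd difference, even a moderate difference concentrates nearly all weight on one model.</p>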
<p>However, Yao et al. (2018) and Vehtari et al. (2019) recommend against using these Pseudo BMA weights, as they do not take the uncertainty of the loo estimates into account. Instead, they suggest using Pseudo-BMA+ weights or stacking. For details, see Yao et al. (2018).</p>
<p>For an illuminating discussion about model selection based on marginal likelihoods or leave-one-out cross-validation, see Gronau & Wagenmakers (<a href="https://link.springer.com/article/10.1007/s42113-018-0011-7">2019a</a>), Vehtari et al. (<a href="https://link.springer.com/article/10.1007/s42113-018-0020-6">2019</a>), and Gronau & Wagenmakers (<a href="https://link.springer.com/article/10.1007/s42113-018-0022-4">2019b</a>).</p>
<p>Now that we have taken a look at these two perspectives on prediction, in the next section, we will implement the power law and the exponential model in Stan.</p>
<h1 id="implementation-in-stan">Implementation in Stan</h1>
<p>As is common in psychology, people do not deterministically follow a power law or an exponential law. Instead, the law is probabilistic: given the same task, the person will respond faster or slower, never exactly as before. To allow for this, we assume that there is Gaussian noise around the function value. In particular, we assume that</p>
<script type="math/tex; mode=display">\text{RT}_N \sim \mathcal{N}\left(\alpha + \beta N^{-r}, \sigma_e^2\right) \enspace .</script>
<p>Note that reaction times are not actually normally distributed; we address this later. We make the same assumption for the exponential model. The following code implements the power law model in Stan.</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">// In the data block, we specify everything that is relevant for
// specifying the data. Note that PRIOR_ONLY is a dummy variable used later.
data {
  int<lower=1> n;
  real y[n];
  int<lower=0, upper=1> PRIOR_ONLY;
}

// In the parameters block, we specify all parameters we need.
// Although Stan implicitly adds flat priors on the (positive) real line,
// we will specify informative priors below.
parameters {
  real<lower=0> r;
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> sigma_e;
}

// In the model block, we specify our informative priors and
// the likelihood of the model, unless we want to sample only from
// the prior (i.e., if PRIOR_ONLY == 1).
model {
  target += lognormal_lpdf(alpha | 0, .5);
  target += lognormal_lpdf(beta | 1, .5);
  target += gamma_lpdf(r | 1, 3);
  target += gamma_lpdf(sigma_e | 0.5, 5);

  if (PRIOR_ONLY == 0) {
    for (trial in 1:n) {
      target += normal_lpdf(y[trial] | alpha + beta * trial^(-r), sigma_e);
    }
  }
}

// In this block, we make posterior predictions (ypred) and compute
// the log likelihood of each data point (log_lik),
// which is needed for the computation of loo later.
generated quantities {
  real ypred[n];
  real log_lik[n];
  for (trial in 1:n) {
    ypred[trial] = normal_rng(alpha + beta * trial^(-r), sigma_e);
    log_lik[trial] = normal_lpdf(y[trial] | alpha + beta * trial^(-r), sigma_e);
  }
}</code></pre></figure>
<p>From a marginal likelihood perspective, the prior is an integral part of the model; this means we have to think very carefully about it. There are several principles that can guide us (see also Lee & Vanpaemel, 2018), but one that is particularly helpful here is to look at the prior predictive distribution. Do draws from the prior predictive distribution look like what we had in mind? Below, I have visualized the mean, the standard deviation around the mean, and several draws from it for (a) flat priors on the positive real line, and (b) informed priors that I chose based on reading Evans et al. (2018). In the Stan code, you can specify flat priors by commenting out the priors we have specified in the model block.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>The figure on the left shows that flat priors make terrible predictions. The mean of the prior predictive distribution is a constant function at zero, which is not at all what we had in mind when writing down the power law model. Even worse, flat priors allow for negative reaction times, something that is clearly impossible! In contrast, the figure on the right seems reasonable. Below, I have visualized the informed priors.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
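<p>A prior predictive check of this kind is cheap to sketch. The snippet below samples from the informed priors of the Stan model, in Python rather than R/Stan (note that Stan's gamma distribution uses shape and rate, while numpy uses shape and scale, so we pass scale = 1/rate); the summaries will resemble, but not exactly reproduce, the figures above.</p>

```python
import numpy as np

# Prior predictive sketch for the power law model under the informed
# priors from the Stan code above.
rng = np.random.default_rng(2019)
n_draws, n_trials = 2000, 30
trials = np.arange(1, n_trials + 1)

alpha = rng.lognormal(0.0, 0.5, n_draws)
beta = rng.lognormal(1.0, 0.5, n_draws)
r = rng.gamma(1.0, 1 / 3.0, n_draws)        # scale = 1 / rate
sigma = rng.gamma(0.5, 1 / 5.0, n_draws)    # scale = 1 / rate

# One prior predictive data set per draw: power law mean plus noise.
mu = alpha[:, None] + beta[:, None] * trials[None, :] ** (-r[:, None])
ypred = rng.normal(mu, sigma[:, None])

# Prior predictive mean and sd per trial, as visualized above.
pp_mean = ypred.mean(axis=0)
pp_sd = ypred.std(axis=0)
```

<p>Under these priors, predicted reaction times start high and decay with practice, and negative values are rare, which is roughly what we had in mind.</p>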
<p>From a cross-validation perspective, priors do not matter <em>that</em> much; prediction is conditional on the observed data, and so the prior is transformed to a posterior before the model makes predictions. If the prior is not too misspecified, or if we have a sufficient amount of data so that the prior has only a weak influence on the posterior, a model’s posterior predictions will not markedly depend on it.</p>
<h1 id="practical-model-comparison">Practical model comparison</h1>
<p>Let’s check whether we select the correct model for the power law and the exponential data, respectively. I have generated the data above using the following code, which will make sense later in the blog post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sim_power</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sdlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="o">^</span><span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="p">)),</span><span class="w"> </span><span class="n">sdlog</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">sim_exp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="p">,</span><span class="w"> </span><span class="n">delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sdlog</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">delta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rlnorm</span><span class="p">(</span><span class="nf">length</span><span class="p">(</span><span class="n">N</span><span class="p">),</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">alpha</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="o">-</span><span class="n">r</span><span class="o">*</span><span class="n">N</span><span class="p">)),</span><span class="w"> </span><span class="n">sdlog</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">30</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">xp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sim_power</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">.7</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="n">xe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sim_exp</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">.2</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">,</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span></code></pre></figure>
<p>We use the <em>bridgesampling</em> package to estimate Bayes factors, and the <em>loo</em> package to compute loo scores and stacking weights.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'loo'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'bridgesampling'</span><span class="p">)</span><span class="w">
</span><span class="n">comp_power</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s1">'stan-compiled/compiled-power-model.RDS'</span><span class="p">)</span><span class="w">
</span><span class="n">comp_exp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">readRDS</span><span class="p">(</span><span class="s1">'stan-compiled/compiled-exponential-model.RDS'</span><span class="p">)</span><span class="w">
</span><span class="c1"># power model data</span><span class="w">
</span><span class="n">fit_pp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_power</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xp</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xp</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># exponential data</span><span class="w">
</span><span class="n">fit_pe</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_power</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xe</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ee</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">xe</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>But before we do so, let’s visualize the posterior predictions of both models for each simulated data set, respectively.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>In the figure on the left, we see that the posterior predictions for the exponential model have a larger variance compared to the predictions of the power law model. Conversely, on the right it seems that the exponential model gives the better predictions.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
<p>We first compare the two models on the power law data, using the Bayes factor and loo. With the former, we find overwhelming evidence in favour of the power law model.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ep</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 54950.14619</code></pre></figure>
<p>Note that this estimate can vary with different runs (since we were stingy in sampling from the posterior). The comparison using loo yields the following output:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_pp</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_ep</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo_pp</span><span class="p">,</span><span class="w"> </span><span class="n">loo_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -14.2 5.4</code></pre></figure>
<p>Note that the best model is always listed on top, and that the output already shows the difference in elpd scores. Following a two standard error heuristic (but see <a href="https://discourse.mc-stan.org/t/interpreting-output-from-compare-of-loo/3380/4">here</a>), since the difference in the elpd scores is more than twice its standard error, we would choose the power law model as the better model. <strong>But wait</strong> – what do these warnings mean? Let’s look at the output of the loo function for the exponential model (suppressing the warning):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_ep</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##
## Computed from 16000 by 30 log-likelihood matrix
##
## Estimate SE
## elpd_loo 22.9 5.7
## p_loo 7.4 3.9
## looic -45.8 11.5
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 28 93.3% 6801
## (0.5, 0.7] (ok) 1 3.3% 369
## (0.7, 1] (bad) 0 0.0% <NA>
## (1, Inf) (very bad) 1 3.3% 9
## See help('pareto-k-diagnostic') for details.</code></pre></figure>
<p>We find that there is one very bad Pareto $k$ value. What does this mean? There are many details about loo that you can read up on in Vehtari et al. (2018). Put briefly, to efficiently compute the loo score of a model, Vehtari et al. (2018) use <em>importance sampling</em> to compute the predictive density, which requires finding importance weights to better approximate this density. These importance weights are known to be unstable, and the authors introduce a particular stabilizing transformation which they call “Pareto smoothed importance sampling” (PSIS). The parameter $k$ is the shape parameter of this (generalized) Pareto distribution. If it is high, such as $k > 0.7$ as we find for one data point here, then this implies unstable estimates — we should probably not trust the loo estimate for this model.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup> Similarly pathological behaviour can be diagnosed by p_loo, which gives an estimate of the effective number of parameters. In this case, this is about double the number of actual parameters ($\alpha, \beta, r, \sigma_e^2$).</p>
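<p>To make this concrete, here is a rough sketch of plain (unsmoothed) importance-sampling loo for a toy normal location model in Python. This is the raw estimator whose weights PSIS stabilizes; the model and draws are made up for illustration and do not use the Stan fits.</p>

```python
import numpy as np

# Plain importance-sampling loo. Given draws theta_s from the full
# posterior, the importance ratios for point i are r_s = 1 / p(y_i | theta_s),
# and p(y_i | y_{-i}) is estimated as the harmonic mean of the pointwise
# likelihoods. PSIS stabilizes the largest ratios by replacing them with
# quantiles of a fitted generalized Pareto distribution, whose shape
# parameter is the k diagnostic reported above.
def normal_logpdf(x, mean, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - (x - mean) ** 2 / (2 * sd**2)

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=25)
# Stand-in for posterior draws of the location parameter (made up).
theta = rng.normal(y.mean(), 1 / np.sqrt(len(y)), size=4000)

log_lik = normal_logpdf(y[:, None], theta[None, :], 1.0)  # n x S
log_ratios = -log_lik                                     # log r_s

# Self-normalized estimate: log p(y_i | y_{-i}) = log S - logsumexp(log r_s).
m = log_ratios.max(axis=1)
lse = m + np.log(np.exp(log_ratios - m[:, None]).sum(axis=1))
elpd_loo_is = np.mean(np.log(log_lik.shape[1]) - lse)
```

<p>When a few ratios dominate the sum, the estimate for that data point becomes unstable, which is exactly the situation a large $k$ flags.</p>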
<p>The two figures below visualize the $k$ values for each data point. We see that loo has troubles predicting the first data point for both the exponential and power law model.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>We can also compute the stacking weights; but again, one should probably not trust these estimates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fit1</span><span class="p">,</span><span class="w"> </span><span class="n">fit2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_lik_list</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="s1">'1'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit1</span><span class="p">),</span><span class="w">
</span><span class="s1">'2'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">r_eff_list</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="w">
</span><span class="s1">'1'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">relative_eff</span><span class="p">(</span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit1</span><span class="p">,</span><span class="w"> </span><span class="n">merge_chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)),</span><span class="w">
</span><span class="s1">'2'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">relative_eff</span><span class="p">(</span><span class="n">extract_log_lik</span><span class="p">(</span><span class="n">fit2</span><span class="p">,</span><span class="w"> </span><span class="n">merge_chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">loo_model_weights</span><span class="p">(</span><span class="n">log_lik_list</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'stacking'</span><span class="p">,</span><span class="w"> </span><span class="n">r_eff_list</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">r_eff_list</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_pp</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ep</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.995
## 2 0.005</code></pre></figure>
<p>We could compute a “Stacked Pseudo Bayes factor” by taking the ratio of these two weights to see how much more weight one model is given compared to the other.<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> This yields a factor of about 200 in favour of the power law model.</p>
<p>We can make the same comparisons using the exponential data. The Bayes factor again yields overwhelming support for the model that is closer to the true model, i.e., in this case the exponential model.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup> But note that the evidence is an order of magnitude smaller than in the above comparison.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_pe</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 1024.51382</code></pre></figure>
<p>This marked decrease in evidence is also tracked by loo, which now tells us that we cannot reliably distinguish between the two models<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_pe</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -8.3 5.6</code></pre></figure>
<p>Note that while we still get warnings, this time we only have one data point with $k \in [0.7, 1]$, which is bad, but not very bad. The stacking weights show that there is not a clear winner, with a factor of only about 6 in favour of the exponential model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">fit_pe</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.847
## 2 0.153</code></pre></figure>
<p>Now that we have seen how one might compare these models in practice, in the next two sections, we will see how we can extend the model to be more realistic. To that end, I will focus only on the exponential model. In fact, the exponential model is what many researchers now prefer as the “law of practice”. In a very influential article, Newell & Rosenbloom (1981) found that a power law fit best for data from a wide variety of tasks. While they relied on averaged data, Heathcote et al. (2000) looked at participant-specific data. They found that the decrease in reaction time follows an exponential function, implying that previous results were biased due to averaging; in fact, one can show that the (arithmetic) averaging of many exponential (i.e., non-linear) functions can lead to a group-level power law when individual differences exist (Myung, Kim, & Pitt, 2000).</p>
<h1 id="extension-i-modeling-plateaus">Extension I: Modeling plateaus</h1>
<p>The above two models assume that participants <em>get it</em> from the first trial onwards, and become better immediately. However, real data often exhibits a <em>plateau</em>. We simulate such data using the following lines of code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="n">rlnorm</span><span class="p">(</span><span class="m">15</span><span class="p">,</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">),</span><span class="w">
</span><span class="n">sim_exp</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">16</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-17-1.png" title="plot of chunk unnamed-chunk-17" alt="plot of chunk unnamed-chunk-17" style="display: block; margin: auto;" /></p>
<p>How can we model this? Evans et al. (2018) suggest introducing a single parameter, $\tau$, and adjusting the model as follows:</p>
<script type="math/tex; mode=display">f_e(N) = \alpha + \beta \, \frac{\tau + 1}{\tau + e^{rN}} \enspace .</script>
<p>Observe that for $\tau = 0$, we recover the original exponential model. For $\tau \rightarrow \infty$, the function becomes a constant function $\alpha + \beta$. Thus, large values for $\tau$ (and large values for $r$, so as to model the steep drop in reaction time) allow us to model the initial plateau we find in real data.</p>
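<p>We can verify this limiting behaviour numerically. The sketch below (in Python rather than the post’s R, with the helper names <code>f_delayed</code> and <code>f_exp</code> being mine) checks that $\tau = 0$ reproduces the plain exponential model and that a huge $\tau$ yields roughly the constant $\alpha + \beta$:</p>

```python
import math

def f_delayed(N, alpha, beta, r, tau):
    """Delayed exponential: alpha + beta * (tau + 1) / (tau + exp(r * N))."""
    return alpha + beta * (tau + 1) / (tau + math.exp(r * N))

def f_exp(N, alpha, beta, r):
    """Plain exponential model: alpha + beta * exp(-r * N)."""
    return alpha + beta * math.exp(-r * N)

# tau = 0 collapses (tau + 1) / (tau + exp(r * N)) to exp(-r * N)
print(f_delayed(10, 2, 5, 0.1, 0), f_exp(10, 2, 5, 0.1))

# a very large tau pushes the ratio towards 1, giving alpha + beta = 7
print(f_delayed(10, 2, 5, 0.1, 1e9))
```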
<p>We can adjust the model in Stan easily:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">data {
  int<lower=1> n;
  real y[n];
  int<lower=0, upper=1> PRIOR_ONLY;
}
parameters {
  real<lower=0> r;
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> sigma_e;
  real<lower=0> tau;
}
model {
  real mu;
  target += cauchy_lpdf(tau | 0, 1);
  target += lognormal_lpdf(alpha | 0, .5);
  target += lognormal_lpdf(beta | 1, .5);
  target += gamma_lpdf(r | 1, 3);
  target += gamma_lpdf(sigma_e | 0.5, 5);
  if (PRIOR_ONLY == 0) {
    for (trial in 1:n) {
      mu = alpha + beta * (tau + 1) / (tau + exp(r * trial));
      target += normal_lpdf(y[trial] | mu, sigma_e);
    }
  }
}
generated quantities {
  real mu;
  real ypred[n];
  real log_lik[n];
  for (trial in 1:n) {
    mu = alpha + beta * (tau + 1) / (tau + exp(r * trial));
    ypred[trial] = normal_rng(mu, sigma_e);
    log_lik[trial] = normal_lpdf(y[trial] | mu, sigma_e);
  }
}</code></pre></figure>
<p>We have put a half-Cauchy prior on $\tau$. This is because the Cauchy distribution has very fat tails, compared to for example the Normal or the Laplace distribution; see Figure below. This is desired, because we need large $\tau$ values to accommodate plateaus.</p>
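<p>To make the fat-tails argument concrete, we can compare the closed-form tail probabilities directly; the sketch below (Python, using the standard half-Cauchy tail $1 - \frac{2}{\pi}\text{atan}(x)$ and the half-normal tail $\text{erfc}(x/\sqrt{2})$) shows how much more mass the half-Cauchy reserves for large values of $\tau$:</p>

```python
import math

def halfcauchy_tail(x):
    """P(tau > x) under a standard half-Cauchy prior: 1 - (2/pi) * atan(x)."""
    return 1 - (2 / math.pi) * math.atan(x)

def halfnormal_tail(x):
    """P(tau > x) under a standard half-normal prior: erfc(x / sqrt(2))."""
    return math.erfc(x / math.sqrt(2))

# The half-Cauchy puts over 6% of its mass above 10; the half-normal
# puts essentially none there.
print(halfcauchy_tail(10))   # roughly 0.063
print(halfnormal_tail(10))   # roughly 1.5e-23
```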
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-19-1.png" title="plot of chunk unnamed-chunk-19" alt="plot of chunk unnamed-chunk-19" style="display: block; margin: auto;" /></p>
<p>The figure below compares the prior predictive distributions of the exponential and the <em>delayed exponential</em> model. As we can see, the additional $\tau$ parameter creates larger uncertainty in the predictions, with some individual draws looking completely different from each other. <a href="https://betanalpha.github.io/assets/case_studies/fitting_the_cauchy.html">Drawing samples from a Cauchy</a>, whose mean and variance are <a href="https://en.wikipedia.org/wiki/Cauchy_distribution#Explanation_of_undefined_moments">undefined</a>, is tricky.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto;" /></p>
<p>However, if we compare the two models to each other on the plateau data set, we see that the extended model predicts the data better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit_ee</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_ee_plateau</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp_delayed</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 152.06430</code></pre></figure>
<p>This is a large Bayes factor. However, loo seems to favour the delayed exponential model much more, by about three standard errors:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -10.4 3.3</code></pre></figure>
<p>This is also reflected in an extreme difference in stacking weights, which completely discounts the standard exponential model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ee</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 1.000
## 2 0.000</code></pre></figure>
<p>Earlier, when comparing the exponential and power model on data generated from an exponential model, we found a Bayes factor in favour of the exponential model of about 1000, while the difference in loo was only about 1.5 standard errors. Here, we now find a Bayes factor in favour of the delayed exponential model of only about 150, while loo finds a difference of about three standard errors. This contrast is illuminated by visualizing the posterior predictive distribution, see below.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-24-1.png" title="plot of chunk unnamed-chunk-24" alt="plot of chunk unnamed-chunk-24" style="display: block; margin: auto;" /></p>
<p>We see that the delayed exponential model seems to “fit” the data much better.<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup> Since loo basically uses this posterior predictive distribution (except that the point being predicted is held out) to predict individual data points, it now seems clearer why it should favour the delayed exponential model so much more strongly. In contrast, the prior predictive distributions of the exponential and the delayed exponential (visualized above) are rather similar. Since the Bayes factor evaluates the probability of the data given these prior predictive distributions, the evidence in favour of the delayed exponential model is only modest.</p>
<h1 id="extension-ii-modeling-reaction-times">Extension II: Modeling reaction times</h1>
<h2 id="the-lognormal-distribution">The Lognormal distribution</h2>
<p>Reaction times are well known to be non-normally distributed. One popular distribution for them is the <em>Lognormal distribution</em>. Let $X \sim \mathcal{N}(\mu, \sigma^2)$. Then $Z = e^X$ is lognormally distributed. To arrive at its density function, we do a change of variables. Observe that</p>
<script type="math/tex; mode=display">P_z(Z \leq z) = P_z\left(e^X \leq z\right) = P_x(X \leq \text{log}\,z) = F_x(\text{log}\,z) \enspace ,</script>
<p>where $F_x$ is the cumulative distribution function of $X$. Differentiating with respect to $z$ yields the probability density function for $Z$:</p>
<script type="math/tex; mode=display">p_z(z) = \frac{\mathrm{d} P_z(Z \leq z)}{\mathrm{d} z} = \frac{\mathrm{d} F_x(\text{log}\,z)}{\mathrm{d} z} = p_x(\text{log}\,z) \left|\frac{\mathrm{d}\,\text{log}\,z}{\mathrm{d}z}\right| = p_x(\text{log}\,z) \frac{1}{z} \enspace ,</script>
<p>which spelled out is</p>
<script type="math/tex; mode=display">p_z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2} (\text{log}\,z - \mu)^2\right) \frac{1}{z} \enspace .</script>
<p>The figure below visualizes various lognormal distributions with different parameters $\mu$ and $\sigma^2$. The figure on the left shows how a change in $\sigma^2$ affects the distribution, while keeping $\mu = 1$. You can see that the “peak” of the distribution changes, indicating that the parameter $\mu$ is not independent of $\sigma^2$. In fact, while the mean and variance of a normal distribution are given by $\mu$ and $\sigma^2$, respectively, this is not so for a lognormal distribution. It seems difficult to compute the first two moments of the Lognormal distribution directly (you can try if you want!). However, there is a neat trick to compute <em>all</em> its moments basically instantaneously.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-25-1.png" title="plot of chunk unnamed-chunk-25" alt="plot of chunk unnamed-chunk-25" style="display: block; margin: auto;" /></p>
<p>To do this, observe that the <a href="https://en.wikipedia.org/wiki/Moment-generating_function"><em>moment generating function</em></a> (MGF) of a random variable $X$ is given by</p>
<script type="math/tex; mode=display">M_X(t) := \mathbb{E}\left[e^{tX}\right] \enspace .</script>
<p>Now, we’re not going to use the MGF of the Lognormal — in fact, because the integral diverges, it does not exist. Instead, we’ll use the MGF of a Normal distribution. Since $Z = e^X$, we can write the $t^{\text{th}}$ moment such that</p>
<script type="math/tex; mode=display">\mathbb{E}\left[Z^t\right] = \mathbb{E}\left[e^{tX}\right] = M_X(t) = \text{exp}\left(t\mu + \frac{1}{2}t^2 \sigma^2\right) \enspace ,</script>
<p>where the last term is the MGF of a normal distribution (see also Blitzstein & Hwang, 2014, pp. 260-261). Thus, the mean and variance of a Lognormal distribution are given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[Z] &= \text{exp}\left(\mu + \frac{1}{2} \sigma^2\right) \\[.5em]
\text{Var}[Z] &= \mathbb{E}[Z^2] - \mathbb{E}[Z]^2 \\[.5em]
&= \text{exp}\left(2\mu + 2\sigma^2\right) - \text{exp}\left(2\mu + \sigma^2\right) \\[.5em]
&= \text{exp}\left(2\mu + \sigma^2\right) \left(\text{exp}\left(\sigma^2\right) - 1 \right) \enspace .
\end{aligned} %]]></script>
<p>This dependency between mean and variance is desired. In particular, it is well established that changes in mean reaction times are accompanied by proportional changes in the standard deviation (Wagenmakers & Brown, 2007).</p>
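<p>The moment formulas above are easy to check by simulation; the sketch below (Python rather than R) draws from a lognormal by exponentiating normal draws and compares the Monte Carlo mean and variance to the closed-form expressions:</p>

```python
import math
import random

random.seed(1)
mu, sigma = 1.0, 0.5

# Z = exp(X) with X ~ Normal(mu, sigma^2)
z = [math.exp(random.gauss(mu, sigma)) for _ in range(200_000)]
mean_mc = sum(z) / len(z)
var_mc = sum((zi - mean_mc) ** 2 for zi in z) / len(z)

# closed-form moments derived via the normal MGF
mean_exact = math.exp(mu + sigma ** 2 / 2)
var_exact = math.exp(2 * mu + sigma ** 2) * (math.exp(sigma ** 2) - 1)

print(mean_mc, mean_exact)
print(var_mc, var_exact)
```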
<p><a href="https://fdabl.github.io/statistics/Two-Properties.html">In contrast</a> to the Normal distribution, the Lognormal distribution is not closed under addition. This means that if $Z$ has a Lognormal distribution, $Z + \delta$ does not necessarily have a Lognormal distribution anymore. However, we are interested in modeling <em>shifts</em> in reaction times. For example, there is some minimum reaction time faster than which participants cannot meaningfully respond. To allow for such shifts, we expand the Lognormal distribution by a parameter $\delta$ such that $Z = \delta + e^{X}$.<sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup> This leads to the following density function</p>
<script type="math/tex; mode=display">p_z(z) = \frac{1}{\sqrt{2\pi\sigma^2}} \text{exp}\left(-\frac{1}{2\sigma^2} (\text{log}\,(z - \delta) - \mu)^2\right) \frac{1}{z - \delta} \enspace .</script>
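<p>As a sanity check on this density, we can code it up directly and confirm that it integrates to one; the sketch below (Python, with the function name <code>shifted_lognormal_pdf</code> being mine) uses a crude Riemann sum:</p>

```python
import math

def shifted_lognormal_pdf(z, delta, mu, sigma):
    """Density of Z = delta + exp(X), X ~ Normal(mu, sigma^2); zero for z <= delta."""
    if z <= delta:
        return 0.0
    return (math.exp(-(math.log(z - delta) - mu) ** 2 / (2 * sigma ** 2))
            / ((z - delta) * sigma * math.sqrt(2 * math.pi)))

# crude Riemann sum over (delta, delta + 20]; essentially all the mass
# lies in this interval for these parameter values
delta, mu, sigma = 0.5, 0.0, 0.5
step = 0.001
area = step * sum(shifted_lognormal_pdf(delta + i * step, delta, mu, sigma)
                  for i in range(1, 20_000))
print(area)  # close to 1
```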
<h2 id="extending-the-model">Extending the model</h2>
<p>We extend the delayed exponential model such that</p>
<script type="math/tex; mode=display">\text{RT} \sim \text{Shifted-Lognormal}\left(\delta, \, \text{log}\left(\alpha' + \beta \frac{\tau + 1}{\tau + e^{rN}}\right), \, \sigma \right) \enspace.</script>
<p>The median of a Shifted-Lognormal distribution is given by $\delta + e^\mu$, which is why we log the main part of the model above. Note that the previous asymptote $\alpha$ is now $\delta + \alpha’$. To be on the same scale as before, we assign $\delta$ and $\alpha’$ a Lognormal distribution with medians $0.50$ and $0.50$, respectively.</p>
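<p>The claim about the median is also easy to verify by simulation; assuming for illustration $\delta = 0.5$ and $\mu = \text{log}(0.5)$, the empirical median of $\delta + e^X$ should be close to $\delta + e^\mu = 1$ (Python sketch):</p>

```python
import math
import random

random.seed(1)
delta, mu, sigma = 0.5, math.log(0.5), 0.5

# draws from the shifted lognormal: delta + exp(X), X ~ Normal(mu, sigma^2)
draws = sorted(delta + math.exp(random.gauss(mu, sigma))
               for _ in range(100_001))
median_mc = draws[50_000]

print(median_mc, delta + math.exp(mu))  # both close to 1
```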
<p>We can implement the model in Stan by writing a Shifted-Lognormal probability density function:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">// In the functions block, we define our new probability density function
functions {
  real shiftlognormal_lpdf(real z, real delta, real mu, real sigma) {
    real lprob;
    lprob = (
      -log((z - delta) * sigma * sqrt(2 * pi())) -
      (log(z - delta) - mu)^2 / (2 * sigma^2)
    );
    return lprob;
  }

  real shiftlognormal_rng(real delta, real mu, real sigma) {
    return delta + lognormal_rng(mu, sigma);
  }
}

// In the data block, we specify everything that is relevant for
// specifying the data. Note that PRIOR_ONLY is a dummy variable used later.
data {
  int<lower=1> n;
  real y[n];
  int<lower=0, upper=1> PRIOR_ONLY;
}

// In the parameters block, we specify all parameters we need.
// Although Stan implicitly adds a flat prior on the (positive) real line,
// we will specify informative priors below.
parameters {
  real<lower=0> r;
  real<lower=0> tau;
  real<lower=0> delta;
  real<lower=0> alpha;
  real<lower=0> beta;
  real<lower=0> sigma_e;
}

// In the model block, we specify our informative priors and
// the likelihood of the model, unless we want to sample only from
// the prior (i.e., if PRIOR_ONLY == 1)
model {
  real mu;
  target += cauchy_lpdf(tau | 0, 1);
  target += lognormal_lpdf(delta | log(0.50), .5);
  target += lognormal_lpdf(alpha | log(0.50), .5);
  target += lognormal_lpdf(beta | 1, .5);
  target += gamma_lpdf(r | 1, 3);
  target += gamma_lpdf(sigma_e | 0.5, 5);
  if (PRIOR_ONLY == 0) {
    for (trial in 1:n) {
      mu = log(alpha + beta * (tau + 1) / (tau + exp(r * trial)));
      target += shiftlognormal_lpdf(y[trial] | delta, mu, sigma_e);
    }
  }
}

// In this block, we make posterior predictions (ypred) and compute
// the pointwise log likelihood (log_lik), which is needed for the
// computation of loo later
generated quantities {
  real mu;
  real ypred[n];
  real log_lik[n];
  for (trial in 1:n) {
    mu = log(alpha + beta * (tau + 1) / (tau + exp(r * trial)));
    ypred[trial] = shiftlognormal_rng(delta, mu, sigma_e);
    log_lik[trial] = shiftlognormal_lpdf(y[trial] | delta, mu, sigma_e);
  }
}</code></pre></figure>
<p>Visualizing the prior predictive distribution offers only limited insight into how these two models differ:</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-27-1.png" title="plot of chunk unnamed-chunk-27" alt="plot of chunk unnamed-chunk-27" style="display: block; margin: auto;" /></p>
<p>The Bayes factor in favour of the lognormal model is quite large:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit_ee_plateau_lognormal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="w">
</span><span class="n">comp_exp_delayed_log</span><span class="p">,</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="s1">'PRIOR_ONLY'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w">
</span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">chains</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">control</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="n">adapt_delta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.9</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">bayes_factor</span><span class="p">(</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">),</span><span class="w">
</span><span class="n">bridge_sampler</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">,</span><span class="w"> </span><span class="n">silent</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">,</span><span class="w"> </span><span class="n">method</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'warp3'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Estimated Bayes factor in favor of x1 over x2: 200.86885</code></pre></figure>
<p>In contrast, loo shows barely any evidence: rather than choosing the lognormal model, we would remain undecided. (The warning is because we have one $k \in [0.5, 0.7]$.)</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">loo_compare</span><span class="p">(</span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">),</span><span class="w"> </span><span class="n">loo</span><span class="p">(</span><span class="n">fit_ee_plateau</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## elpd_diff se_diff
## model1 0.0 0.0
## model2 -4.9 3.3</code></pre></figure>
<p>Using stacking, the weights result in a factor of about 24 in favour of the lognormal model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">stacking_weights</span><span class="p">(</span><span class="n">fit_ee_plateau_lognormal</span><span class="p">,</span><span class="w"> </span><span class="n">fit_ee_plateau</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Warning: Some Pareto k diagnostic values are slightly high. See help('pareto-k-diagnostic') for details.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Method: stacking
## ------
## weight
## 1 0.961
## 2 0.039</code></pre></figure>
<p>The contrast between the Bayes factor and loo can again be illuminated by looking at the posterior predictive distribution for the respective models; see below. The lognormal model can account for the decrease in variance with increased practice. Still, the predictions of both models are rather similar, so that there is little to distinguish them using loo.</p>
<p><img src="/assets/img/2019-05-20-Law-of-Practice.Rmd/unnamed-chunk-31-1.png" title="plot of chunk unnamed-chunk-31" alt="plot of chunk unnamed-chunk-31" style="display: block; margin: auto;" /></p>
<h1 id="modeling-recap">Modeling recap</h1>
<p>Focusing on the exponential model, we have successively made our modeling more sophisticated (see also Evans et al., 2018). Barring priors, we have encountered the following models:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(1) \hspace{1em} &\text{RT} \sim \mathcal{N}\left(\alpha + \beta e^{-rN}, \sigma^2 \right) \\[1em]
(2) \hspace{1em} &\text{RT} \sim \mathcal{N}\left(\alpha + \beta \, \frac{\tau + 1}{\tau + e^{rN}}, \sigma^2 \right) \\[.5em]
(3) \hspace{1em} &\text{RT} \sim \text{Shifted-Lognormal}\left(\delta, \, \text{log}\left(\alpha' + \beta \frac{\tau + 1}{\tau + e^{rN}}\right), \, \sigma \right) \enspace .
\end{aligned} %]]></script>
<p>We went from (1) to (2) to account for learning plateaus; sometimes, participants take a while for learning to really kick in. We went from (2) to (3) to account for the fact that reaction times are decidedly nonnormal, and that there is a linear relationship between the mean and the standard deviation of reaction times.</p>
<p>We have also compared the power law to the exponential function, but so far we have looked only at simulated data. Since this blog post is already quite lengthy, we defer the treatment of real data to a future blog post.</p>
<!-- # Using real data -->
<!-- ```{r, echo = FALSE, fig.width = 10, fig.height = 5, fig.align = 'center', message = FALSE, warning = FALSE} -->
<!-- library('dplyr') -->
<!-- library('ggpubr') -->
<!-- library('ggplot2') -->
<!-- set.seed(1) -->
<!-- dat <- readRDS('stan-compiled/Evans-dat.RDS') -->
<!-- datf <- filter(dat, task == 2, id %in% sample(1:36, replace = TRUE, size = 4)) -->
<!-- ggplot(datf, aes(x = trial, y = RT)) + -->
<!-- geom_point(size = 1, alpha = .4) + -->
<!-- facet_wrap(~ id, scales = 'free', nrow = 2) + theme_pubclean() + -->
<!-- xlab('Trial') + -->
<!-- ylab('Reaction Time (sec)') -->
<!-- ``` -->
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we took a closer look at what has been dubbed the <em>law of practice</em>: the empirical fact that reaction time decreases as an exponential function of practice (rather than as a power law, as previously believed). We compared two perspectives on prediction: one based on marginal likelihoods, and one based on leave-one-out cross-validation. The latter allows the model to gorge on data, update its parameters, and then make predictions based on the <em>posterior predictive distribution</em>, while the former forces the model to make predictions using the <em>prior predictive distribution</em>. We have implemented the power law and exponential model in Stan, and extended the latter to model an initial learning plateau and to account for the empirical observation that not only mean reaction time decreases, but also its variance.</p>
<hr />
<p><em>I would like to thank Nathan Evans, Quentin Gronau, and Don van den Bergh for helpful comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Blitzstein, J. K., & Hwang, J. (<a href="https://projects.iq.harvard.edu/stat110/home">2014</a>). <em>Introduction to Probability</em>. London, UK: Chapman and Hall/CRC.</li>
<li>Bürkner, P. C., Gabry, J., & Vehtari, A. (<a href="https://arxiv.org/abs/1902.06281">2019</a>). Approximate leave-future-out cross-validation for Bayesian time series models. arXiv preprint arXiv:1902.06281.</li>
<li>Evans, N. J., Brown, S. D., Mewhort, D. J., & Heathcote, A. (<a href="https://psycnet.apa.org/record/2018-30695-005">2018</a>). Refining the law of practice. <em>Psychological Review, 125</em>(4), 592-605.</li>
<li>Fong, E., & Holmes, C. (<a href="https://arxiv.org/abs/1905.08737">2019</a>). On the marginal likelihood and cross-validation. arXiv preprint arXiv:1905.08737.</li>
<li>Gelman, A., & Shalizi, C. R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/j.2044-8317.2011.02037.x">2013</a>). Philosophy and the practice of Bayesian statistics. <em>British Journal of Mathematical and Statistical Psychology, 66</em>(1), 8-38.</li>
<li>Gronau, Q. F., & Wagenmakers, E. J. (<a href="https://link.springer.com/article/10.1007/s42113-018-0011-7">2019a</a>). Limitations of Bayesian leave-one-out cross-validation for model selection. <em>Computational Brain & Behavior, 2</em>(1), 1-11.</li>
<li>Gronau, Q. F., & Wagenmakers, E. J. (<a href="https://link.springer.com/article/10.1007/s42113-018-0022-4">2019b</a>). Rejoinder: More limitations of Bayesian leave-one-out cross-validation. <em>Computational Brain & Behavior, 2</em>(1), 35-47.</li>
<li>Heathcote, A., Brown, S., & Mewhort, D. J. K. (<a href="https://link.springer.com/article/10.3758/BF03212979">2000</a>). The power law repealed: The case for an exponential law of practice. <em>Psychonomic Bulletin & Review, 7</em>(2), 185-207.</li>
<li>Lee, M. D., & Vanpaemel, W. (<a href="https://link.springer.com/article/10.3758/s13423-017-1238-3">2018</a>). Determining informative priors for cognitive models. <em>Psychonomic Bulletin & Review, 25</em>(1), 114-127.</li>
<li>Lee, M. D. (<a href="https://osf.io/zky2v/">2018</a>). Bayesian methods in cognitive modeling. <em>Stevens’ Handbook of Experimental Psychology and Cognitive Neuroscience, 5</em>, 1-48.</li>
<li>Myung, I. J., Kim, C., & Pitt, M. A. (<a href="https://link.springer.com/article/10.3758/BF03198418">2000</a>). Toward an explanation of the power law artifact: Insights from response surface analysis. <em>Memory & Cognition, 28</em>(5), 832-840.</li>
<li>Newell, A. (1973). You can’t play 20 questions with nature and win: Projective comments on the papers of this symposium. In W. G. Chase (Ed.), <em>Visual information processing</em> (pp. 283-308). New York, US: Academic Press.</li>
<li>Newell, A., & Rosenbloom, P. S. (1981). Mechanisms of skill acquisition and the law of practice. In J.R. Anderson (Ed.), <em>Cognitive skills and their acquisition</em> (pp. 1-55). Hillsdale, NJ: Erlbaum.</li>
<li>Vehtari, A., Gelman, A., & Gabry, J. (<a href="https://arxiv.org/abs/1507.02646">2015</a>). Pareto smoothed importance sampling. arXiv preprint arXiv:1507.02646.</li>
<li>Vehtari, A., Gelman, A., & Gabry, J. (<a href="https://link.springer.com/article/10.1007/s11222-016-9696-4">2017</a>). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. <em>Statistics and Computing, 27</em>(5), 1413-1432.</li>
<li>Vehtari, A., Simpson, D. P., Yao, Y., & Gelman, A. (<a href="https://link.springer.com/article/10.1007/s42113-018-0020-6">2019</a>). Limitations of “Limitations of Bayesian leave-one-out cross-validation for model selection”. <em>Computational Brain & Behavior, 2</em>(1), 22-27.</li>
<li>Wagenmakers, E. J., & Brown, S. (<a href="http://www.ejwagenmakers.com/2007/WagenmakersBrown2007.pdf">2007</a>). On the linear relation between the mean and the standard deviation of a response time distribution. <em>Psychological Review, 114</em>(3), 830-841.</li>
<li>Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (<a href="https://projecteuclid.org/euclid.ba/1516093227">2018</a>). Using stacking to average Bayesian predictive distributions (with discussion). <em>Bayesian Analysis, 13</em>(3), 917-1003.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<!-- There are at least two ways to see this. For simplicity, let $\alpha = 0$. Taking logarithms on both sides, the two equations become: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{log} \, f_e(N) &= \text{log} \, \beta - r N \hspace{3em} (1) \\[.5em] -->
<!-- \text{log} \, f_p(N) &= \text{log} \, \beta - r \, \text{log} \, N \hspace{1.4em} (2) -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- We see that the power law model (2) is a linear function in $r$ (on the log scale). The exponential model (1), in contrast, is a non-linear function in $r$ (on the log scale), converging to the asymptote much quicker than the power law model. Moreover, the exponential model allows only for a constant difference between the reaction times on different trials (on the log scale); the difference in log reaction time on one trial and the trial right after is $r$. In contrast, the power model (2) allows that the difference scales in $N$. In the beginning of the practice trials, the difference between trials is comparatively large; for example, $\text{log}(1) - \text{log}(2) = -0.69$. With increasing $N$, the differences between the log reaction times gets smaller and smaller; for example, $\text{log}(10) - \text{log}(11) = -0.10$. Therefore, learning slows down with increased practice. -->
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This blog post is heavily based on the modeling work of Evans et al. (2018). If you want to know more, I encourage you to check out the paper! <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You may also rewrite the power law as an exponential function, i.e., $f_p(N) = \alpha + \beta e^{-r \, \text{log} N}$, to see their algebraic difference more clearly. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>As it turns out, there is a connection between the two: the “[…] the marginal likelihood is formally equivalent to exhaustive leave-$p$-out cross-validation averaged over all values of $p$ and all held-out test sets when using the log posterior predictive probability as a scoring rule.” (Fong & Holmes, 2019). <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Note that, if a simpler model is not ruled out by the data, then it might be reasonable to obtain evidence in favour of it, even though a more complex model has generated the data. For $n \rightarrow \infty$, we would probably still want to have consistent model selection; that is, select the model which actually generated the data, assuming that it is in the set of models we are considering. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>For data points for which $k > 0.7$, Vehtari et al. (2017) suggest computing the predictive density directly, instead of relying on Pareto smoothed importance sampling. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>I am not sure whether Vehtari would like this “Stacked Pseudo Bayes factor”. <a href="https://discourse.mc-stan.org/t/interpreting-elpd-diff-loo-package/1628/2">Here</a>, he seems to suggest that one should choose a model when its stacking weight is 1. Otherwise, I suppose his philosophy is aligned with that of Gelman & Shalizi (2013), i.e., expand the model so that it can account for whatever the other model does better. Update: <a href="https://twitter.com/avehtari/status/1134121009282539521">Here’s</a> Vehtari himself. Three cheers for the web! <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Although there is a temporal order in the data, as trial 21 cannot, for example, come before trial 12, we are not particularly interested in predicting the future using past observations. Therefore, using vanilla loo seems adequate. If one is interested in predicting future observations, one could use approximate <em>leave-future-out</em> cross-validation, see Bürkner, Gabry, & Vehtari (<a href="https://arxiv.org/abs/1902.06281">2019</a>). <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Michael Lee argues that this idea of trying to recover the “true” model is misguided. Bayes’ rule gives the optimal means to select among models. Given a particular data set, if the Bayes factor favours a model that did, in fact, not generate the data, it is still correct to select this model. After all, it predicts the data better than the model that generated it. Lee (2018) distinguishes between <em>inference</em> (saying what follows from a model and data) and <em>inversion</em> (recovering truth). <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>As Lee (2018, p. 27) points out, the word “fit” is unhelpful in a Bayesian context. There are no degrees of freedom; once the model is specified, inference follows directly from probability theory. So “updating a model” is better terminology than “fitting a model”. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>One can also do this without writing a custom function by using the standard lognormal function, and then just subtracting the shift parameter $\delta$ in the function call. <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Two perspectives on regularization2019-04-15T12:00:00+00:002019-04-15T12:00:00+00:00https://fabiandablander.com/r/Regularization<p>Regularization is the process of adding information to an estimation problem so as to avoid extreme estimates. Put differently, it safeguards against foolishness. Both Bayesian and frequentist methods can incorporate prior information which leads to regularized estimates, but they do so in different ways. In this blog post, I illustrate these two different perspectives on regularization on the simplest example possible — estimating the bias of a coin.</p>
<!-- When I am too lazy to cook porridge, I usually buy bread from the local bakery and have bread with (vegan) butter for breakfast. Assume I am unusually clumsy, and my freshly spread slice of bread slips out of my hand, onto the floor. Did the butter land on the floor? Yes! How can we model this process? -->
<h2 id="modeling-coin-flips">Modeling coin flips</h2>
<p>Let’s say that we are interested in estimating the bias of a coin, which we take to be the probability of the coin showing heads.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In this section, we will derive the Binomial likelihood — the statistical model that we will use for modeling coin flips. Let $X \in \{0, 1\}$ be a discrete random variable with realization $X = x$. Flipping the coin once, let the outcome $x = 0$ correspond to tails and $x = 1$ to heads. We use the Bernoulli likelihood to connect the data to the latent parameter $\theta$, which we take to be the bias of the coin:</p>
<script type="math/tex; mode=display">p(x \mid \theta) = \theta^x (1 - \theta)^{1 - x} \enspace .</script>
<p>There is no point in estimating the bias by flipping the coin only once. We are therefore interested in a model that can account for $n$ coin flips. If we are willing to assume that the individual coin flips are <em>independent and identically distributed</em> conditional on $\theta$, we can obtain the joint probability of all outcomes by multiplying the probabilities of the individual outcomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(x_1, \ldots, x_n \mid \theta) &= \prod_{i=1}^n p(x_i \mid \theta) \\[.5em]
&= \prod_{i=1}^n \theta^{x_i} (1 - \theta)^{1 - x_i} \\[.5em]
&= \theta^{\sum_{i=1}^n x_i} (1 - \theta)^{ \sum_{i=1}^n 1 - x_i} \enspace .
\end{aligned} %]]></script>
<p>For the purposes of estimating the coin’s bias, we actually do not care about the order in which the coins come up heads or tails; we only care about how frequently the coin shows heads or tails out of $n$ throws. Thus, we do not model the individual outcomes $X_i$, but instead model their sum $Y = \sum_{i=1}^n X_i$. We write:</p>
<script type="math/tex; mode=display">p(y \mid \theta) = \theta^{y} (1 - \theta)^{n - y} \enspace ,</script>
<p>where we suppress conditioning on $n$ to not clutter notation. Note that our model is not complete — we need to account for the fact that there are several ways to get $y$ heads out of $n$ throws. For example, we can get $y = 2$ with $n = 3$ in three different ways: $(1, 1, 0)$, $(0, 1, 1)$, and $(1, 0, 1)$. If we were to use the model above, we would underestimate the probability of observing two heads out of three coin tosses by a factor of three.</p>
<p>In general, there are $n!$ possible ways in which we can order the outcomes. To see this, think of $n$ containers. The first outcome can go in any container, the second one in any container but the container which houses the first outcome, and so on, which yields:</p>
<script type="math/tex; mode=display">n \times (n - 1) \times (n - 2) \ldots \times 1 = n! \enspace .</script>
<p>However, we do not care about the order of the $(n - y)$ tails among themselves, so we divide by the $(n - y)!$ ways of permuting them. Similarly, once we have picked out the $y$ heads, we do not care about <em>their</em> order either; thus we also divide by $y!$ permutations. Therefore, for any particular sequence of coin flips of length $n$, there are</p>
<script type="math/tex; mode=display">\frac{n!}{y!(n - y)!} = {n \choose y}</script>
<p>ways to get $y$ heads out of $n$ throws. The funny looking symbol on the right is the <em>Binomial coefficient</em>. The probability of the data is therefore given by the Binomial likelihood:</p>
<script type="math/tex; mode=display">p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,</script>
<p>which just adds the term ${n \choose y}$ to the equation we had above after introducing $Y$. For the example of observing $y = 2$ heads out of $n = 3$ coin flips, the Binomial coefficient is ${3 \choose 2} = 3$, which accounts for the fact that there are three possible ways to get two heads out of three throws.</p>
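<p>As a quick sanity check, the likelihood assembled by hand above should agree with R’s built-in <code>dbinom</code>. A minimal sketch (the bias value is arbitrary, for illustration only):</p>

```r
# Binomial likelihood assembled from its pieces: the number of orderings
# (the binomial coefficient) times the probability of any one ordering.
y <- 2
n <- 3
theta <- 0.6  # an arbitrary bias, for illustration

n_orderings <- factorial(n) / (factorial(y) * factorial(n - y))
manual <- n_orderings * theta^y * (1 - theta)^(n - y)

n_orderings                             # 3: (1, 1, 0), (0, 1, 1), (1, 0, 1)
all.equal(manual, dbinom(y, n, theta))  # TRUE
```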
<h2 id="the-data">The data</h2>
<p>Assume we flip the coin three times, $n = 3$, and observe three heads, $y = 3$. How can we estimate the bias of the coin? In the next sections, we will use the Binomial likelihood derived above and discuss three different ways of estimating the coin’s bias: maximum likelihood estimation, Bayesian estimation, and penalized maximum likelihood estimation.</p>
<h2 id="classical-estimation">Classical estimation</h2>
<p>Within the frequentist paradigm, the method of maximum likelihood is arguably the most popular method for parameter estimation: choose as an estimate for $\theta$ the value which maximizes the likelihood of the data.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> To get a feeling for how the likelihood of the data differs across values of $\theta$, let’s pick two values, $\theta_1 = .5$ and $\theta_2 = 1$, and compute the likelihood of observing three heads out of three coin flips:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y = 3 \mid \theta = .5) &= {3 \choose 3} .5^3 (1 - .5)^{3 - 3} = 0.125 \\[.5em]
p(y = 3 \mid \theta = 1) &= {3 \choose 3} 1^3 (1 - 1)^{3 - 3} = 1 \enspace .
\end{aligned} %]]></script>
<p>We therefore conclude that the data are more likely for a coin that has bias $\theta_1 = 1$ than for a coin that has bias $\theta_2 = 0.5$. But is it the <em>most</em> likely value? To compare all possible values for $\theta$ visually, we plot the likelihood as a function of $\theta$ below. The left figure shows that, indeed, $\theta = 1$ maximizes the likelihood for the data. The right figure shows the likelihood function for $y = 15$ heads out of $n = 20$ coin flips. Note that, in contrast to probabilities, which need to sum to one, likelihoods do not have a natural scale.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
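<p>The two likelihood evaluations above are easy to reproduce with <code>dbinom</code>:</p>

```r
# Likelihood of y = 3 heads out of n = 3 flips under two candidate biases
dbinom(3, size = 3, prob = 0.5)  # 0.125
dbinom(3, size = 3, prob = 1.0)  # 1
```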
<p>Do these two examples allow us to derive a general principle for how to estimate the bias of a coin? Let $\hat{\theta}$ denote an estimate of the population parameter $\theta$. The two figures above suggest that $\hat{\theta} = \frac{y}{n}$ is the maximum likelihood estimate for an arbitrary data set $d = (y, n)$ … and it is! To arrive at this mathematically, we can find the maximum of this likelihood function by taking the derivative with respect to $\theta$, and setting it to zero (see also a <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">previous</a> post). In other words, we solve for the value of $\theta$ at which the derivative vanishes; and since the Binomial likelihood is unimodal, this maximum will be unique. Note that the value of $\theta$ at which the likelihood function attains its maximum does not change when we take logs; because taking logs greatly simplifies the mathematics, we do so:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= \frac{\partial}{\partial \theta}\text{log}\left({n \choose y} \theta^y (1 - \theta)^{n - y}\right) \\[.5em]
0 &= \frac{\partial}{\partial \theta}\left(\text{log}{n \choose y} + y \, \text{log}\theta + (n - y) \, \text{log}(1 - \theta)\right) \\[.5em]
0 &= \frac{y}{\theta} - \frac{n - y}{1 - \theta}\\[.5em]
\frac{n - y}{1 - \theta} &= \frac{y}{\theta} \\[.5em]
\theta (n - y) &= (1 - \theta) y \\[.5em]
\theta n - \theta y &= y - \theta y \\[.5em]
\theta n &= y \\[.5em]
\theta &= \frac{y}{n} \enspace ,
\end{aligned} %]]></script>
<p>which shows that indeed $\frac{y}{n}$ is the maximum likelihood estimate.</p>
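<p>The analytical result is easy to double-check numerically; the sketch below maximizes the log likelihood with <code>optimize</code> for the $y = 15$, $n = 20$ example from the figure above:</p>

```r
# Numerical maximum likelihood estimate for y = 15 heads out of n = 20 flips
y <- 15
n <- 20

loglik <- function(theta) dbinom(y, n, theta, log = TRUE)
mle <- optimize(loglik, interval = c(0, 1), maximum = TRUE)$maximum

mle  # numerically close to y / n = 0.75
```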
<h2 id="bayesian-estimation">Bayesian estimation</h2>
<p>Bayesians assign priors to parameters in addition to the likelihood, which takes a central role in all statistical paradigms. For this Binomial problem, we assign $\theta$ a Beta prior:</p>
<script type="math/tex; mode=display">p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .</script>
<p>As we will see below, this prior allows easy Bayesian updating while being sufficiently flexible in incorporating prior information. The figure below shows different Beta distributions, formalizing our prior belief about values of $\theta$.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>The figure in the top left corner assigns uniform prior plausibility to all values of $\theta$; the figures to its right incorporate a slight bias towards the extreme values $\theta = 1$ and $\theta = 0$. With increasing $a$ and $b$, the prior becomes more biased towards $\theta = 0.5$; with decreasing $a$ and $b$, the prior becomes biased against $\theta = 0.5$.</p>
<p>As shown in a <a href="https://fdabl.github.io/r/Spike-and-Slab.html">previous</a> blog post, the Beta distribution is <em>conjugate</em> to the Binomial likelihood, which means that the posterior distribution of $\theta$ is again a Beta distribution:</p>
<script type="math/tex; mode=display">p(\theta \mid y) = \frac{1}{\text{B}(a', b')} \theta^{a' - 1} (1 - \theta)^{b' - 1} \enspace ,</script>
<p>where $a’ = a + y$ and $b’ = b + n - y$. Under this conjugate setup, the parameters of the prior can be understood as prior data; for example, if we choose prior parameters $a = b = 1$, then we assume that we have seen one heads and one tails prior to data collection. The figure below shows two examples of such Bayesian updating processes.</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>In both cases, we observe $y = 3$ heads out of $n = 3$ coin flips. On the left, we assign $\theta$ a uniform prior. The resulting posterior distribution is proportional to the likelihood (which we have rescaled to fit nicely in the graph) and thus does not appear as a separate line. After we have seen the data, we can compute the posterior mode as our estimate for the most likely value of $\theta$. Observe that the posterior mode is equivalent to the maximum likelihood estimate:</p>
<script type="math/tex; mode=display">\hat{\theta}_{\text{PM}} = \frac{a' - 1}{a' + b' - 2} = \frac{1 + y - 1}{1 + y + 1 + n - y - 2} = \frac{y}{n} = \hat{\theta}_{\text{MLE}} \enspace .</script>
<p>This is in fact the case for all statistical estimation problems where we assign a uniform prior to the (possibly high-dimensional) parameter vector $\theta$. To prove this, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\hat{\theta}_{\text{PM}} &= \underset{\theta}{\text{argmax}} \, \frac{p(y \mid \theta) \, p(\theta)}{p(y)} \\[.5em]
&= \underset{\theta}{\text{argmax}} \, p(y \mid \theta) \\[.5em]
&= \hat{\theta}_{\text{MLE}} \enspace ,
\end{aligned} %]]></script>
<p>since we can drop the normalizing constant $p(y)$, because it does not depend on $\theta$, and $p(\theta)$, because it is a constant assigning all values of $\theta$ equal probability.</p>
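<p>The conjugacy claim itself can be verified with a brute-force grid computation: multiplying likelihood and prior and renormalizing should recover the $\text{Beta}(a + y, b + n - y)$ density. A sketch for the uniform prior case:</p>

```r
# Grid check of conjugacy: likelihood times prior, renormalized, should
# match the closed-form Beta(a + y, b + n - y) posterior density.
a <- 1; b <- 1  # uniform prior
y <- 3; n <- 3  # three heads out of three flips

theta <- seq(0.001, 0.999, 0.001)
unnorm <- dbinom(y, n, theta) * dbeta(theta, a, b)
grid_posterior <- unnorm / (sum(unnorm) * 0.001)  # renormalize to a density

conjugate_posterior <- dbeta(theta, a + y, b + n - y)
max(abs(grid_posterior - conjugate_posterior))  # small discretization error
```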
<p>Using a Beta prior with $a = b = 2$, as shown on the right side of the figure above, we see that the posterior is not proportional to the likelihood anymore. This in turn means that the mode of the posterior distribution no longer corresponds to the maximum likelihood estimate. In this case, the posterior mode is:</p>
<script type="math/tex; mode=display">\hat{\theta}_{\text{PM}} = \frac{5 - 1}{5 + 2 - 2} = 0.8 \enspace .</script>
<p>In contrast to earlier, this estimate is <em>shrunk</em> towards $\theta = 0.5$. This came about because we have used prior information that stated that $\theta = 0.5$ is more likely than the other values (see figure with $a = b = 2$ above). We were therefore less swayed by the somewhat unlikely situation (under no bias $\theta = 0.5$) of observing three heads out of three throws. It should thus not come as a surprise that Bayesian priors <em>can</em> act as regularizing devices. However, this requires careful application, especially in small sample size settings.</p>
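<p>The posterior mode formula from above makes this shrinkage easy to verify in R (a small sketch; the function name is arbitrary):</p>

```r
# Posterior mode (a' - 1) / (a' + b' - 2) of the Beta(a + y, b + n - y)
# posterior, written directly in terms of the data and the prior parameters
posterior_mode <- function(y, n, a, b) {
  (a + y - 1) / (a + b + n - 2)
}

posterior_mode(y = 3, n = 3, a = 1, b = 1)  # 1: the maximum likelihood estimate
posterior_mode(y = 3, n = 3, a = 2, b = 2)  # 0.8: shrunk towards 0.5
```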
<p>In a <em>Post Scriptum</em> to this blog post, I similarly show how the posterior mean, which is arguably a more natural point estimate as it better accounts for the uncertainty about $\theta$ than the posterior mode, can be viewed as a regularized estimate, too.</p>
<h2 id="penalized-estimation">Penalized estimation</h2>
<p>Bayesians are not the only ones who can add prior information to an estimation problem. Within the frequentist framework, penalized estimation methods add a penalty term to the log likelihood function, and then find the parameter value which maximizes this <em>penalized log likelihood</em>. We can implement such a method by optimizing an extended log likelihood:</p>
<script type="math/tex; mode=display">y \, \text{log}\,\theta + (n - y) \, \text{log} \, (1 - \theta) - \underbrace{\lambda (\theta - 0.5)^2}_{\text{Penalty Term}} \enspace ,</script>
<p>where we penalize values that are far from the parameter value which indicates no bias, $\theta = 0.5$. The larger $\lambda$, the more strongly values of $\theta \neq 0.5$ are penalized. In addition to picking $\lambda$, the particular form of the penalty term is also important. Similar to assigning $\theta$ a prior distribution, although possibly less straightforward and less flexible, choosing the penalty term means incorporating information about the problem in addition to specifying a likelihood function. Above, we have used the <em>squared distance</em> from $\theta = 0.5$ as a penalty. We call this the $\mathcal{L}_2$-norm penalty<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>, but the $\mathcal{L}_1$-norm, which takes the <em>absolute distance</em>, is an equally interesting choice:</p>
<script type="math/tex; mode=display">y \, \text{log}\,\theta + (n - y) \, \text{log} \, (1 - \theta) - \lambda |\theta - 0.5| \enspace .</script>
<p>As we will see below, these penalties have very different effects.</p>
<p>The penalized likelihood depends not only on $\theta$, but also on $\lambda$. The code below evaluates the penalized log likelihood function given values for these two parameters. Note that we drop the normalizing constant ${n \choose y}$ as it depends neither on $\theta$ nor on $\lambda$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0.001</span><span class="p">,</span><span class="w"> </span><span class="m">.999</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">),</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">lambda</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">theta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">reg</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">get_penalized_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">Vectorize</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span></code></pre></figure>
<p>With only three data points it is futile to try to estimate $\lambda$ using, for example, cross-validation; however, that is also not the goal of this blog post. Instead, to gain further intuition, we simply try out a number of values for $\lambda$ using the code below and see how they influence our estimate of $\theta$. Because the parameter space has only one dimension, we can easily find the value of $\theta$ which maximizes the penalized likelihood even without wearing our calculus hat. Specifically, given a particular value for $\lambda$, we evaluate the penalized likelihood function for a range of values of $\theta$ between zero and one and pick the value that maximizes it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">estimate_path</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lambda_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">.01</span><span class="p">)</span><span class="w">
</span><span class="n">theta_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">theta_best</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="nf">seq_along</span><span class="p">(</span><span class="n">lambda_seq</span><span class="p">),</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">i</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">penalized_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_penalized_likelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta_seq</span><span class="p">,</span><span class="w"> </span><span class="n">lambda_seq</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">reg</span><span class="p">)</span><span class="w">
</span><span class="n">theta_seq</span><span class="p">[</span><span class="n">which.max</span><span class="p">(</span><span class="n">penalized_likelihood</span><span class="p">)]</span><span class="w">
</span><span class="p">})</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">lambda_seq</span><span class="p">,</span><span class="w"> </span><span class="n">theta_best</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Sticking with the observation of three heads ($y = 3$) out of three throws ($n = 3$), the figure below plots the best-fitting value of $\theta$ for a range of values of $\lambda$. Observe that the $\mathcal{L}_1$-norm penalty shrinks the estimate more quickly, snapping it exactly to $\theta = 0.5$ at $\lambda = 6$, while the $\mathcal{L}_2$-norm penalty gradually (and rather slowly) shrinks the parameter towards $\theta = 0.5$ with increasing $\lambda$. Why is this so?</p>
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>First, note that because $\theta \in [0, 1]$, the squared distance will always be smaller than the absolute distance, which explains the slower shrinkage. Second, the fact that the $\mathcal{L}_1$-norm penalty can shrink <em>exactly</em> to $\theta = 0.5$ is due to the kink (the point of non-differentiability) of the absolute value function at $\theta = 0.5$. The figure below provides some intuition. In particular, the left panel shows the $\mathcal{L}_1$-norm penalized likelihood function for a select number of $\lambda$’s. We see that for $\lambda < 3$, the value $\theta = 1$ performs best. With $\lambda \in [3, 6]$, values of $\theta \in [0.5, 1]$ attain a higher penalized likelihood than the extreme estimate $\theta = 1$. For $\lambda \geq 6$, the ‘no bias’ value $\theta = 0.5$ maximizes the penalized likelihood. Due to the kink in the penalty, the shrinkage is exact. The $\mathcal{L}_2$-norm penalty, on the other hand, shrinks less strongly, and never exactly to $\theta = 0.5$, except of course for $\lambda \rightarrow \infty$. We can see this in the right panel below, where the penalized likelihood function is merely shifted to the left with increasing $\lambda$; this is in contrast to the $\mathcal{L}_1$-norm penalized likelihood on the left, for which the value $\theta = 0.5$ at the kink takes a special place.</p>
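<p>The threshold at $\lambda = 6$ can also be verified by hand: for $\theta > 0.5$ and our data, the $\mathcal{L}_1$-penalized log likelihood is $3 \, \text{log} \, \theta - \lambda (\theta - 0.5)$, whose derivative $3 / \theta - \lambda$ vanishes at $\theta = 3 / \lambda$, and this interior maximum reaches $0.5$ exactly when $\lambda = 6$. The self-contained grid search below (re-deriving the penalized likelihood from scratch rather than reusing the functions above) confirms this:</p>

```r
# L1-penalized log likelihood for y = 3 heads out of n = 3 throws;
# the (n - y) * log(1 - theta) term vanishes since n - y = 0
pen_ll <- function(theta, lambda) {
  3 * log(theta) - lambda * abs(theta - 0.5)
}

theta <- seq(0.001, 0.999, 0.001)
best  <- function(lambda) theta[which.max(pen_ll(theta, lambda))]

best(2)  # 0.999: the grid edge, i.e. the unpenalized MLE theta = 1
best(4)  # 0.75: interior maximum at theta = 3 / lambda
best(8)  # 0.5: exact shrinkage, past the threshold lambda = 6
```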
<p><img src="/assets/img/2019-04-14-Regularization.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>You can play around with the code below to get an intuition for how different values of $\lambda$ influence the penalized likelihood function.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'latex2exp'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Penalized Likelihood'</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">nl</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">lambdas</span><span class="p">)</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">.001</span><span class="p">,</span><span class="w"> </span><span class="m">.999</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nl</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">theta</span><span class="p">))</span><span class="w">
</span><span class="n">normalize</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">max</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">min</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nl</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">log_likelihood</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">get_penalized_likelihood</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">reg</span><span class="p">)</span><span class="w">
</span><span class="n">likelihood</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">normalize</span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">log_likelihood</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">likelihood</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylab</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$\\theta$'</span><span class="p">),</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">
</span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nl</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="n">likelihood</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">],</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ylab</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">i</span><span class="p">,</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.2</span><span class="p">))</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">info</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">l</span><span class="p">)</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="n">sprintf</span><span class="p">(</span><span class="s1">'$\\lambda = %.2f$'</span><span class="p">,</span><span class="w"> </span><span class="n">l</span><span class="p">)))</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="s1">'topleft'</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">info</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nl</span><span class="p">),</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w">
</span><span class="n">box.lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'skyblue'</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">lambdas</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">,</span><span class="w"> </span><span class="m">8</span><span class="p">)</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$L_1$ Penalized Likelihood'</span><span class="p">))</span><span class="w">
</span><span class="n">plot_pen_llh</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="n">lambdas</span><span class="p">,</span><span class="w"> </span><span class="n">reg</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">TeX</span><span class="p">(</span><span class="s1">'$L_2$ Penalized Likelihood'</span><span class="p">))</span></code></pre></figure>
<p>In practice, one would reparameterize this model as a logistic regression, and use cross-validation to estimate the best value for $\lambda$; see the <em>Post Scriptum</em> for a sketch of this approach.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen two perspectives on regularization, illustrated on a very simple example: estimating the bias of a coin. We first derived the Binomial likelihood, connecting the data to a parameter $\theta$ which we took to be the bias of the coin, as well as the maximum likelihood estimate. Observing three heads out of three coin flips, we became slightly uncomfortable with the (extreme) estimate $\hat{\theta} = 1$. We have seen how, from a Bayesian perspective, one can add prior information to this estimation problem, and how this led to an estimate that was <em>shrunk</em> towards $\theta = 0.5$. Within the frequentist framework, one can add information by augmenting the likelihood function with a penalty term, where the type of information we want to incorporate determines the particular penalty term. In this blog post, we have focused on the two most commonly used penalty terms: the $\mathcal{L}_1$-norm penalty, which can shrink parameters exactly to a particular value; and the $\mathcal{L}_2$-norm penalty, which provides continuous shrinkage. A future blog post might look into linear regression models, where regularization methods abound, and study how, for example, the popular Lasso can be recast in Bayesian terms.</p>
<hr />
<p><em>I would like to thank Jonas Haslbeck, Don van den Bergh, and Sophia Crüwell for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="posterior-mean">Posterior mean</h3>
<p>You may argue that one should use the mean instead of the mode as a posterior summary measure. If one does this, then there is already some shrinkage for the case of uniform priors. The mean of the posterior distribution is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[\theta] &= \frac{a'}{a' + b'} \\[.5em]
&= \frac{a + y}{a + y + b + n - y} \\[.5em]
&= \frac{a + y}{a + b + n} \enspace .
\end{aligned} %]]></script>
<p>As so often in mathematics, we can rewrite this in a more complicated manner to gain insight into how Bayesian priors shrink estimates:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}[\theta] &= \frac{a}{a + b + n} + \frac{y}{a + b + n} \\[.5em]
&= \frac{a}{a + b + n} \left(\frac{a + b}{a + b}\right) + \frac{y}{a + b + n} \left( \frac{n}{n} \right) \\[.5em]
&= \frac{a + b}{a + b + n} \underbrace{\left(\frac{a}{a + b}\right)}_{\text{Prior mean}} + \frac{n}{a + b + n} \underbrace{\left( \frac{y}{n} \right)}_{\text{MLE}} \enspace .
\end{aligned} %]]></script>
<p>This decomposition shows that the posterior mean is a weighted combination of the prior mean and the maximum likelihood estimate. If we think of $a + b$ as the number of prior data points, then $a + b + n$ is the <em>total</em> number of data points. The prior mean is thus weighted by the proportion of prior data to total data, while the maximum likelihood estimate is weighted by the proportion of sample data to total data. This provides another perspective on how Bayesian priors regularize estimates.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
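<p>A quick numerical sketch of this decomposition (the function names are mine). The two computations agree for any choice of $a$, $b$, $y$, and $n$:</p>

```r
# Posterior mean of a Beta(a + y, b + n - y) posterior, computed directly ...
posterior_mean <- function(y, n, a, b) (a + y) / (a + b + n)

# ... and as the weighted combination of prior mean and MLE derived above
weighted_mean <- function(y, n, a, b) {
  w <- (a + b) / (a + b + n)            # proportion of prior to total data
  w * a / (a + b) + (1 - w) * y / n     # prior mean and MLE, weighted
}

posterior_mean(3, 3, 1, 1)  # 0.8: some shrinkage even under a uniform prior
weighted_mean(3, 3, 2, 2)   # equals posterior_mean(3, 3, 2, 2), i.e. 5/7
```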
<h3 id="penalized-logistic-regression">Penalized logistic regression</h3>
<p>Cross-validation might be a bit awkward when we represent the data using only $y$ and $n$. We can go back to the product of Bernoulli representation, which uses all individual data points $x_i$. This results in a logistic regression problem with likelihood:</p>
<script type="math/tex; mode=display">p(x_1, \ldots, x_n \mid \beta) = \prod_{i=1}^n \left(\frac{1}{1 + \text{exp}^{-\beta}}\right)^{x_i} \left(1 - \frac{1}{1 + \text{exp}^{-\beta}}\right)^{1 - x_i}\enspace ,</script>
<p>where we use a sigmoid function as the link function, and $\beta$ is on the log odds scale. The penalized log likelihood function can be written as</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \left[ x_i \, \text{log} \left(\frac{1}{1 + \text{exp}^{-\beta}}\right) + (1 - x_i) \, \text{log} \left(1 - \frac{1}{1 + \text{exp}^{-\beta}}\right) \right] - \lambda |\beta| \enspace ,</script>
<p>where, because $\beta = 0$ corresponds to $\theta = 0.5$, we do not need to subtract $0.5$ in the penalty term. This parameterization also makes it easier to study which types of priors on $\beta$ result in an $\mathcal{L}_1$- or $\mathcal{L}_2$-norm penalty (spoiler: a Laplace and a Gaussian prior, respectively). Such models can be estimated using the R package <em>glmnet</em>, although it does not work for the exceedingly small sample we have played with in this blog post. This seems to imply that regularization is more natural in the Bayesian framework, which additionally allows more flexible specification of prior knowledge.</p>
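<p>To make the spoiler concrete for the $\mathcal{L}_1$ case: the log density of a Laplace$(0, 1 / \lambda)$ prior on $\beta$ is $\text{log}(\lambda / 2) - \lambda |\beta|$, so up to an additive constant its contribution to the log posterior is exactly the $\mathcal{L}_1$ penalty, and the MAP estimate under this prior coincides with the $\mathcal{L}_1$-penalized maximum likelihood estimate. A minimal sketch of this correspondence:</p>

```r
# Log density of a Laplace(0, 1 / lambda) prior on beta
log_laplace <- function(beta, lambda) log(lambda / 2) - lambda * abs(beta)

lambda <- 3
beta <- seq(-2, 2, 0.5)

# The difference from the pure L1 penalty, -lambda * |beta|, is the same
# constant log(lambda / 2) for every beta, so it does not affect the argmax
log_laplace(beta, lambda) - (-lambda * abs(beta))
```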
<hr />
<h2 id="references">References</h2>
<ul>
<li>Gelman, A., & Nolan, D. (<a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">2002</a>). You can load a die, but you can’t bias a coin. <em>The American Statistician, 56</em>(4), 308-311.</li>
<li>Stigler, S. M. (<a href="https://projecteuclid.org/euclid.ss/1207580174">2007</a>). The Epic Story of Maximum Likelihood. <em>Statistical Science, 22</em>(4), 598-620.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>I don’t think anybody actually ever is interested in estimating the bias of a coin. In fact, one <em>cannot bias a coin</em> if we are only allowed to flip it in the usual manner (see Gelman & Nolan, <a href="https://www.tandfonline.com/doi/abs/10.1198/000313002605">2002</a>). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>In a wonderful paper humbly titled <em>The Epic Story of Maximum Likelihood</em>, Stigler (<a href="https://projecteuclid.org/euclid.ss/1207580174">2007</a>) says that maximum likelihood estimation must have been familiar even to hunters and gatherers, although they would not have used such fancy words, as the idea is exceedingly simple. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Strictly speaking, this is incorrect: the only norm that exists for the one-dimensional vector space is the absolute value norm. Thus, in our example with only one parameter $\theta$ there is no notion of an $\mathcal{L}_2$-norm. However, because of the analogy to the regression and more generally multidimensional setting, I hope that this inaccuracy is excused. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>It also shows that in the limit of infinite data, the posterior mean converges to the maximum likelihood estimate. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderRegularization is the process of adding information to an estimation problem so as to avoid extreme estimates. Put differently, it safeguards against foolishness. Both Bayesian and frequentist methods can incorporate prior information which leads to regularized estimates, but they do so in different ways. In this blog post, I illustrate these two different perspectives on regularization on the simplest example possible — estimating the bias of a coin. Modeling coin flips Let’s say that we are interested in estimating the bias of a coin, which we take to be the probability of the coin showing heads.1 In this section, we will derive the Binomial likelihood — the statistical model that we will use for modeling coin flips. Let $X \in [0, 1]$ be a discrete random variable with realization $X = x$. Flipping the coin once, let the outcome $x = 0$ correspond to tails and $x = 1$ to heads. We use the Bernoulli likelihood to connect the data to the latent parameter $\theta$, which we take to be the bias of the coin: There is no point in estimating the bias by flipping the coin only once. We are therefore interested in a model that can account for $n$ coin flips. If we are willing to assume that the individual coin flips are independent and identically distributed conditional on $\theta$, we can obtain the joint probability of all outcomes by multiplying the probability of the individual outcomes: For the purposes of estimating the coin’s bias, we actually do not care about the order in which the coins come up heads or tails; we only care about how frequently the coin shows heads or tails out of $n$ throws. Thus, we do not model the individual outcomes $X_i$, but instead model their sum $Y = \sum_{i=1}^n X_i$. We write: where we suppress conditioning on $n$ to not clutter notation. Note that our model is not complete — we need to account for the fact that there are several ways to get $y$ heads out of $n$ throws. 
For example, we can get $y = 2$ with $n = 3$ in three different ways: $(1, 1, 0)$, $(0, 1, 1)$, and $(1, 0, 1)$. If we were to use the model above, we would underestimate the probability of observing two heads out of three coin tosses by a factor of three. In general, there are $n!$ possible ways in which we can order the outcomes. To see this, think of $n$ containers. The first outcome can go in any container, the second one in any container but the container which houses the first outcome, and so on, which yields $n \times (n - 1) \times \cdots \times 1 = n!$ orderings. However, we only care about $y$ of them, so we divide out the $(n - y)!$ orderings of the remaining outcomes. Moreover, once we have taken $y$ outcomes, we do not care about their order either; thus we divide out another $y!$ permutations. Therefore, for any particular sequence of coin flips of length $n$, there are $\frac{n!}{y! \, (n - y)!} = {n \choose y}$ ways to get $y$ heads out of $n$ throws. The funny looking symbol on the right is the Binomial coefficient. The probability of the data is therefore given by the Binomial likelihood $p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y}$, which just adds the term ${n \choose y}$ to the equation we had above after introducing $Y$. For the example of observing $y = 2$ heads out of $n = 3$ coin flips, the Binomial coefficient is ${3 \choose 2} = 3$, which accounts for the fact that there are three possible ways to get two heads out of three throws.

The data

Assume we flip the coin three times, $n = 3$, and observe three heads, $y = 3$. How can we estimate the bias of the coin? In the next sections, we will use the Binomial likelihood derived above and discuss three different ways of estimating the coin’s bias: maximum likelihood estimation, Bayesian estimation, and penalized maximum likelihood estimation.
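As a quick numerical sanity check (a sketch, not code from the original post), R's built-in `choose` and `dbinom` reproduce this counting argument:

```r
# Sanity check of the counting argument: choose(n, y) counts the orderings,
# and dbinom gives the full Binomial likelihood
n <- 3
y <- 2

choose(n, y) == factorial(n) / (factorial(y) * factorial(n - y))  # TRUE: 3 orderings

theta <- 0.5
manual <- choose(n, y) * theta^y * (1 - theta)^(n - y)
all.equal(manual, dbinom(y, size = n, prob = theta))  # TRUE: both give 0.375
```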
Classical estimation

Within the frequentist paradigm, the method of maximum likelihood is arguably the most popular method for parameter estimation: choose as an estimate for $\theta$ the value which maximizes the likelihood of the data.2 To get a feeling for how the likelihood of the data differs across values of $\theta$, let’s pick two values, $\theta_1 = .5$ and $\theta_2 = 1$, and compute the likelihood of observing three heads out of three coin flips: $p(y = 3 \mid \theta_1) = 0.5^3 = 0.125$ and $p(y = 3 \mid \theta_2) = 1^3 = 1$. We therefore conclude that the data are more likely for a coin that has bias $\theta_2 = 1$ than for a coin that has bias $\theta_1 = 0.5$. But is it the most likely value? To compare all possible values for $\theta$ visually, we plot the likelihood as a function of $\theta$ below. The left figure shows that, indeed, $\theta = 1$ maximizes the likelihood for the data. The right figure shows the likelihood function for $y = 15$ heads out of $n = 20$ coin flips. Note that, in contrast to probabilities, which need to sum to one, likelihoods do not have a natural scale. Do these two examples allow us to derive a general principle for how to estimate the bias of a coin? Let $\hat{\theta}$ denote an estimate of the population parameter $\theta$. The two figures above suggest that $\hat{\theta} = \frac{y}{n}$ is the maximum likelihood estimate for an arbitrary data set $d = (y, n)$ … and it is! To arrive at this mathematically, we can find the maximum of this likelihood function by taking the derivative with respect to $\theta$ and setting it to zero (see also a previous post). In other words, we solve for the value of $\theta$ at which the likelihood does not change; and since the Binomial likelihood is unimodal, this maximum will be unique. Note that the value of $\theta$ at which the likelihood function has its maximum does not change when we take logs, but because the mathematics is greatly simplified, we do so: $\frac{\partial}{\partial \theta} \log p(y \mid \theta) = \frac{y}{\theta} - \frac{n - y}{1 - \theta} = 0 \implies \hat{\theta} = \frac{y}{n}$, which shows that indeed $\frac{y}{n}$ is the maximum likelihood estimate.
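The derivative argument can also be checked numerically. The sketch below (not from the original post) maximizes the Binomial log-likelihood with `optimize` for the example of $y = 15$ heads out of $n = 20$ flips:

```r
# Binomial log-likelihood up to the constant term log(choose(n, y)),
# which does not depend on theta and hence does not affect the maximum
binomial_loglik <- function(theta, y, n) {
  y * log(theta) + (n - y) * log(1 - theta)
}

y <- 15
n <- 20
opt <- optimize(binomial_loglik, interval = c(1e-5, 1 - 1e-5),
                y = y, n = n, maximum = TRUE)
opt$maximum  # numerically close to the analytical result y / n = 0.75
```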
Bayesian estimation

Bayesians assign priors to parameters in addition to the likelihood, which takes a central role in all statistical paradigms. For this Binomial problem, we assign $\theta$ a Beta prior: $p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1}$, where $\text{B}(a, b)$ is the Beta function. As we will see below, this prior allows easy Bayesian updating while being sufficiently flexible in incorporating prior information. The figure below shows different Beta distributions, formalizing our prior belief about values of $\theta$. The figure in the top left corner assigns uniform prior plausibility to all values of $\theta$; the figures to its right incorporate a slight bias towards the extreme values $\theta = 1$ and $\theta = 0$. With increasing $a$ and $b,$ the prior becomes more biased towards $\theta = 0.5$; with decreasing $a$ and $b$, the prior becomes biased against $\theta = 0.5$. As shown in a previous blog post, the Beta distribution is conjugate to the Binomial likelihood, which means that the posterior distribution of $\theta$ is again a Beta distribution: $\theta \mid y \sim \text{Beta}(a', b')$, where $a' = a + y$ and $b' = b + (n - y)$. Under this conjugate setup, the parameters of the prior can be understood as prior data; for example, if we choose prior parameters $a = b = 1$, then we assume that we have seen one heads and one tails prior to data collection. The figure below shows two examples of such Bayesian updating processes. In both cases, we observe $y = 3$ heads out of $n = 3$ coin flips. On the left, we assign $\theta$ a uniform prior. The resulting posterior distribution is proportional to the likelihood (which we have rescaled to fit nicely in the graph) and thus does not appear as a separate line. After we have seen the data, we can compute the posterior mode as our estimate for the most likely value of $\theta$. Observe that the posterior mode is equivalent to the maximum likelihood estimate: $\text{mode}(\theta \mid y) = \frac{a' - 1}{a' + b' - 2} = \frac{(1 + 3) - 1}{(1 + 3) + 1 - 2} = 1$. This is in fact the case for all statistical estimation problems where we assign a uniform prior to the (possibly high-dimensional) parameter vector $\theta$.
To prove this, observe that $p(\theta \mid y) = \frac{p(y \mid \theta) \, p(\theta)}{p(y)} \propto p(y \mid \theta)$, since we can drop the normalizing constant $p(y)$, because it does not depend on $\theta$, and $p(\theta)$, because it is a constant assigning all values of $\theta$ equal probability. Using a Beta prior with $a = b = 2$, as shown on the right side of the figure above, we see that the posterior is no longer proportional to the likelihood. This in turn means that the mode of the posterior distribution no longer corresponds to the maximum likelihood estimate. In this case, the posterior mode is $\text{mode}(\theta \mid y) = \frac{a' - 1}{a' + b' - 2} = \frac{(2 + 3) - 1}{(2 + 3) + 2 - 2} = 0.8$. In contrast to earlier, this estimate is shrunk towards $\theta = 0.5$. This came about because we have used prior information stating that $\theta = 0.5$ is more likely than other values (see the figure with $a = b = 2$ above). We were therefore less swayed by the somewhat unlikely situation (under no bias, $\theta = 0.5$) of observing three heads out of three throws. It should thus not come as a surprise that Bayesian priors can act as regularizing devices. However, this requires careful application, especially in small sample size settings. In a Post Scriptum to this blog post, I similarly show how the posterior mean, which is arguably a more natural point estimate as it takes the uncertainty about $\theta$ better into account than the posterior mode, can be viewed as a regularized estimate, too.

Penalized estimation

Bayesians are not the only ones who can add prior information to an estimation problem. Within the frequentist framework, penalized estimation methods add a penalty term to the log likelihood function, and then find the parameter value which maximizes this penalized log likelihood. We can implement such a method by optimizing an extended log likelihood, $\log p(y \mid \theta) - \lambda (\theta - 0.5)^2$, where we penalize values that are far from the parameter value which indicates no bias, $\theta = 0.5$. The larger $\lambda$, the stronger values of $\theta \neq 0.5$ get penalized.
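To make the shrinkage concrete, here is a sketch (assuming the squared-distance penalty towards $\theta = 0.5$ discussed in the text; not the post's original code) showing how the penalized estimate moves towards $0.5$ as $\lambda$ grows:

```r
# Penalized Binomial log-likelihood with an L2 penalty towards theta = 0.5;
# the constant choose(n, y) is dropped since it depends on neither theta nor lambda
penalized_loglik <- function(theta, lambda, y, n) {
  y * log(theta) + (n - y) * log(1 - theta) - lambda * (theta - 0.5)^2
}

# y = n = 3: without a penalty, the maximum sits at the boundary theta = 1
grid <- seq(0.01, 0.99, by = 0.01)
grid[which.max(penalized_loglik(grid, lambda = 0, y = 3, n = 3))]  # 0.99 (grid boundary)
grid[which.max(penalized_loglik(grid, lambda = 5, y = 3, n = 3))]  # 0.85, shrunk towards 0.5
```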
In addition to picking $\lambda$, the particular form of the penalty term is also important. Similar to assigning $\theta$ a prior distribution, although possibly less straightforward and less flexible, choosing the penalty term means incorporating information about the problem in addition to specifying a likelihood function. Above, we have used the squared distance from $\theta = 0.5$ as a penalty. We call this the $\mathcal{L}_2$-norm penalty3, but the $\mathcal{L}_1$-norm, which takes the absolute distance, is an equally interesting choice: $\log p(y \mid \theta) - \lambda \, \lvert \theta - 0.5 \rvert$. As we will see below, these penalties have very different effects. The penalized likelihood does not only depend on $\theta$, but also on $\lambda$. The code below evaluates the penalized log likelihood function given values for these two parameters. Note that we drop the normalizing constant ${n \choose y}$ as it depends on neither $\theta$ nor $\lambda$.

I don’t think anybody is actually ever interested in estimating the bias of a coin. In fact, one cannot bias a coin if we are only allowed to flip it in the usual manner (see Gelman & Nolan, 2002). ↩

In a wonderful paper humbly titled The Epic Story of Maximum Likelihood, Stigler (2007) says that maximum likelihood estimation must have been familiar even to hunters and gatherers, although they would not have used such fancy words, as the idea is exceedingly simple. ↩

Strictly speaking, this is incorrect: the only norm that exists for the one-dimensional vector space is the absolute value norm. Thus, in our example with only one parameter $\theta$ there is no notion of an $\mathcal{L}_2$-norm. However, because of the analogy to the regression and more generally multidimensional setting, I hope that this inaccuracy is excused. ↩Variable selection using Gibbs sampling2019-03-31T13:00:00+00:002019-03-31T13:00:00+00:00https://fabiandablander.com/r/Spike-and-Slab<p>“Which variables are important?” is a key question in science and statistics.
In this blog post, I focus on linear models and discuss a Bayesian solution to this problem using <em>spike-and-slab priors</em> and the <em>Gibbs sampler</em>, a computational method to sample from a joint distribution using only conditional distributions.</p>
<p>Variable selection is a beast. To slay it, we must draw on ideas from different fields. We have to discuss the basics of Bayesian inference which motivates our principal weapon, the Gibbs sampler. As an instruction manual, we apply it to a simple example: drawing samples from a bivariate Gaussian distribution (for pre-combat exercises, see <a href="https://fdabl.github.io/statistics/Two-Properties.html">here</a>). The Gibbs sampler feeds on conditional distributions. To be able to derive those easily, we need to equip ourselves with $d$-separation and directed acyclic graphs (DAGs). Having trained and become stronger, we attack variable selection in the linear regression case using Gibbs sampling with spike-and-slab priors. These priors are special in that they are a discrete mixture of a Dirac delta function — which can shrink regression coefficients exactly to zero — and a Gaussian distribution. We tackle the single predictor case first, and then generalize it to $p > 1$ predictors. For $p$ predictors, the Gibbs sampler with spike-and-slab priors yields a posterior distribution over all possible $2^p$ regression models, an enormous feat. From this, posterior inclusion probabilities and model-averaged parameter estimates follow straightforwardly. To wield this weapon in practice, we implement the method in R and engage in variable selection on simulated and real data. Seems like we have a lot to cover, so let’s get started!</p>
<h1 id="quantifying-uncertainty">Quantifying uncertainty</h1>
<p>Bayesian inference is an excellent tool for uncertainty quantification. Assume you have assigned a prior distribution to some parameter $\beta$ of a model $\mathcal{M}$, call it $p(\beta \mid \mathcal{M})$. After you have observed data $\mathbf{y}$, how should you update your belief to arrive at the posterior, $p(\beta \mid \mathbf{y}, \mathcal{M})$? The rules of probability dictate:</p>
<script type="math/tex; mode=display">\underbrace{p(\beta \mid \mathbf{y}, \mathcal{M})}_{\text{Posterior}} = \underbrace{p(\beta \mid \mathcal{M})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{y} \mid \beta, \mathcal{M})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{y} \mid \beta, \mathcal{M}) \, p(\beta \mid \mathcal{M}) \, \mathrm{d} \beta}_{\text{Marginal Likelihood}}} \enspace .</script>
<p>The computationally easy parts of the right-hand side are the specification of the prior and, unless you do <a href="https://en.wikipedia.org/wiki/Approximate_Bayesian_computation">crazy things</a>, also the likelihood. The tough bit is the marginal likelihood or <em>normalizing constant</em> which, as the name implies, makes the posterior distribution integrate to one, as all proper probability distributions must. In contrast to differentiation, which is a local operation, integration is a global operation and is thus <a href="https://xkcd.com/2117/">much harder</a>. It becomes even harder with many parameters.</p>
<p>Usually, Bayes’ rule is given without conditioning on the model, $\mathcal{M}$. However, this assumes that we know one model to be true with certainty, thus ignoring the uncertainty we have about the models. We can apply Bayes’ rule not only on parameters, but also on models:</p>
<script type="math/tex; mode=display">p(\mathcal{M} \mid \mathbf{y}) = p(\mathcal{M}) \times \frac{p(\mathbf{y} \mid \mathcal{M})}{\sum_{i = 1}^m p(\mathbf{y} \mid \mathcal{M}_i) \, p(\mathcal{M}_i)} \enspace ,</script>
<p>where $m$ is the number of all models and</p>
<script type="math/tex; mode=display">p(\mathbf{y} \mid \mathcal{M}) = \int p(\mathbf{y} \mid \mathcal{M}, \beta) \, p(\beta \mid \mathcal{M}) \, \mathrm{d} \beta \enspace ,</script>
<p>is in fact the marginal likelihood of our first equation. To illustrate how one could do variable selection, assume we have two models, $\mathcal{M}_2$ and $\mathcal{M}_4$, which differ in their number of predictors:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{M}_2&: \mathbf{y} = \beta_0 + \beta_1 \mathbf{x}_1 \\[0.5em]
\mathcal{M}_4&: \mathbf{y} = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 \enspace .
\end{aligned} %]]></script>
<p>If these two are the only models we consider, then we can quantify their respective merits using posterior odds:</p>
<script type="math/tex; mode=display">\underbrace{\frac{p(\mathcal{M}_4 \mid \mathbf{y})}{p(\mathcal{M}_2 \mid \mathbf{y})}}_{\text{Posterior Odds}} = \underbrace{\frac{p(\mathcal{M}_4)}{p(\mathcal{M}_2)}}_{\text{Prior Odds}} \times \underbrace{\frac{p(\mathbf{y} \mid \mathcal{M}_4)}{p(\mathbf{y} \mid \mathcal{M}_2)}}_{\text{Bayes factor}} \enspace ,</script>
<p>where we can interpret the Bayes factor as an indicator for how much more likely the data are under $\mathcal{M}_4$, which includes $\beta_2$, compared to $\mathcal{M}_2$, which does not include $\beta_2$. However, two additional regression models are possible:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{M}_1&: \mathbf{y} = \beta_0\\[0.5em]
\mathcal{M}_3&: \mathbf{y} = \beta_0 + \beta_2 \mathbf{x}_2 \enspace .
\end{aligned} %]]></script>
<p>In general, if $p$ is the number of predictors, then there are $2^p$ possible regression models in total. If we ignore some of those a priori, we will have violated <em>Cromwell’s rule</em>, which states that we should never assign prior probabilities of zero to things that could possibly happen. Otherwise, regardless of the evidence, we would never change our mind. As Dennis Lindley put it, we should</p>
<blockquote>
<p>“[…] leave a little probability for the moon being made of green cheese; it can be as small as 1 in a million, but have it there since otherwise an army of astronauts returning with samples of the said cheese will leave you unmoved.” (Lindley, 1991, p. 101)</p>
</blockquote>
<p>One elegant aspect about the Bayes factor is that we do not need to compute the normalizing constant of all models (it cancels in the ratio), which would require us to enumerate and assign priors to all possible models. If we are willing to do this, however, then we can model-average to get a posterior distribution of $\beta_j$ that takes into account the uncertainty about all $m$ models:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta_j \mid \mathbf{y}) &= \sum_{i=1}^m \, p(\beta_j, \mathcal{M}_i \mid \mathbf{y}) \\[.5em]
&= \sum_{i=1}^m \, p(\beta_j \mid \mathbf{y}, \mathcal{M}_i) \, p(\mathcal{M}_i \mid \mathbf{y}) \enspace ,
\end{aligned} %]]></script>
<p>which requires computing the posterior distribution over the parameter of interest $\beta_j$ in each model $\mathcal{M}_i$, as well as the posterior distribution over all such models. Needless to say, this is a difficult problem; the bulk of this blog post is to find an efficient way to do this in the context of linear regression models. For variable selection, we might be interested in another quantity: the posterior probability that $\beta_j \neq 0$, averaged over all models. We can arrive at this by similar means:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta_j \neq 0 \mid \mathbf{y}) & = \sum_{i=1}^m \, p(\beta_j \neq 0, \mathcal{M}_i \mid \mathbf{y}) \\[.5em]
&= \sum_{i=1}^m \, p(\beta_j \neq 0 \mid \mathbf{y}, \mathcal{M}_i) \, p(\mathcal{M}_i \mid \mathbf{y}) \enspace .
\end{aligned} %]]></script>
<p>Note that conditional on a model $\mathcal{M}_i$, $\beta_j$ is either zero or not zero. Therefore, all the terms in which $\beta_j$ is zero drop out of the sum, and we are left with summing the posterior model probabilities for the models in which $\beta_j \neq 0$. This model-averaging perspective strikes me as a very elegant approach to variable selection.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In the remainder of this blog post, we will solve this variable selection problem for linear regression using the Gibbs sampler with spike-and-slab priors.</p>
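<p>As a toy numerical illustration (the probabilities below are made up for the example, not computed from data), suppose the four models received these posterior probabilities; the inclusion probability for $\beta_1$ is then simply the summed probability of the models containing $\mathbf{x}_1$:</p>

```r
# Hypothetical posterior model probabilities for the four models above:
# M1: intercept only, M2: x1, M3: x2, M4: x1 and x2
post_model <- c(M1 = 0.10, M2 = 0.50, M3 = 0.10, M4 = 0.30)

# p(beta_1 != 0 | y): sum over all models that include x1
sum(post_model[c("M2", "M4")])  # 0.8
```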
<h1 id="gibbs-sampling">Gibbs sampling</h1>
<p>Much of the advent in Bayesian inference in the last few decades is due to methods that arrive at the posterior distribution without calculating the marginal likelihood. One such method is the Gibbs sampler, which breaks down a high-dimensional problem into a number of smaller low-dimensional problems. It’s really one of the coolest things in statistics: it samples from the joint posterior distribution and its marginals by sampling from the conditional posterior distributions. To prove that it works mathematically is not trivial, and beyond this already lengthy introductory blog post.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> Thus, instead of getting bogged down in the technical details, let’s take a look at a motivating example.</p>
<h2 id="sampling-from-a-bivariate-gaussian">Sampling from a bivariate Gaussian</h2>
<p>To illustrate, let $X_1$ and $X_2$ be bivariate normally distributed random variables with population mean zero ($\mu_1 = \mu_2 = 0$), unit variance ($\sigma_1^2 = \sigma_2^2 = 1$), and correlation $\rho$. As you may recall from a <a href="https://fdabl.github.io/statistics/Two-Properties.html">previous</a> blog post, the conditional Gaussian distribution of $X_1$ given $X_2 = x_2$, and $X_2$ given $X_1 = x_1$, respectively, are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
X_1 \mid X_2 = x_2 &\sim \mathcal{N}\left(\rho x_2, \, (1 - \rho^2)\right) \\[0.5em]
X_2 \mid X_1 = x_1 &\sim \mathcal{N}\left(\rho x_1, \, (1 - \rho^2)\right) \enspace .
\end{aligned} %]]></script>
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- x_2^1 &\sim \mathcal{N}(0, 1) \\[.5em] -->
<!-- x_1^2 &\sim \mathcal{N}\left(\rho x_2^1, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- x_2^2 &\sim \mathcal{N}\left(\rho x_1^2, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- x_1^3 &\sim \mathcal{N}\left(\rho x_2^2, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- x_2^3 &\sim \mathcal{N}\left(\rho x_1^3, \, (1 - \rho^2)\right) \\[0.5em] -->
<!-- \vdots &\sim \vdots \\[.5em] -->
<!-- \end{aligned} -->
<!-- $$ -->
<p>The Gibbs sampler makes it so that if we sample repeatedly from these two conditional distributions:</p>
<script type="math/tex; mode=display">(x_1^1, x_2^1), (x_1^2, x_2^2), \ldots, (x_1^{n - 1}, x_2^{n - 1}), (x_1^n, x_2^n) \enspace ,</script>
<p>then these will be samples from the joint distribution $p(X_1, X_2)$ and its marginals.</p>
<!-- The astounding thing with Gibbs sampling is that, if we sample $x_1^t$ from the conditional distribution $p(X_1^t \mid X_2 = x_2^{t-1})$ and $x_2^t$ from the conditional distribution $p(X_2^t \mid X_1 = x_1^{t-1})$, then under some regularity conditions, the joint samples will be from the bivariate Gaussian distribution! -->
<p>To illustrate, we implement this Gibbs sampler in R.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">sample_bivariate_normal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">rho</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">rho</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">rho</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="c1"># sample from p(X1 | X2 = x2)</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">rho</span><span class="o">*</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="p">,</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">rho</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="c1"># sample from p(X2 | X1 = x1)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">x</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Let’s see it in action:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sample_bivariate_normal</span><span class="p">(</span><span class="n">rho</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">)</span><span class="w">
</span><span class="n">cov</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1] [,2]
## [1,] 1.0178545 0.5091747
## [2,] 0.5091747 0.9949518</code></pre></figure>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>Wait a minute, you might say. In this toy example, what was the prior distribution, and which posterior did we compute? The answer is: there were none! We have used the Gibbs sampler not to learn about a parameter, but rather to illustrate that sampling from conditional distributions in this way results in samples from the joint distribution. In the next section, we look at how graphs can help us in finding conditional independencies.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<!-- Although you could get philosophical and ask: what exactly is the [difference](https://www.tandfonline.com/doi/abs/10.1080/15366360802035497) between 'data' (here $x_1$ and $x_2$) and parameters (here $\rho$)? -->
<!-- ## Example II: Another thing -->
<!-- The example above is a little fishy: in the Gaussian case, if we know both conditional distributions, then we also [know the joint distribution](https://fdabl.github.io/statistics/Two-Properties.html)! -->
<h1 id="conditional-independence-and-graphs">Conditional independence and graphs</h1>
<p>Before we look into variable selection using spike-and-slab priors in the linear regression case, we need to get some preliminaries about conditional independence out of the way. We write:</p>
<script type="math/tex; mode=display">X \perp Y \hspace{.4em} \vert\, Z \enspace ,</script>
<p>to denote that $X$ and $Y$ are <em>conditionally independent</em> given $Z$.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> We can visualize conditional independencies between random variables using directed acyclic graphs (DAGs). The figure below distinguishes between three different DAG structures.</p>
<p><img src="../assets/img/DAGs-SS.png" /></p>
<p>DAG (a) above is a <em>common cause</em> structure. A good example is the positive correlation between the number of storks and the number of human babies delivered; these two variables become independent once one conditions on the common cause <em>economic development</em> (Matthews, 2001). DAG (b) is an example where the effect of $X$ on $Y$ is <em>fully mediated</em> by $Z$: conditional on $Z$, $X$ does not have an effect on $Y$. Thus, both in DAGs (a) and (b), conditioning on $Z$ renders $X$ and $Y$ independent.</p>
<p>Two variables can also be <em>marginally independent</em>, for which we write:</p>
<script type="math/tex; mode=display">X \perp Y \enspace ,</script>
<p>which holds in DAG (c). Note, however, that if we were to condition on $Z$ in DAG (c), then $X$ and $Y$ would become <em>dependent</em>. $Z$ is a <em>collider</em>, and conditioning on it induces a dependency between $X$ and $Y$. Although not visible in the DAG, a dependency would also have been induced between $X$ and $Y$ if we had conditioned on any children of $Z$.</p>
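<p>A small simulation (a sketch, not from the original post) makes the collider effect tangible: $X$ and $Y$ are drawn independently, yet selecting on their common effect $Z$ induces a negative correlation between them:</p>

```r
set.seed(1)
n <- 10000
x <- rnorm(n)
y <- rnorm(n)            # x and y are marginally independent
z <- x + y + rnorm(n)    # z is a collider: a common effect of x and y

cor(x, y)                # close to zero
cor(x[z > 1], y[z > 1])  # clearly negative: conditioning on z induces dependence
```

Intuitively, among cases selected for large $z$, a small $x$ must be compensated by a large $y$ (or vice versa), which is exactly the induced dependency.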
<!-- There are various good examples of so-called *collider bias*; for example: ... -->
<p>Note that although we visualize the conditional independencies in a DAG, we do not interpret it causally. We are merely interested in <em>seeing</em>, not <em>doing</em>, and view the arrows as “incidental construction features supporting the
$d$-separation semantics” (Dawid, 2010 p. 90).</p>
<!-- From $d$-separation, we can distill the following factorization for the conditional probability of a node: -->
<!-- $$ -->
<!-- p(X \mid \mathcal{G} \setminus \{X\}) \propto \,p(X \mid \text{Pa}(X)) \, \prod_{Y \in \text{Ch(X)}} p(Y \mid \text{Pa(Y)}) \enspace . -->
<!-- $$ -->
<p>As we will see in the next section, being able to read conditional independencies from a graph greatly aids in finding conditional distributions feeding the Gibbs sampler.</p>
<h1 id="spike-and-slab-regression">Spike-and-Slab Regression</h1>
<h2 id="model-specification">Model specification</h2>
<p>In a previous blog post, we discussed the (history of the) methods of least squares and <a href="https://fdabl.github.io/statistics/Curve-Fitting-Gaussian.html">linear regression</a>. However, we did not assess whether a particular variable $X$ is actually associated with an outcome $Y$. We can think of this problem as hypothesis testing, variable selection, or structure learning. In particular, we may write the regression model with a single predictor variable as:</p>
<script type="math/tex; mode=display">y_i \sim \mathcal{N}(\beta \, x_i , \sigma_e^2) \enspace .</script>
<p>We put the following prior on $\beta$:</p>
<script type="math/tex; mode=display">\beta \sim (1 - \pi) \, \delta_0 + \pi \, \mathcal{N}(0, \sigma_y^2 \tau^2) \enspace ,</script>
<p>where $\pi \in [0, 1]$ is a mixture weight, $\sigma_y^2$ is the variance of $\mathbf{y}$, $\delta_0$ is the <a href="https://en.wikipedia.org/wiki/Dirac_delta_function">Dirac delta function</a> (the <em>spike</em>), and $\tau^2$ is the variance of the <em>slab</em>. We multiply $\tau^2$ with $\sigma_y^2$ so that the prior naturally scales with the scale of the outcome. If we would not do this, then our results would depend on the measurement units of $\mathbf{y}$. Instead of fixing $\tau^2$ to a constant, we learn it by specifying</p>
<script type="math/tex; mode=display">\tau^2 \sim \text{Inverse-Gamma}(1/2, s^2/2) \enspace ,</script>
<p>which results in a scale-mixture of Gaussians, that is, a Cauchy distribution with scale $s$. The figure below visualizes the marginal prior on $\beta$ as a discrete mixture ($\pi = 0.5$) of a Dirac delta, a Cauchy with scale $s = 1/2$, and $\sigma_y^2 = 1$.</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>The idea behind this specification is to allow the regression weight $\beta$ to be <em>exactly</em> zero. Using Gibbs sampling, we will arrive at $p(\pi \mid \mathbf{y})$, which indicates the posterior probability of the parameter $\beta$ being zero. We continue the prior specification with</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\pi &\sim \text{Bern}(\theta) \\[0.5em]
\theta &\sim \text{Beta}(a, b) \\[0.5em]
\sigma_e^2 &\sim \text{Inverse-Gamma}(\alpha_1, \alpha_2) \enspace ,
\end{aligned} %]]></script>
<p>where we set $a = b = 1$ and $\alpha_1 = \alpha_2 = 2$. We can visualize the relations between all random variables in a DAG, see the figure below. Nodes with a grey shadow are observed or set by us, white nodes denote random variables.</p>
<p><img src="../assets/img/SS-GM.png" /></p>
<p>Using $d$-separation as introduced in the previous section, we note that this larger graph is basically a collection of DAGs (b) and (c). This helps us see that the joint probability distribution factors:</p>
<script type="math/tex; mode=display">p(\mathbf{y}, \beta, \pi, \theta, \tau^2, \sigma_e^2) = p(\mathbf{y} \mid \beta, \sigma_e^2) \, p(\sigma_e^2) \, p(\beta \mid \pi, \tau^2) \, p(\pi \mid \theta) \, p(\theta) \, p(\tau^2) \enspace ,</script>
<p>where we have suppressed conditioning on the hyperparameters $a = b = 1$, $\alpha_1 = \alpha_2 = 2$, $s = 1/2,$ the predictor variables $X$, and the variance of the outcome $\sigma_y^2$.</p>
<p>For the Gibbs sampler, we need the conditional posterior distribution of each parameter given the data and all other parameters. Using the conditional independence structure of the graph, this results in the following conditional distributions:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&p(\theta \mid \mathbf{y}, \beta, \pi, \tau^2, \sigma_e^2 ) = p(\theta \mid \pi) \\[0.5em]
&p(\tau^2 \mid \mathbf{y}, \beta, \pi, \theta, \sigma_e^2) = p(\tau^2 \mid \beta, \pi) \\[0.5em]
&p(\sigma_e^2 \mid \mathbf{y}, \beta, \pi, \theta, \tau^2) = p(\sigma_e^2 \mid \mathbf{y}, \beta) \\[0.5em]
&p(\pi \mid \mathbf{y}, \beta, \theta, \tau^2, \sigma_e^2) = p(\pi \mid \beta, \theta, \tau^2) \\[0.5em]
&p(\beta \mid \mathbf{y}, \pi, \theta, \tau^2, \sigma_e^2) = p(\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2) \enspace .
\end{aligned} %]]></script>
<!-- These conditional independencies result in *local computation*: certain parts are shielded from other parts of the graph. The shield is called the *Markov blanket*. Information trickles through the graph from node to node. -->
<p>In the next sections, we derive these conditional posterior distributions in turn. Since the single predictor case is slightly simpler to follow, we focus on it. However, the generalization to the multiple predictor setting is relatively straightforward, and I will sketch it afterwards.</p>
<h2 id="conditional-posterior-ptheta-mid-pi">Conditional posterior $p(\theta \mid \pi)$</h2>
<p>We expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid \pi) &= \frac{p(\pi \mid \theta) \, p(\theta)}{\int p(\pi \mid \theta) \, p(\theta) \, \mathrm{d}\theta} \\[.5em]
&=\frac{\theta^\pi (1 - \theta)^{1 - \pi} \frac{1}{B(a,b)} \theta^{a - 1} (1 - \theta)^{b - 1}}{\int \theta^\pi (1 - \theta)^{1 - \pi} \frac{1}{B(a,b)} \theta^{a - 1} (1 - \theta)^{b - 1} \, \mathrm{d}\theta} \\[.5em]
&= \frac{\theta^{(a + \pi) - 1} (1 - \theta)^{(b + 1 - \pi) - 1}}{\int \theta^{(a + \pi) - 1} (1 - \theta)^{(b + 1 - \pi) - 1} \, \mathrm{d}\theta} \enspace ,
\end{aligned} %]]></script>
<p>where $B$ is the <a href="https://en.wikipedia.org/wiki/Beta_function">beta function</a>, and where we realize the numerator is the <em>kernel</em> of a Beta distribution, and the denominator is the normalizing constant. Thus, the posterior is again a Beta distribution:</p>
<script type="math/tex; mode=display">\theta \mid \pi \sim \text{Beta}(a + \pi, b + 1 - \pi) \enspace .</script>
<p>As we can see, the conditional posterior of $\theta$ only depends on $\pi$. That means, however, that we can never get much information about this parameter, as $\pi$ can only be 0 or 1, and so the Beta distribution can only become $\text{Beta}(2, 1)$ or $\text{Beta}(1, 2)$ with a uniform prior $a = b = 1$. The posterior mean of $\theta$ can thus never become larger than $2/3$ or smaller than $1/3$.</p>
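<p>The Gibbs step for $\theta$ is thus a single Beta draw. A minimal sketch (the function name is mine; the argument is called <code>cur_pi</code> to avoid shadowing R’s circle constant <code>pi</code>):</p>

```r
# One Gibbs step for theta given the current pi (0 or 1), under a
# Beta(a, b) prior; illustrative sketch, names are mine
sample_theta <- function(cur_pi, a = 1, b = 1) {
  rbeta(1, a + cur_pi, b + 1 - cur_pi)
}

set.seed(1)
draws <- replicate(1e4, sample_theta(1))
mean(draws)  # close to 2/3, the mean of Beta(2, 1)
```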
<h2 id="conditional-posterior-ptau2-mid-beta-pi">Conditional posterior $p(\tau^2 \mid \beta, \pi)$</h2>
<p>The conditional posterior on $\tau^2$ also depends on $\pi$ because conditioning on $\beta$ means conditioning on a collider, inducing the dependency. We expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\tau^2 \mid \beta, \pi) &= \frac{p(\beta \mid \tau^2, \pi) \, p(\pi) \, p(\tau^2)}{\int p(\beta \mid \tau^2, \pi) \, p(\pi) \, p(\tau^2) \, \mathrm{d}\tau^2} \\[.5em]
&= \frac{p(\beta \mid \tau^2, \pi) \, p(\tau^2)}{\int p(\beta \mid \tau^2, \pi) \, p(\tau^2) \, \mathrm{d}\tau^2} \enspace .
\end{aligned} %]]></script>
<p>To make the notation less cluttered, we will call the normalizing constant in this and all following derivations $Z$. Note that terms that do not depend on the parameter of interest in the numerator cancel, as the same terms appear in the normalizing constant. Further note that $\pi$ can be either 0 or 1. We first tackle the $\pi = 1$ case and write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\tau^2 \mid \beta, \pi = 1) &= \frac{1}{Z} \, p(\beta \mid \tau^2, \pi) \, p(\tau^2) \\[0.5em]
&= \frac{1}{Z} \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right) \frac{\left(\frac{s^2}{2}\right)^{\frac{1}{2}}}{\Gamma\left(\frac{1}{2}\right)} \left(\tau^2\right)^{- \frac{1}{2} - 1} \text{exp}\left(-\frac{\frac{s^2}{2}}{\tau^2}\right) \enspace ,
\end{aligned} %]]></script>
<p>where $\Gamma$ is the <a href="https://en.wikipedia.org/wiki/Gamma_function">gamma function</a>. Absorbing everything that does not depend on $\tau^2$ into the normalizing constant, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\tau^2 \mid \beta, \pi = 1) &= \frac{1}{Z} \, \left(\tau^2\right)^{-\frac{1}{2} - 1 -\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 - \frac{\frac{s^2}{2}}{\tau^2} \right) \\[0.5em]
&= \frac{1}{Z} \, \left(\tau^2\right)^{-\left(\frac{1}{2} + \frac{1}{2}\right) - 1} \text{exp}\left(-\frac{\left(\frac{s^2}{2} + \frac{\beta^2}{2\sigma_y^2}\right)}{\tau^2}\right) \enspace ,
\end{aligned} %]]></script>
<p>which is a new inverse Gamma distribution:</p>
<script type="math/tex; mode=display">\tau^2 \mid \beta , \pi = 1 \sim \text{Inverse-Gamma}\left(\frac{1}{2} + \frac{1}{2}, \frac{s^2}{2} + \frac{\beta^2}{2\sigma_y^2}\right) \enspace .</script>
<p>On the other hand, if $\pi = 0$, then $\beta = 0$ and we simply sample from the prior:</p>
<script type="math/tex; mode=display">\tau^2 \mid \beta , \pi = 0 \sim \text{Inverse-Gamma}\left(\frac{1}{2}, \frac{s^2}{2}\right) \enspace .</script>
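<p>Both cases fold into one Gibbs step, since setting $\pi = 0$ simply removes the $\beta$ contribution to the shape and rate. A sketch with names of my choosing:</p>

```r
# One Gibbs step for tau2 given beta and pi; with cur_pi = 0 this
# reduces to a draw from the Inverse-Gamma(1/2, s^2/2) prior.
# Illustrative sketch, names are mine.
sample_tau2 <- function(beta, cur_pi, s = 1/2, sigma2y = 1) {
  shape <- 1/2 + cur_pi / 2
  rate  <- s^2 / 2 + cur_pi * beta^2 / (2 * sigma2y)
  1 / rgamma(1, shape = shape, rate = rate)  # Inverse-Gamma via 1 / Gamma
}

set.seed(1)
draws <- replicate(1e5, sample_tau2(beta = 0, cur_pi = 0))
median(draws)  # close to the prior median
```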
<p>Because the derivation is very similar, we look at the conditional posterior $p(\sigma_e^2 \mid y, \beta)$ next.</p>
<h2 id="conditional-posterior-psigma_e2-mid-y-beta">Conditional posterior $p(\sigma_e^2 \mid y, \beta)$</h2>
<p>Again writing the normalizing constant as $Z$, we expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\sigma_e^2 \mid \mathbf{y}, \beta) &= \frac{1}{Z} \, p(\mathbf{y} \mid \beta, \sigma_e^2)\, p(\beta) \, p(\sigma_e^2) \\[1em]
&= \frac{1}{Z} \, \left(2\pi\sigma_e^2\right)^{-n/2} \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \frac{\alpha_2^{\alpha_1}}{\Gamma(\alpha_1)} \left(\sigma_e^2\right)^{- \alpha_1 - 1} \text{exp} \left(-\frac{\alpha_2}{\sigma_e^2}\right) \enspace ,
\end{aligned} %]]></script>
<p>which looks very similar to the conditional posterior on $\tau^2$. In fact, using the same tricks as above — absorbing terms that do not depend on $\sigma_e^2$ into $Z$, and putting terms together — we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\sigma_e^2 \mid \mathbf{y}, \beta) &= \frac{1}{Z} \, \left(\sigma_e^2\right)^{-\frac{n}{2}} \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \left(\sigma_e^2\right)^{- \alpha_1 - 1} \text{exp} \left(-\frac{\alpha_2}{\sigma_e^2}\right) \\[1em]
&= \frac{1}{Z} \, \left(\sigma_e^2\right)^{-\left(\alpha_1 + \frac{n}{2}\right) - 1} \text{exp} \left(-\frac{1}{\sigma_e^2} \left[\alpha_2 + \frac{\sum_{i=1}^n(y_i - \beta x_i)^2}{2}\right]\right) \enspace ,
\end{aligned} %]]></script>
<p>which is again an inverse Gamma distribution:</p>
<script type="math/tex; mode=display">\sigma_e^2 \mid \mathbf{y}, \beta \sim \text{Inverse-Gamma}\left(\alpha_1 + \frac{n}{2}, \alpha_2 + \frac{\sum_{i=1}^n(y_i - \beta x_i)^2}{2}\right) \enspace .</script>
<p>Contrasting this derivation with the one above, we note something interesting. Our belief about the variance $\sigma_e^2$ gets updated using the $n$ data points $\mathbf{y}$, whereas our belief about $\tau^2$ gets updated using only $\beta$. “In the Bayesian framework, the difference between data and parameters is fuzzy”, McElreath points out (2016, p. 34); or, put even more strongly, Dawid (1979, p.1): “[…] the distinction between data and parameters is largely irrelevant”.</p>
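<p>As a sketch, the corresponding Gibbs step for $\sigma_e^2$ looks as follows (function name and the demo data are mine; the prior defaults follow $\alpha_1 = \alpha_2 = 2$):</p>

```r
# One Gibbs step for sigma2e given y, x, and beta, under an
# Inverse-Gamma(a1, a2) prior; illustrative sketch, names are mine
sample_sigma2e <- function(y, x, beta, a1 = 2, a2 = 2) {
  shape <- a1 + length(y) / 2
  rate  <- a2 + sum((y - beta * x)^2) / 2
  1 / rgamma(1, shape = shape, rate = rate)  # Inverse-Gamma via 1 / Gamma
}

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)
sample_sigma2e(y, x, beta = 2)  # a draw near the residual variance 0.25
```

<p>With more data points, the likelihood dominates the prior and the draws concentrate around the residual variance.</p>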
<p>Because the conditional posterior of $\pi$ is quite tricky, we continue with the conditional posterior of $\beta$.</p>
<h2 id="conditional-posterior-pbeta-mid-y-pi-tau2-sigma_e2">Conditional posterior $p(\beta \mid y, \pi, \tau^2, \sigma_e^2)$</h2>
<p>The conditional posterior of $\beta$ given $\pi = 0$ is easy: it is the Dirac delta function $\delta_0$, from which samples will always have value 0. The conditional posterior for $\pi = 1$ is a little more complicated to derive, but not by much. We start by writing:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1, \tau^2) \, p(\pi = 1) \, p(\tau^2)}{\int p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1) \, p(\pi = 1) \, p(\tau^2) \, \mathrm{d} \beta} \\[1em]
&= \frac{p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1)}{\int p(\mathbf{y} \mid \beta, \pi = 1, \tau^2, \sigma_e^2) \, p(\beta \mid \pi = 1) \, \mathrm{d} \beta} \enspace ,
\end{aligned} %]]></script>
<p>where we again write the normalizing constant as $Z$. Expanding, we get:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \left(2\pi\sigma_e^2\right)^{-n/2} \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-1/2} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \enspace .
\end{aligned} %]]></script>
<p>We can again absorb terms that do not depend on $\beta$ into $Z$. We proceed:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 \right) \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(y_i - \beta x_i\right)^2 -\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n \left(y_i - \beta x_i\right)^2 +\frac{2\sigma_e^2}{2\sigma_y^2\tau^2} \beta^2 \right]\right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n y_i^2 - 2\beta\sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2} \beta^2 \right]\right) \enspace .
\end{aligned} %]]></script>
<p>We can further absorb the $\sum_{i=1}^n y_i^2$ term into $Z$ and put the $\beta^2$ terms together. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \text{exp} \left(-\frac{1}{2\sigma_e^2} \left[\beta^2\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right) - 2\beta\sum_{i=1}^n y_i x_i\right]\right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left[\beta^2 - \frac{2\beta\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right]\right) \enspace .
\end{aligned} %]]></script>
<p>If you have followed my previous blog post (see <a href="https://fdabl.github.io/statistics/Two-Properties.html">here</a>), then you might guess what comes next: completing the square! We expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi = 1, \tau^2, \sigma_e^2) &= \frac{1}{Z} \, \text{exp} \left(-\frac{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left[\left(\beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 - \left(\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2\right]\right) \\[0.5em]
&= \frac{1}{Z} \, \text{exp} \left(-\frac{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left(\beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 +\frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2\right)
\enspace ,
\end{aligned} %]]></script>
<p>where we have absorbed the last term into the normalizing constant $Z$ because it does not depend on $\beta$. Note that this is the <em>kernel</em> of a Gaussian distribution, which completes our ordeal — which we both enjoy, admit it! — resulting in:</p>
<script type="math/tex; mode=display">% <![CDATA[
\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2 \sim \begin{cases}
\delta_0 & \hspace{1em} \text{if} \hspace{1em} \pi = 0 \\
\mathcal{N}\left(\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2} \right)}, \frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2} \right)}\right) & \hspace{1em} \text{if} \hspace{1em} \pi = 1
\end{cases} %]]></script>
<p>Note again that we take samples from $\delta_0$ to always be zero.</p>
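<p>A sketch of this two-case Gibbs step (function name and demo data are mine):</p>

```r
# One Gibbs step for beta: the spike returns exactly zero, the slab
# draws from the conditional Gaussian derived above; names are mine
sample_beta <- function(y, x, cur_pi, tau2, sigma2e, sigma2y = 1) {
  if (cur_pi == 0) return(0)  # a "draw" from the Dirac spike delta_0
  denom <- sum(x^2) + sigma2e / (sigma2y * tau2)
  rnorm(1, mean = sum(x * y) / denom, sd = sqrt(sigma2e / denom))
}

set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)
sample_beta(y, x, cur_pi = 1, tau2 = 1, sigma2e = 0.25)  # close to 2
sample_beta(y, x, cur_pi = 0, tau2 = 1, sigma2e = 0.25)  # exactly 0
```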
<h2 id="conditional-posterior-ppi-mid-beta-theta-tau2-first-attempt">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$: First attempt</h2>
<p>Applying $d$-separation, the graph tells us that $\pi$ is independent of $\mathbf{y}$ given $\beta$:</p>
<script type="math/tex; mode=display">\pi \perp \mathbf{y} \hspace{.4em} \vert\, \beta \enspace .</script>
<p>This means we can expand in the following way:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi \mid \beta, \tau^2, \theta) &=\frac{p(\beta \mid \pi, \tau^2, \theta) \, \, p(\pi \mid \theta) \, p(\theta) \, p(\tau^2)}{p(\beta \mid \pi = 1, \tau^2, \theta) \, \, p(\pi = 1 \mid \theta) \, p(\theta) \, p(\tau^2) + p(\beta \mid \pi = 0, \tau^2, \theta) \, \, p(\pi = 0 \mid \theta) \, p(\theta) \, p(\tau^2)} \\[1em]
&=\frac{p(\beta \mid \pi, \tau^2, \theta) \, \, p(\pi \mid \theta)}{p(\beta \mid \pi = 1, \tau^2, \theta) \, \, p(\pi = 1 \mid \theta) + p(\beta \mid \pi = 0, \tau^2, \theta) \, \, p(\pi = 0 \mid \theta)}
\enspace ,
\end{aligned} %]]></script>
<p>where we could again cancel terms that were common to both the numerator and denominator. From this, it may come as a surprise that this conditional posterior should be harder than the other ones. Let’s tackle the cases where $\pi = 0$ and $\pi = 1$ in turn; the normalizing constant $Z$ is simply their sum.</p>
<p>We start with $\pi = 1$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \beta, \tau^2, \theta) &= \frac{1}{Z} \, p(\beta \mid \pi = 1, \tau^2, \theta) \, \, p(\pi = 1 \mid \theta) \\[1em]
&= \frac{1}{Z} \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta \enspace ,
\end{aligned} %]]></script>
<p>which looks perfectly reasonable. If $\pi = 0$, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 0 \mid \beta, \tau^2, \theta) &= \frac{1}{Z} \, p(\beta \mid \pi = 0, \tau^2, \theta) \, \, p(\pi = 0 \mid \theta) \\[1em]
&= \frac{1}{Z} \, \delta_0 \, (1 - \theta) \enspace ,
\end{aligned} %]]></script>
<p>which looks peculiar. To see how this bites us, we note that:</p>
<script type="math/tex; mode=display">\pi \mid \beta, \tau^2, \theta \sim \text{Bern}\left(\frac{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta}{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta + \delta_0 \, (1 - \theta)}\right) \enspace .</script>
<p>The issue with this is as follows. Remember that, in the Gibbs sampler, we sample from this conditional posterior using previous samples of $\beta$, $\tau^2$, and $\theta$ — call them $\beta^{\small{\star}}$, $\tau^{2\small{\star}}$, and $\theta^{\small{\star}}$, respectively. The previous value $\beta^{\small{\star}}$ depends on the previous sample for $\pi$, denoted $\pi^{\small{\star}}$, such that if $\pi^{\small{\star}} = 0$ then $\beta^{\small{\star}} = 0$. If this happens in the sampling process — and it will — then we have to evaluate $\delta_0\left(\beta^{\small{\star}}\right)$ which puts infinite mass on $\beta^{\small{\star}} = 0$. This means that the ratio above will become zero, resulting in a new draw for $\pi$ that is $\pi^{\small{\star}} = 0$. However, this in turn means that the new value for $\beta$ will be $\beta^{\small{\star}} = 0$, and the whole spiel repeats. The Gibbs sampler thus gets forever stuck in the region $\beta = 0$, which means that the Markov chain will not converge to the joint posterior distribution.</p>
<p>Before we go back to the drawing board, one might suggest that we could simply set $\delta_0 = 1$, and then carry out the computation needed to draw from the conditional posterior of $\pi$. This runs into the following issue, however. Let $\xi$ be the chance parameter which governs the Bernoulli from which we draw $\pi$. With $\delta_0 = 1$, we have:</p>
<script type="math/tex; mode=display">\xi = \frac{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta}{\left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2\right)\theta + (1 - \theta)} \enspace .</script>
<p>Now let us assume the previous draw of $\beta$ was $\beta^{\small{\star}} = 0$. For simplicity, let $\theta = \frac{1}{2}$ and $\sigma_y^2 = 1$. This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\xi &= \frac{\left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\tau^2} 0^2\right)\frac{1}{2}}{\left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp} \left(-\frac{1}{2\tau^2} 0^2\right)\frac{1}{2} + \frac{1}{2}} \\
&= \frac{\left(2\pi\tau^2\right)^{-\frac{1}{2}}}{\left(2\pi\tau^2\right)^{-\frac{1}{2}} + 1} \enspace ,
\end{aligned} %]]></script>
<p>which can never become zero, regardless of the data! If $\tau^2 = 1$, for example, then $\xi = 0.285$. Recall that $\tau^2$ is the variance of the prior assigned to $\beta$. The only way for $\xi$ to become zero, i.e., overwhelmingly support the model in which $\beta = 0$, is for $\tau^2$ to become very, very large. This is known as the Jeffreys-Bartlett-Lindley paradox<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>, and it makes sense: a model which assigns all possible values for $\beta$ similar plausibility will make poor predictions. If we had set $\tau^2$ by hand, then we could (artificially) achieve strong support for the null model (not that this is desirable!). However, we have assigned $\tau^2$ a prior, learning its value from data, so this will practically never happen. Thus, even though $\xi$ gets closer to zero as $\tau^2$ grows, we will effectively never find strong support for the model in which $\beta = 0$.</p>
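<p>Plugging in the numbers confirms the example above. A quick check (note that <code>pi</code> below is R’s circle constant, not the mixture weight from the text):</p>

```r
# Numeric check of the example: delta_0 set to 1, beta = 0, theta = 1/2,
# sigma2y = 1; `pi` here is R's circle constant, not the mixture weight
tau2 <- 1
xi <- (2 * pi * tau2)^(-1/2) / ((2 * pi * tau2)^(-1/2) + 1)
round(xi, 3)  # 0.285
```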
<p>In sum, we have tried two things to work with the Dirac delta function: (a) take it at face value, and (b) have it return 1 instead of <em>Inf</em>. The first approach led to our Gibbs sampler getting stuck, only sampling values $\beta = 0$. The second approach led to a situation in which we always find bounded support for the model in which $\beta = 0$, regardless of the data. From this, we can easily draw the conclusion that working with the Dirac delta function is a pain! One might therefore be tempted to suggest that we <em>stop being so discrete</em>: instead of $\delta_0$, use another Gaussian with a very small variance. This in fact solves the issue, because then instead of evaluating $\delta_0\left(\beta^{\small{\star}}\right)$, which puts infinite mass on $\beta^{\small{\star}} = 0$, we compute the density of $\beta^{\small{\star}}$ under a Gaussian distribution; even though it has small variance, it certainly will not return <em>Inf</em>. This is in fact the approach of George & McCulloch (1993), who proposed the spike-and-slab prior setup under the name <em>Stochastic Search Variable Selection</em>. Two issues remain: it may be difficult to choose this small variance in practice, and if it is very small, the Gibbs sampler will still be inefficient. Thus, we have to find another way to get rid of $\delta_0$.</p>
<!-- Yeah, we could do that. However, there are two issues. First, this would mean that we have to choose the variance of the second Gaussian distribution, indicating what "effect size" we deem negligible. This is difficult. Moreover, if the variance is very small, then the Gibbs sampler will still be inefficient. Yes, we could do Hamiltonian Monte Carlo with Stan, but this would be another blog post. The second issue is that, god damn it, sometimes you gotta do what you gotta do. Sure, we could *simplify* the problem, but do we really want to? Is that how NASA put people on the moon? How homo sapiens conquered the world coming from Africa? Do you think anybody ever got anywhere with saying "naaah, this is too hard"? So let's go back to that fucking drawing board and figure this shit out! -->
<h2 id="conditional-posterior-ppi-mid-beta-theta-tau2-second-attempt">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$: Second attempt</h2>
<p>In mathematics, it sometimes helps to write things down in a more complicated manner. In our case, we can do so by conditioning on $\mathbf{y}$ and $\sigma_e^2$, even though $\pi$ is independent of them given $\beta$. This might help because we get another likelihood term to play with. We again start with $\pi = 1$, yielding:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \mathbf{y}, \sigma_e^2, \beta, \tau^2, \theta)
&= \frac{1}{Z} \, p(\mathbf{y} \mid \sigma_e^2, \pi = 1, \tau^2, \theta, \beta) \, \, p(\beta \mid \tau^2, \pi = 1, \theta) \, p(\pi = 1 \mid \theta) \\[1em]
&= \frac{1}{Z} \,\left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta \enspace .
\end{aligned} %]]></script>
<p>The case where $\pi = 0$ yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 0 \mid \mathbf{y}, \sigma_e^2, \beta, \tau^2, \theta)
&= \frac{1}{Z} \, p(\mathbf{y} \mid \sigma_e^2, \pi = 0, \tau^2, \theta, \beta) \, \, p(\beta \mid \tau^2, \pi = 0, \theta) \, p(\pi = 0 \mid \theta) \\[1em]
&= \frac{1}{Z} \,\left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \delta_0 (1 - \theta) \enspace .
\end{aligned} %]]></script>
<p>Argh! It did not work. Observe that again $\pi$ would be drawn from a Bernoulli, but with a more complicated chance parameter $\xi$ than above:</p>
<script type="math/tex; mode=display">\pi \mid \mathbf{y}, \sigma_e^2, \beta, \tau^2, \theta \sim \text{Bern}\left(\frac{\text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta}{\text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta + \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \delta_0 (1 - \theta)}\right) \enspace ,</script>
<p>where the $\left(2\pi\sigma_e^2\right)^{-\frac{n}{2}}$ term cancels. Still, the denominator features the unholy Dirac delta function $\delta_0$ — the bane of our existence — and we run into the same issue as above.</p>
<p>Exhausted, we ask: should we not try to use a continuous spike instead of the discontinuous Dirac delta? No — let us not give up just yet! I was a bit surprised, however, by how difficult it was to find literature that talked about how to handle the Dirac spike. For example, in a review of Bayesian variable selection methods, O’Hara & Sillanpää (2009) mention the continuous but not the discontinuous spike-and-slab setting. I eventually did find a useful reference (Geweke, 1996) through the paper by George & McCulloch (1997). Motivated by the fact that this problem is indeed <em>not impossible to solve</em>, let’s get back to the drawing board!</p>
<!-- Thus, the conditional posterior probability of $\pi$ is -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- p(\pi \mid y, \sigma_e^2, \beta, \tau^2, \theta) &= \frac{\left(2\pi\sigma_e^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\tau^2} \beta^2 \right) \theta}{\left(2\pi\sigma_e^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\tau^2} \beta^2 \right) \theta + \left(2\pi\sigma_e^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \delta_0 (1 - \theta)} \\[1em] -->
<!-- &=\frac{\text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\tau^2} \beta^2 \right) \theta}{\text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\tau^2} \beta^2 \right) \theta + \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \delta_0 (1 - \theta)} \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<h2 id="conditional-posterior-ppi-mid-beta-theta-tau2-third-attempt">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$: Third attempt</h2>
<p>You may be surprised to hear that the thing that impedes Bayesian inference most is actually of great help here: <em>integration</em>! Upon reflection, this makes sense. How do we get rid of $\beta$, which itself depends on the unholy Dirac delta function? We integrate it out! Again tackling the case for which $\pi = 0$ first, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 0 \mid \mathbf{y}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, p(\mathbf{y} \mid \pi = 0, \sigma_e^2, \tau^2, \theta) \, p(\pi = 0 \mid \theta) \, p(\theta) \, p(\sigma_e^2) \, p(\tau^2) \\[1em]
&= \frac{1}{Z} \, p(\mathbf{y} \mid \pi = 0, \sigma_e^2, \tau^2, \theta) \, p(\pi = 0 \mid \theta) \\[1em]
&= \frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) (1 - \theta)\enspace ,
\end{aligned} %]]></script>
<p>where because $p(\theta)$, $p(\sigma_e^2)$, and $p(\tau^2)$ feature both in the case where $\pi = 0$ and $\pi = 1$, they can be absorbed into $Z$. For $\pi = 1$, the integration bit is a tick more involved. Using the <em>sum</em> and <em>product</em> rules of probability, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \mathbf{y}, \sigma_e^2, \tau^2, \theta) &=
\frac{1}{Z} \, \int p(\mathbf{y}, \beta \mid \pi = 1, \sigma_e^2, \tau^2, \theta) \, p(\pi = 1 \mid \theta) \, \mathrm{d}\beta \\
&= \frac{1}{Z} \, \int p(\mathbf{y} \mid \beta, \pi = 1, \sigma_e^2, \tau^2, \theta) \, p(\beta \mid \pi = 1, \sigma_e^2, \tau^2, \theta) \, p(\pi = 1 \mid \theta) \, \mathrm{d}\beta \\
&= \frac{1}{Z} \, \int \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n (y_i - \beta x_i)^2 \right) \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \theta \, \mathrm{d}\beta \enspace .
\end{aligned} %]]></script>
<p>This integrand very much looks like the expression we had for the conditional posterior of $\beta$, but unnormalized. So we already know that we will get out the normalizing constant of the conditional posterior of $\beta$, in addition to some other stuff. We put everything that does not depend on $\beta$ outside of the integral:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi = 1 \mid \mathbf{y}, \sigma_e^2, \tau^2, \theta) &=
\frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \theta \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n y_i^2 - 2 \beta \sum_{i=1}^n x_i y_i + \beta^2 \sum_{i=1}^n x_i^2 \right] \right) \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \, \mathrm{d}\beta \\[1em]
&= \frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \theta \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[-2 \beta \sum_{i=1}^n x_i y_i + \beta^2 \sum_{i=1}^n x_i^2 \right] \right) \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^2 \right) \, \mathrm{d}\beta \enspace ,
\end{aligned} %]]></script>
<p>where we now only focus on the integrand, call it $A$, because the margins of these pages are too small.<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> For the integrand, we do the exact same computation as in the derivation of the conditional posterior on $\beta$, except that when “completing the square”, we cannot cancel the term. Instead, we put it in front of the integral. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A &= \int \text{exp}\left(-\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left[\left(\beta - \frac{\sum_{i=1}^n x_i y_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 - \frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)^2} \right] \right) \, \mathrm{d}\beta \\[1em]
&= \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)^2} \right) \int \text{exp}\left(-\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left(\beta - \frac{\sum_{i=1}^n x_i y_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 \right) \, \mathrm{d}\beta \\[1em]
&= \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \int \text{exp}\left(-\frac{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}{2\sigma_e^2} \left(\beta - \frac{\sum_{i=1}^n x_i y_i}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^2 \right) \, \mathrm{d}\beta \\[1em]
&= \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(2\pi\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \enspace ,
\end{aligned} %]]></script>
<p>where the second term of the last line is the normalizing constant of the conditional posterior on $\beta$. Let $\xi$ again be the chance parameter of the Bernoulli distribution from which we draw $\pi$. Then:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
1 - \xi &= \frac{\frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) (1 - \theta)}{\frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \theta \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(2\pi\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} + \frac{1}{Z} \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n y_i^2 \right) (1 - \theta)} \\[1em]
&= \frac{(1 - \theta)}{\left(\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \theta + (1 - \theta)} \enspace .
\end{aligned} %]]></script>
<p>Note that this arduous adventure got rid of our nemesis, $\delta_0$. After this third and final attempt, we may take a short rest. Here is a visual break:</p>
<p><img src="../assets/img/Amsterdam-visual-break.jpg" /></p>
<p>In the remainder of the blog post, we will (a) implement this in R, (b) generalize it to $p > 1$ variables, and (c) apply it to some real data.</p>
<h2 id="implementation-in-r">Implementation in R</h2>
<p>The code below implements the spike-and-slab regression for $p = 1$ predictors:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Spike-and-Slab Regression using Gibbs Sampling for p = 1 predictors</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @param y: vector of responses</span><span class="w">
</span><span class="cd">#' @param x: vector of predictor values</span><span class="w">
</span><span class="cd">#' @param nr_samples: indicates number of samples drawn</span><span class="w">
</span><span class="cd">#' @param a1: parameter a1 of the Inverse-Gamma prior on the error variance sigma2e</span><span class="w">
</span><span class="cd">#' @param a2: parameter a2 of the Inverse-Gamma prior on the error variance sigma2e</span><span class="w">
</span><span class="cd">#' @param theta: initial value for the mixture weight theta</span><span class="w">
</span><span class="cd">#' @param s: scale of the Inverse-Gamma(1/2, s^2/2) prior on tau2</span><span class="w">
</span><span class="cd">#' @param a: parameter a of the Beta(a, b) prior on theta</span><span class="w">
</span><span class="cd">#' @param b: parameter b of the Beta(a, b) prior on theta</span><span class="w">
</span><span class="cd">#' @param nr_burnin: number of samples we discard ('burnin' samples)</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @returns matrix of posterior samples of the parameters pi, beta, sigma2e, tau2, theta</span><span class="w">
</span><span class="n">ss_regress_univ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w">
</span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">nr_samples</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># res is where we store the posterior samples</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'pi'</span><span class="p">,</span><span class="w"> </span><span class="s1">'beta'</span><span class="p">,</span><span class="w"> </span><span class="s1">'sigma2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'theta'</span><span class="p">)</span><span class="w">
</span><span class="c1"># take the MLE estimate as the values for the first sample</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute these quantities only once</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">var_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_xy</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># we start running the Gibbs sampler</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># first, get all the values of the previous time point</span><span class="w">
</span><span class="n">pi_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">beta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">sigma2_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">tau2_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">theta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">]</span><span class="w">
</span><span class="c1">## Start sampling from the conditional posterior distributions</span><span class="w">
</span><span class="c1">##############################################################</span><span class="w">
</span><span class="c1"># sample theta from a Beta</span><span class="w">
</span><span class="n">theta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbeta</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample sigma2e from an Inverse Gamma</span><span class="w">
</span><span class="n">sigma2_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">((</span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="o">*</span><span class="n">beta_prev</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample tau2 from an Inverse Gamma</span><span class="w">
</span><span class="n">tau2_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="o">^</span><span class="m">2</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_prev</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">var_y</span><span class="p">))</span><span class="w">
</span><span class="c1"># store this as a variable since it gets computed very often</span><span class="w">
</span><span class="n">var_comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">sigma2_new</span><span class="o">/</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample beta from a Gaussian</span><span class="w">
</span><span class="n">beta_mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_xy</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">var_comb</span><span class="w">
</span><span class="n">beta_var</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sigma2_new</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">var_comb</span><span class="w">
</span><span class="n">beta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">beta_var</span><span class="p">))</span><span class="w">
</span><span class="c1"># compute chance parameter of the conditional posterior of pi (Bernoulli)</span><span class="w">
</span><span class="n">l</span><span class="m">0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="n">l</span><span class="m">1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">theta_new</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">sum_xy</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">sigma2_new</span><span class="o">*</span><span class="n">var_comb</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">beta_var</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample pi from a Bernoulli</span><span class="w">
</span><span class="n">pi_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">l</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l</span><span class="m">0</span><span class="p">)))</span><span class="w">
</span><span class="c1"># add new samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">beta_new</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2_new</span><span class="p">,</span><span class="w"> </span><span class="n">tau2_new</span><span class="p">,</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># remove the first nr_burnin number of samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="o">-</span><span class="n">seq</span><span class="p">(</span><span class="n">nr_burnin</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<h2 id="example-application-i">Example application I</h2>
<p>Here, we simply simulate some data to see whether we can recover the coefficient.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gen_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2e</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="n">seed</span><span class="p">)</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">b</span><span class="p">)</span><span class="w">
</span><span class="n">X</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">sigma2e</span><span class="p">))</span><span class="w">
</span><span class="nf">list</span><span class="p">(</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="s1">'X'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gen_dat</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.3</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2e</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ss_regress_univ</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">samples</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## pi beta sigma2 tau2 theta
## [1,] 1 0.2906086 0.6597533 0.90971999 0.7347514
## [2,] 1 0.1211445 0.8227258 0.19379877 0.8147812
## [3,] 1 0.2482826 0.8256208 0.21308479 0.9398529
## [4,] 1 0.2698416 0.8924097 1.27511931 0.2272394
## [5,] 1 0.2569462 0.8575250 9.26546148 0.3319193
## [6,] 1 0.3302473 0.7589350 0.05923922 0.8465538</code></pre></figure>
<p>The samples for $\beta$ are from its marginal distribution, that is, from the distribution weighted by the uncertainty about each model. We can plot this model-averaged posterior:</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto;" /></p>
<p>In this case, we have two models:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{M}_0&: \mathbf{y} = \mathbf{0} \\[0.5em]
\mathcal{M}_1&: \mathbf{y} = \mathbf{0} + \beta \mathbf{x} \enspace ,
\end{aligned} %]]></script>
<p>where we, for simplicity, set the intercepts to 0. The dashed grey line indicates the posterior mean for $\beta$ conditional on the model $\mathcal{M}_1$. The dashed black line, on the other hand, indicates the posterior mean for $\beta$ where we have taken the uncertainty across models into account.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">apply</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## pi beta sigma2 tau2 theta
## 0.8420000 0.2292838 0.9456540 18.3779151 0.6181061</code></pre></figure>
<p>From this, we can also compute the posterior inclusion odds, which are $\frac{0.842}{1 - 0.842} \approx 5.33$. This means that $\mathcal{M}_1$ is about 5 times more likely than $\mathcal{M}_0$. In the short primer on Bayesian inference above, we have noted that computing posterior inclusion probabilities requires assigning a prior distribution to models. This brings with it some subtleties, and we will sketch the issue of assigning priors to models at the end of this blog post. In the next section, we generalize our spike-and-slab Gibbs sampling procedure to $p > 1$ variables.</p>
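<p>To make these summaries concrete, here is a minimal sketch of a helper that computes the inclusion probability, the inclusion odds, and the two posterior means from a matrix of samples such as the one returned by ss_regress_univ above. The helper itself is hypothetical, not part of the sampler:</p>

```r
# Sketch: summarize spike-and-slab samples (hypothetical helper);
# 'samples' is a matrix with (at least) columns 'pi' and 'beta'
ss_summary <- function(samples) {
  p_incl <- mean(samples[, 'pi'])
  list(
    incl_prob        = p_incl,                  # posterior inclusion probability
    incl_odds        = p_incl / (1 - p_incl),   # posterior inclusion odds
    beta_marginal    = mean(samples[, 'beta']), # model-averaged posterior mean
    beta_conditional = mean(samples[samples[, 'pi'] == 1, 'beta']) # mean given M1
  )
}
```

<p>Applied to the samples above, beta_marginal reproduces the dashed black line and beta_conditional the dashed grey line of the figure.</p>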
<!-- One predictor is hardly the common setting in today's high-dimensional world. Luckily, the Gibbs sampling procedure outlined above translates straightforwardly into the multivariable case. In the next section, we discuss how we have to update our conditional posterior distributions in the $p > 1$ setting. We also update the R implementation, and apply the method to a data set with $p = 15$ predictors. -->
<h2 id="allowing-p--1-predictors">Allowing $p > 1$ predictors</h2>
<p>In the case of multiple predictors, the Gibbs sampling procedure changes slightly. We use independent priors over each predictor:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\beta_i &\sim (1 - \pi_i) \, \delta_0 + \pi_i \, \mathcal{N}(0, \sigma_y^2 \tau^2) \\[0.5em]
\pi_i &\sim \text{Bern}(\theta) \\[0.5em]
\theta &\sim \text{Beta}(a, b) \\[0.5em]
\tau^2 &\sim \text{Inverse-Gamma}\left(\frac{1}{2}, \frac{s^2}{2}\right) \\[0.5em]
\sigma_e^2 &\sim \text{Inverse-Gamma}(\alpha_1, \alpha_2) \enspace ,
\end{aligned} %]]></script>
<p>for all $i \in [1, \ldots, p]$. We again set $a = b = 1$ and $\alpha_1 = \alpha_2 = 0.01$. Note that $\tau^2$ and $\theta$ are common to all predictors. Let $\mathbf{y} \in \mathbb{R}^{n \times 1}$ be an $n$-dimensional column vector; $\mathbf{X} \in \mathbb{R}^{n \times p}$ be an $n \times p$-dimensional matrix; and $\beta \in \mathbb{R}^{p \times 1}$ be a $p$-dimensional column vector. With this notation, the residual sum of squares, which was $\sum_{i=1}^n (y_i - \beta x_i)^2$ previously, becomes $(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$. Similarly, where we previously had $\beta^2$ we now have $\beta^T\beta$.</p>
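<p>As a quick sanity check on this notational switch, a small sketch (a hypothetical helper, purely illustrative) showing that the matrix form reduces to the univariable sum when $p = 1$:</p>

```r
# Sketch: residual sum of squares in matrix form, (y - X beta)^T (y - X beta)
rss <- function(y, X, beta) {
  as.numeric(crossprod(y - X %*% beta))
}
```

<p>For a single predictor, rss(y, matrix(x), b) equals sum((y - b*x)^2).</p>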
<!-- The only thing that changes in the conditional posterior distribution is that: **(a)** the conditional posterior of $\theta$ uses all $\pi_i$ as updates, not just one; **(b)** $\beta^2$ and $\sum_{i=1}^n (y_i - \beta x_i)^2$ get replaced with their vector analogues, $\beta^T\beta$ and $(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$; **(d)** the conditional posterior of $\beta$ becomes a $p$-dimensional Gaussian distribution with diagonal covariance matrix; **(e)** the conditional posterior of $\pi_i$ requires -->
<p>In the next sections, I provide the updated conditional posterior distributions and update the R code to handle $p > 1$ predictors. Compared to the univariable case, we simply have to replace scalar by vector quantities, except for the conditional posteriors on $\pi_i$ — these again require an integration trick. We tackle the conditional posteriors in turn.</p>
<h3 id="conditional-posterior-ptheta-mid-pi-1">Conditional posterior $p(\theta \mid \pi)$</h3>
<p>The conditional posterior of $\theta$ with $p$ predictors is:</p>
<script type="math/tex; mode=display">\theta \mid \pi \sim \text{Beta}\left(a + \sum_{i=1}^p \pi_i, b + \sum_{i=1}^p (1 - \pi_i) \right) \enspace .</script>
<p>Note that while the posterior mean of $\theta$ was previously bounded between $1/3$ and $2/3$, it is now bounded between $\frac{1}{2 + p}$ and $\frac{1 + p}{2 + p}$.</p>
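<p>This bound is easy to verify in a few lines of R (a hypothetical helper; with $a = b = 1$, the posterior mean of $\theta$ is $(a + \sum_{i=1}^p \pi_i) / (a + b + p)$):</p>

```r
# Sketch: posterior mean of theta given inclusion indicators pi,
# under a Beta(a, b) prior (a = b = 1 in the text)
theta_post_mean <- function(pi, a = 1, b = 1) {
  (a + sum(pi)) / (a + b + length(pi))
}
```

<p>With $p = 4$, all-zero indicators give $1/6$ and all-one indicators give $5/6$, the two bounds above.</p>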
<h3 id="conditional-posterior-ptau2-mid-beta-pi-1">Conditional posterior $p(\tau^2 \mid \beta, \pi)$</h3>
<p>We again have two cases for $\tau^2$, but they are slightly different compared to the univariable case. We sample from the prior if <em>all</em> $\pi_i$’s are zero. Let $\pi = (\pi_1, \ldots, \pi_p)$ be the vector of mixture weights, and let $\mathbf{0}$ be a vector of zeros of length $p$, then:</p>
<script type="math/tex; mode=display">\tau^2 \mid \beta , \pi \sim \text{Inverse-Gamma}\left(\frac{1}{2} + \frac{\sum_{i=1}^p \pi_i}{2}, \frac{s^2}{2} + \frac{\beta^T\beta}{2\sigma_y^2}\right) \enspace .</script>
<p>Note that $\beta_i = 0$ if $\pi_i = 0$, and that we thus sample from the prior if all $\pi_i$’s are zero.</p>
<h3 id="conditional-posterior-psigma_e2-mid-y-beta-1">Conditional posterior $p(\sigma_e^2 \mid y, \beta)$</h3>
<p>The conditional posterior on $\sigma_e^2$ changes only slightly:</p>
<script type="math/tex; mode=display">\sigma_e^2 \mid \mathbf{y}, \beta \sim \text{Inverse-Gamma}\left(\alpha_1 + \frac{n}{2}, \alpha_2 + \frac{(\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)}{2}\right) \enspace .</script>
<h3 id="conditional-posterior-pbeta-mid-y-pi-tau2-sigma_e2-1">Conditional posterior $p(\beta \mid y, \pi, \tau^2, \sigma_e^2)$</h3>
<p>We could write the prior over all $\beta_i$’s as a multivariate Gaussian with a diagonal covariance matrix. With a Gaussian likelihood, this prior is conjugate, such that the conditional posterior on the regression weights $\beta$ is a multivariate Gaussian distribution. We sketch the derivation as it may be interesting in itself. The idea is to write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2) &= \frac{1}{Z} \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) \right) \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta^T\beta\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[\mathbf{y}^T\mathbf{y} - 2\beta^T\mathbf{X}^T\mathbf{y} + \beta^T\mathbf{X}^T\mathbf{X}\beta\right] -\frac{1}{2\sigma_y^2\tau^2} \beta^T\beta\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2} \left[- 2\beta^T\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2} + \beta^T\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2}\beta + \frac{1}{\sigma_y^2\tau^2} \beta^T\beta\right]\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2} \left[\beta^T\left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right) \beta - 2\beta^T\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}\right]\right) \\[.5em]
&= \frac{1}{Z} \text{exp}\left(-\frac{1}{2} \left(\beta - \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}\right)^T \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)\left(\beta - \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}\right)\right) \enspace .
\end{aligned} %]]></script>
<p>Thus, we draw all $\beta_i$’s from:</p>
<script type="math/tex; mode=display">\beta \mid \mathbf{y}, \pi, \tau^2, \sigma_e^2 \sim
\mathcal{N}\left(\left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\mathbf{X}^T\mathbf{y}\frac{1}{\sigma_e^2}, \left(\mathbf{X}^T\mathbf{X}\frac{1}{\sigma_e^2} + \mathbf{I}\frac{1}{\sigma_y^2\tau^2}\right)^{-1}\right) \enspace ,</script>
<p>where we then set to zero those $\beta_i$’s for which $\pi_i = 0$.</p>
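<p>This draw can be sketched in R as follows (a minimal illustration using <em>mvtnorm</em>, mirroring the full implementation below; <code>var_y</code> plays the role of $\sigma_y^2$):</p>

```r
# Draw beta from its conditional posterior, a multivariate Gaussian
draw_beta <- function(y, X, sigma2e, tau2, var_y) {
  p <- ncol(X)
  prec      <- t(X) %*% X / sigma2e + diag(1 / (tau2 * var_y), p)
  beta_cov  <- solve(prec)                        # posterior covariance
  beta_mean <- beta_cov %*% t(X) %*% y / sigma2e  # posterior mean
  mvtnorm::rmvnorm(1, beta_mean, beta_cov)
}
```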
<h3 id="conditional-posterior-ppi-mid-beta-theta-tau2">Conditional posterior $p(\pi \mid \beta, \theta, \tau^2)$</h3>
<p>Because the individual $\pi_i$’s are conditionally independent given $\theta$, the update step is very similar to the univariable case. We compare the case where the $j^{\text{th}}$ element of $\beta$ is zero ($\pi_j = 0$) against the case where it is not zero ($\pi_j = 1$). The other indicator variables, call them $\pi_{-j}$, keep whatever their current sampled values are. We therefore need to compute the probability of $\pi_j = 1$ relative to $\pi_j = 0$, given the same values for $\pi_{-j}$. Let $\xi_j$ denote the probability that we sample $\pi_j = 1$, and let $\beta_{-j}$ denote the vector of regression weights without $\beta_j$, in which $\beta_i = 0$ whenever $\pi_i = 0$. We cycle through each $\pi_j$ and compute:</p>
<script type="math/tex; mode=display">\xi_j = \frac{p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta)}{p(\pi_j = 0 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) + p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta)} \enspace .</script>
<p>We then draw $\pi_j$ from a Bernoulli with chance parameter $\xi_j$; we repeat this procedure for all $j = 1, \ldots, p$ predictors. We start with the $\pi_j = 0$ case, for which $\beta_j = 0$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 0 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, p(\mathbf{y} \mid \pi_j = 0, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\beta_{-j} \mid \pi_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \, p(\theta) \, p(\tau^2) \, p(\sigma_e^2) \\[.5em]
&= \frac{1}{Z} \, p(\mathbf{y} \mid \pi_j = 0, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \\[.5em]
&= \frac{1}{Z} \, \left(2\pi\sigma_e^2\right)^{-\frac{n}{2}} \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)
\right) \prod_{i=1}^p \theta^{\pi_i}(1 - \theta)^{1 - \pi_i} \\[.5em]
&= \frac{1}{Z} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j}\beta_{-j}\right)
\right) (1 - \theta) \enspace ,
\end{aligned} %]]></script>
<p>where we have absorbed the terms that appear in the posterior for both $\pi_j = 0$ and $\pi_j = 1$ into $Z$. Note that in the expression above, the prediction uses only $p - 1$ predictor terms, some of which may currently be zero and others not, depending on the current sample. We could have written this equivalently as $\mathbf{X}\beta$ with the constraint that $\beta_j = 0$.</p>
<p>The expression for $\pi_j = 1$ requires integrating over $\beta_j$. We start with the expression that already has most of the terms in $Z$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \,p(\mathbf{y} \mid \pi_j = 1, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \\[.5em]
&= \frac{1}{Z} \, \int p(\mathbf{y}, \beta_j \mid \pi_j = 1, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) \, p(\pi \mid \theta) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, p(\pi \mid \theta) \, \int p(\mathbf{y} \mid \pi_j = 1, \pi_{-j}, \beta_{-j}, \beta_j, \sigma_e^2, \tau^2, \theta) \, p(\beta_j \mid \pi_j, \tau^2) \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, p(\pi \mid \theta) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) \right) \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \prod_{i=1}^p \theta^{\pi_i}(1 - \theta)^{1 - \pi_i} \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}}\, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}}\, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \enspace .
\end{aligned} %]]></script>
<p>To single out $\beta_j$ from $\beta$, define</p>
<script type="math/tex; mode=display">\mathbf{z} = \mathbf{y} - \mathbf{X}_{-j} \beta_{-j} \enspace ,</script>
<p>as the residuals of regressing $\mathbf{y}$ on $\mathbf{X}_{-j}$.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup> Due to linearity, and writing $x_i$ for the $i^{\text{th}}$ element of the $j^{\text{th}}$ column of $\mathbf{X}$, we can write</p>
<script type="math/tex; mode=display">\left(\mathbf{y} - \mathbf{X}\beta\right)^T\left(\mathbf{y} - \mathbf{X}\beta\right) = \sum_{i=1}^n \left(z_i - \beta_j x_i\right)^2 \enspace ,</script>
<p>such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n \left(z_i - \beta_j x_i\right)^2 -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[\sum_{i=1}^n z_i^2 - 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \sum_{i=1}^n z_i^2\right)\int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[- 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \mathbf{z}^T\mathbf{z} \right) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[- 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \\[.5em]
&= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right) \right) \, \int \text{exp}\left(-\frac{1}{2\sigma_e^2} \left[- 2 \beta_j \sum_{i=1}^n z_i x_i + \beta_j^2 \sum_{i=1}^n x_i^2 \right] -\frac{1}{2\sigma_y^2\tau^2} \beta_j^2\right) \, \mathrm{d}\beta_j \enspace ,
\end{aligned} %]]></script>
<p>which is an integration problem very similar to the univariable case. The same trick applies here: we remove all terms that do not depend on $\beta_j$ from the integral, complete the square, and find the normalizing constant of a Gaussian. In fact, the steps are exactly the same as above, except that we have $z_i$ instead of $y_i$, so we simply state the solution:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi_j = 1 \mid \mathbf{y}, \pi_{-j}, \beta_{-j}, \sigma_e^2, \tau^2, \theta) &= \frac{1}{Z} \, \theta \, \left(2\pi\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \, \text{exp}\left(-\frac{1}{2\sigma_e^2} \left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right)^T\left(\mathbf{y} - \mathbf{X}_{-j} \beta_{-j}\right) \right) \, \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i z_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(2\pi\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \enspace .
\end{aligned} %]]></script>
<p>The conditional posterior of $\pi_j$ is therefore a Bernoulli distribution, where the probability of sampling $\pi_j = 0$ is:</p>
<script type="math/tex; mode=display">1 - \xi_j = \frac{(1 - \theta)}{\left(\sigma_y^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(\frac{\left(\sum_{i=1}^n x_i z_i\right)^2}{2\sigma_e^2\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)} \right) \left(\frac{\sigma_e^2}{\left(\sum_{i=1}^n x_i^2 + \frac{\sigma_e^2}{\sigma_y^2\tau^2}\right)}\right)^{\frac{1}{2}} \theta + (1 - \theta)} \enspace ,</script>
<p>where $\mathbf{z}$ changes depending on which $\beta_j$ we are currently sampling.</p>
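<p>Because the exponential term can easily overflow, it is numerically safer to compare the two cases on the log scale, as the implementation below does; a sketch of this computation (the function name is illustrative):</p>

```r
# Compute xi_j, the chance parameter of the Bernoulli for pi_j, on the
# log scale; z holds the residuals y - X_{-j} beta_{-j}, xj the j-th column of X
compute_xi <- function(z, xj, sigma2e, tau2, var_y, theta) {
  cond_var <- sum(xj^2) + sigma2e / (tau2 * var_y)
  l0 <- log(1 - theta)
  l1 <- log(theta) - .5 * log(tau2 * var_y) +
    sum(xj * z)^2 / (2 * sigma2e * cond_var) +
    .5 * log(sigma2e / cond_var)
  1 / (1 + exp(l0 - l1))  # equals exp(l1) / (exp(l0) + exp(l1))
}
```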
<h2 id="implementation-in-r-1">Implementation in R</h2>
<p>The implementation changes only slightly:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Spike-and-Slab Regression using Gibbs Sampling for p > 1 predictors</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @param y: vector of responses</span><span class="w">
</span><span class="cd">#' @param X: matrix of predictor values</span><span class="w">
</span><span class="cd">#' @param nr_samples: indicates number of samples drawn</span><span class="w">
</span><span class="cd">#' @param a1: parameter a1 of Inverse-Gamma prior on variance sigma2e</span><span class="w">
</span><span class="cd">#' @param a2: parameter a2 of Inverse-Gamma prior on variance sigma2e</span><span class="w">
</span><span class="cd">#' @param theta: parameter of prior over mixture weight</span><span class="w">
</span><span class="cd">#' @param a: parameter a of Beta prior on mixture weight theta</span><span class="w">
</span><span class="cd">#' @param b: parameter b of Beta prior on mixture weight theta</span><span class="w">
</span><span class="cd">#' @param s: scale parameter of the prior on tau2</span><span class="w">
</span><span class="cd">#' @param nr_burnin: number of samples we discard ('burn-in' samples)</span><span class="w">
</span><span class="cd">#'</span><span class="w">
</span><span class="cd">#' @returns matrix of posterior samples from parameters pi, beta, tau2, sigma2e, theta</span><span class="w">
</span><span class="n">ss_regress</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6000</span><span class="p">,</span><span class="w"> </span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">nr_samples</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">p</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w">
</span><span class="c1"># res is where we store the posterior samples</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">p</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s1">'pi'</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)),</span><span class="w">
</span><span class="n">paste0</span><span class="p">(</span><span class="s1">'beta'</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)),</span><span class="w">
</span><span class="s1">'sigma2e'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau2'</span><span class="p">,</span><span class="w"> </span><span class="s1">'theta'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># take the MLE estimate as the values for the first sample</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">p</span><span class="p">),</span><span class="w"> </span><span class="n">coef</span><span class="p">(</span><span class="n">m</span><span class="p">),</span><span class="w"> </span><span class="n">var</span><span class="p">(</span><span class="n">predict</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute only once</span><span class="w">
</span><span class="n">XtX</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">X</span><span class="w">
</span><span class="n">Xty</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">X</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">y</span><span class="w">
</span><span class="n">var_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="n">var</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><span class="w">
</span><span class="c1"># we start running the Gibbs sampler</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># first, get all the values of the previous time point</span><span class="w">
</span><span class="n">pi_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)]</span><span class="w">
</span><span class="n">beta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">p</span><span class="p">)]</span><span class="w">
</span><span class="n">sigma2e_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">tau2_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">theta_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">res</span><span class="p">)]</span><span class="w">
</span><span class="c1">## Start sampling from the conditional posterior distributions</span><span class="w">
</span><span class="c1">##############################################################</span><span class="w">
</span><span class="c1"># sample theta from a Beta</span><span class="w">
</span><span class="n">theta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbeta</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">pi_prev</span><span class="p">),</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pi_prev</span><span class="p">))</span><span class="w">
</span><span class="c1"># sample sigma2e from an Inverse-Gamma</span><span class="w">
</span><span class="n">err</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_prev</span><span class="w">
</span><span class="n">sigma2e_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">err</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">err</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample tau2 from an Inverse Gamma</span><span class="w">
</span><span class="n">tau2_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rgamma</span><span class="p">(</span><span class="w">
</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">pi_prev</span><span class="p">),</span><span class="w">
</span><span class="n">s</span><span class="o">^</span><span class="m">2</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">beta_prev</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">beta_prev</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample beta from multivariate Gaussian</span><span class="w">
</span><span class="n">beta_cov</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">qr.solve</span><span class="p">((</span><span class="m">1</span><span class="o">/</span><span class="n">sigma2e_new</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">XtX</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">),</span><span class="w"> </span><span class="n">p</span><span class="p">))</span><span class="w">
</span><span class="n">beta_mean</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">beta_cov</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">Xty</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="n">sigma2e_new</span><span class="p">)</span><span class="w">
</span><span class="n">beta_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mvtnorm</span><span class="o">::</span><span class="n">rmvnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mean</span><span class="p">,</span><span class="w"> </span><span class="n">beta_cov</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample each pi_j in random order</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">j</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">sample</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="n">p</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># get the betas for which beta_j is zero</span><span class="w">
</span><span class="n">pi0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pi_prev</span><span class="w">
</span><span class="n">pi0</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">bp0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">beta_new</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">pi0</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute the z variables and the conditional variance</span><span class="w">
</span><span class="n">xj</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">X</span><span class="p">[,</span><span class="w"> </span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">bp0</span><span class="w">
</span><span class="n">cond_var</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="nf">sum</span><span class="p">(</span><span class="n">xj</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">sigma2e_new</span><span class="o">/</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">))</span><span class="w">
</span><span class="c1"># compute chance parameter of the conditional posterior of pi_j (Bernoulli)</span><span class="w">
</span><span class="n">l</span><span class="m">0</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="n">l</span><span class="m">1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">theta_new</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">tau2_new</span><span class="o">*</span><span class="n">var_y</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="nf">sum</span><span class="p">(</span><span class="n">xj</span><span class="o">*</span><span class="n">z</span><span class="p">)</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">sigma2e_new</span><span class="o">*</span><span class="n">cond_var</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma2e_new</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">cond_var</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># sample pi_j from a Bernoulli</span><span class="w">
</span><span class="n">pi_prev</span><span class="p">[</span><span class="n">j</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="nf">exp</span><span class="p">(</span><span class="n">l</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">l</span><span class="m">0</span><span class="p">)))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">pi_new</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">pi_prev</span><span class="w">
</span><span class="c1"># add new samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">beta_new</span><span class="o">*</span><span class="n">pi_new</span><span class="p">,</span><span class="w"> </span><span class="n">sigma2e_new</span><span class="p">,</span><span class="w"> </span><span class="n">tau2_new</span><span class="p">,</span><span class="w"> </span><span class="n">theta_new</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># remove the first nr_burnin number of samples</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="o">-</span><span class="n">seq</span><span class="p">(</span><span class="n">nr_burnin</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>We might want to run not only one Markov chain, as <em>ss_regress</em> does, but several; and we might want to run them in parallel. The following wrapper achieves this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'doParallel'</span><span class="p">)</span><span class="w">
</span><span class="n">registerDoParallel</span><span class="p">(</span><span class="n">cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">)</span><span class="w">
</span><span class="cd">#' Calls the ss_regress function in parallel</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @params same as ss_regress</span><span class="w">
</span><span class="cd">#' @params nr_cores: numeric, number of cores to run ss_regress in parallel</span><span class="w">
</span><span class="cd">#' @returns a list with nr_cores entries which are posterior samples</span><span class="w">
</span><span class="n">ss_regressm</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6000</span><span class="p">,</span><span class="w">
</span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">round</span><span class="p">(</span><span class="n">nr_samples</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="n">nr_cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">foreach</span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">nr_cores</span><span class="p">),</span><span class="w"> </span><span class="n">.combine</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rbind</span><span class="p">)</span><span class="w"> </span><span class="o">%dopar%</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">ss_regress</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">X</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">theta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theta</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">b</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">s</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">,</span><span class="w">
</span><span class="n">nr_burnin</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_burnin</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">samples</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<h2 id="example-application-ii">Example Application II</h2>
<p>We use a data set on (aggregated) attitudes of clerical employees in a large financial organization. We want to predict the overall rating based on answers to seven questions, which are our predictors:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">data</span><span class="p">(</span><span class="n">attitude</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">attitude</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## rating complaints privileges learning raises critical advance
## 1 43 51 30 39 61 92 45
## 2 63 64 51 54 63 73 47
## 3 71 70 68 69 76 86 48
## 4 61 63 45 47 54 84 35
## 5 81 78 56 66 71 83 47
## 6 43 55 49 44 54 49 34</code></pre></figure>
<p>We $z$-standardize our variables, which forces the intercept to be zero. We do this because we have, for simplicity, neglected to include an intercept in our Gibbs sampling derivations.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">std</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">sd</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">attitude_z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">attitude</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">std</span><span class="p">)</span><span class="w">
</span><span class="n">yz</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">attitude_z</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">Xz</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">attitude_z</span><span class="p">[,</span><span class="w"> </span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">samples</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ss_regressm</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">yz</span><span class="p">,</span><span class="w"> </span><span class="n">X</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Xz</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">1</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="m">2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.01</span><span class="p">,</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">s</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_cores</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4000</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">post_means</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">apply</span><span class="p">(</span><span class="n">samples</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">)</span><span class="w">
</span><span class="n">res_table</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="w">
</span><span class="n">post_means</span><span class="p">[</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'beta'</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">post_means</span><span class="p">))],</span><span class="w">
</span><span class="n">post_means</span><span class="p">[</span><span class="n">grepl</span><span class="p">(</span><span class="s1">'pi'</span><span class="p">,</span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">post_means</span><span class="p">))]</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">res_table</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colnames</span><span class="p">(</span><span class="n">Xz</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res_table</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Post. Mean'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Post. Inclusion'</span><span class="p">)</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">res_table</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Post. Mean Post. Inclusion
## complaints 0.601 0.998
## privileges -0.011 0.319
## learning 0.211 0.692
## raises 0.058 0.425
## critical 0.007 0.286
## advance -0.079 0.418</code></pre></figure>
<p>We can also visualize these results:</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-15-1.png" title="plot of chunk unnamed-chunk-15" alt="plot of chunk unnamed-chunk-15" style="display: block; margin: auto;" /></p>
<p>The only predictor we are virtually certain to include is <em>complaints</em>. For the other predictors, considerable uncertainty remains as to whether or not they are associated with the outcome.</p>
<p>As an aside, there are also options other than specifying independent priors over the $\beta$’s, which is what we have done in our setup. The most popular prior specification is based on Zellner’s (1986) $g$-prior:</p>
<script type="math/tex; mode=display">\beta \mid g \sim \mathcal{N}\left(0, g \, \sigma_y^2 \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\right) \enspace ,</script>
<p>where $g = \tau^2$ in our terminology and which does not have a diagonal covariance matrix but one that is scaled by $\left(\mathbf{X}^T\mathbf{X}\right)^{-1}$. Liang et al. (2008) propose various ways to deal with $g$. One of them, as discussed in this blog post, is to assign $g$ an inverse Gamma distribution, which leads to a (multivariate) marginal Cauchy distribution on $\beta$. Som, Hans, & MacEachern (2016) point out an interesting problem that may arise when using, as we have done in this blog post, a single global $g$ or $\tau^2$ parameter. Li & Clyde (2018) unify various approaches in a general framework that extends to generalized linear models.<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup> In the next section, I briefly sketch some subtleties in assigning a prior to models.</p>
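<p>To make the contrast with our independent prior concrete, here is a small sketch of my own (not part of the original analysis) computing the $g$-prior covariance matrix on the standardized <em>attitude</em> predictors, setting $\sigma_y^2 = 1$ for simplicity:</p>

```r
# Covariance of Zellner's g-prior, g * sigma_y^2 * (X'X)^{-1}, illustrated
# on the standardized attitude predictors (sigma_y^2 set to 1 for simplicity)
data(attitude)
std <- function(x) (x - mean(x)) / sd(x)
Xz <- apply(attitude[, -1], 2, std)
g <- 1
Sigma_g <- g * 1 * solve(t(Xz) %*% Xz)
# unlike our independent prior, the off-diagonal entries are non-zero
dim(Sigma_g)  # 6 6
```

<p>Because the predictors are correlated, the resulting prior covariance has non-zero off-diagonal entries, in contrast to the diagonal covariance implied by independent priors.</p>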
<h2 id="prior-on-models">Prior on Models</h2>
<p>We have seen that the Gibbs sampler with spike-and-slab priors can yield model-averaged parameter estimates as well as posterior inclusion probabilities. However, in the first section of this blog post, I have pointed out that this is only possible once we assign priors to models. Have we done so? Yes, albeit implicitly. We have $2^p$ possible models, where a model simply indexes which of the $\pi_i$’s equal 1 and which equal 0. For example, the model with zero predictors has $\pi = \mathbf{0}$, whereas the model which includes all predictors has $\pi = \mathbf{1}$. Thus, a prior assigned to $\pi_i$ constitutes a prior assigned to models. The independent spike-and-slab prior specification described above yields:</p>
<script type="math/tex; mode=display">\begin{aligned}
p(\pi) = \int \prod_{i=1}^p \theta^{\pi_i} (1 - \theta)^{1 - \pi_i} \, p(\theta) \, \mathrm{d}\theta \enspace .
\end{aligned}</script>
<p>In the next two sections, we will discuss the implications of different choices for $p(\theta)$.</p>
<h3 id="uniform-on-models-non-uniform-on-model-size">Uniform on Models, Non-uniform on Model Size</h3>
<p>Let’s focus on the special case $\theta = \frac{1}{2}$ for a moment. This yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi) &= \prod_{i=1}^p \left(\frac{1}{2}\right)^{\pi_i} \left(1 - \frac{1}{2}\right)^{1 - \pi_i} \\[.5em]
&= \left(\frac{1}{2}\right)^{\sum_{i=1}^p \pi_i} \left(\frac{1}{2}\right)^{p - \sum_{i=1}^p \pi_i} \\[.5em]
&= \frac{1}{2^p} \enspace ,
\end{aligned} %]]></script>
<p>the uniform prior over all models. It may be surprising to hear that this uniform prior over models induces a non-uniform prior on <em>model size</em>. To see this, let’s introduce the new random variable $K = \sum_{i=1}^p \pi_i$, which counts the number of active predictors and thus constitutes the <em>size</em> of a model. Now that we focus on $K$ instead of the individual $\pi_i$’s, we do not care which particular $\pi_i$’s are non-zero, but only how many of them are. As a result, there are ${p \choose k}$ possible ways of obtaining $K = k$ active predictors, and the prior distribution assigned to $K$ becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(K) &= {p \choose k} \left(\frac{1}{2}\right)^{k} \left(\frac{1}{2}\right)^{p - k} \\[.5em]
&= {p \choose k} \frac{1}{2^p} \enspace ,
\end{aligned} %]]></script>
<p>which is a Binomial distribution with $\theta = \frac{1}{2}$, encoding the prior expectation that half of the predictor variables will be included. To further see that a uniform prior over models leads to a non-uniform prior over model size, assume that we have $p = 2$ predictors and thus $m = 2^2 = 4$ models. The uniform prior on models assigns a probability of $\frac{1}{4}$ to each of the models coded in terms of $\pi$ as $[(0, 0), (1, 0), (0, 1), (1, 1)]$. However, there is only ${2 \choose 0} = {2 \choose 2} = 1$ way to get a model that includes zero or both predictors, while there are ${2 \choose 1} = 2$ ways to get a model that includes one predictor. Thus, models of size one (i.e., models that include either $\beta_1$ or $\beta_2$) get assigned <em>twice</em> as much probability mass as models that include zero or both predictors; for a visual illustration, see the figure below.</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-18-1.png" title="plot of chunk unnamed-chunk-18" alt="plot of chunk unnamed-chunk-18" style="display: block; margin: auto;" /></p>
<h3 id="uniform-on-model-size-non-uniform-on-models">Uniform on Model Size, Non-uniform on Models</h3>
<p>We may be uncomfortable with the prior expectation that half of the variables are included, i.e. that $\theta = \frac{1}{2}$. In our spike-and-slab prior specification above, we have instead assigned $\theta$ a Beta prior. This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\pi) &= \int \prod_{i=1}^m \theta^{\pi_i} (1 - \theta)^{1 - \pi_i} \, \frac{1}{\text{B}(a, b)} \theta^{a - 1}(1 - \theta)^{b - 1}\, \mathrm{d}\theta \\[.5em]
&= \frac{1}{\text{B}(a, b)} \int \theta^{\sum_{i=1}^p \pi_i + a - 1} (1 - \theta)^{\sum_{i=1}^p (1 - \pi_i) + b - 1} \, \mathrm{d}\theta \\[.5em]
&= \frac{\text{B}\left(a + \sum_{i=1}^p \pi_i, b + \sum_{i=1}^p (1 - \pi_i)\right)}{\text{B}(a, b)} \\[.5em]
&= \frac{\text{B}\left(a + \sum_{i=1}^p \pi_i, b + p - \sum_{i=1}^p \pi_i\right)}{\text{B}(a, b)} \enspace ,
\end{aligned} %]]></script>
<p>where we have recognized the integrand as the kernel of a Beta distribution.</p>
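<p>As a quick sanity check (a sketch of my own, not part of the original derivation), we can enumerate all $2^p$ inclusion vectors for a small $p$ and verify that this marginal prior over models sums to one:</p>

```r
# Marginal prior p(pi) under a Beta(a, b) prior on theta: enumerate all
# 2^p inclusion vectors and check that their probabilities sum to one
a <- 1; b <- 1; p <- 3
pis <- expand.grid(rep(list(0:1), p))  # all 2^p models
k <- rowSums(pis)                      # number of active predictors per model
prior_pi <- beta(a + k, b + p - k) / beta(a, b)
sum(prior_pi)  # 1
```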
<p>We can again study the implied prior on model size. Using the same intuition as above, the distribution assigned to $K$ becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(K = k) &= {p \choose k} \, \frac{\text{B}(a + k, b + p - k)}{\text{B}(a, b)} \enspace ,
\end{aligned} %]]></script>
<p>which is not a Binomial but a <em>Beta-binomial</em> distribution. Assuming again that we have $p = 2$ predictors and thus $m = 2^2 = 4$ models, and that $a = b = 1$ as above, this setup induces a uniform distribution over $K$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">dbetabin</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">p</span><span class="p">,</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">a</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">k</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">p</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">k</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">a</span><span class="p">,</span><span class="w"> </span><span class="n">b</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dbetabin</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 0.3333333 0.3333333 0.3333333</code></pre></figure>
<p>Conversely, this implies a non-uniform prior over models. In particular, this prior setup assigns more mass to extremely sparse and extremely dense models. To see this, note again that there is only ${2 \choose 0} = {2 \choose 2} = 1$ way to get a model that includes zero or both predictors, while there are ${2 \choose 1} = 2$ ways to get a model that includes one predictor. Thus, models of size one (i.e., models that include either $\beta_1$ or $\beta_2$) get assigned only <em>half</em> as much probability mass as models that include zero or both predictors; for a visual illustration, see the figure below.</p>
<p><img src="/assets/img/2019-03-31-Spike-and-Slab.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto;" /></p>
<p>Especially with a large number of predictors, we might be wary of the assumption that the model which includes no predictors and the model which includes all predictors are the most likely models a priori.<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup> We can think of priors assigned to models and model size as formalizing how <em>sparse</em> we think the part of the world we are modeling is. The wonderful thing about using Bayesian statistics to quantify uncertainty is that these assumptions are out in the open. This by itself does not imply, however, that variable selection ceases to be a difficult and nuanced problem.</p>
<h1 id="conclusion">Conclusion</h1>
<p>If you have stayed with me until the bitter end, awesome! We have covered a lot in this blog post. In particular, we have tackled the problem of variable selection using a Bayesian approach which allowed us to quantify and incorporate uncertainty about parameters as well as models. We have focused on linear regression with spike-and-slab priors and derived a Gibbs sampler for the single and multiple predictor case. Applying this to simulated and real data, we have seen how this leads to model-averaged parameter estimates, as well as uncertainty estimates about whether or not to include a particular predictor variable. Lastly, we have discussed the nuances of assigning priors to models. If you want to read up on any of these topics, I encourage you to check out the references below. Otherwise, hope to see you next month!</p>
<hr />
<p><em>I would like to thank Don van den Bergh, Max Hinne, and Maarten Marsman for discussions about the Gibbs sampler, and Sophia Crüwell for comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Lindley, Dennis (<a href="https://www.amazon.com/Making-Decisions-2nd-Dennis-Lindley/dp/0471908088">1991</a>). <em>Making Decisions (2 ed.)</em>. New Jersey, US: Wiley.</li>
<li>George, E. I. (<a href="https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2000.10474336">2000</a>). The Variable Selection Problem. <em>Journal of the American Statistical Association, 95</em>(452), 1304-1308.</li>
<li>Clyde, M., & George, E. I. (<a href="https://bit.ly/2uzS91Q">2004</a>). Model uncertainty. <em>Statistical Science, 19</em>(1), 81-94.</li>
<li>Hinne, M., Gronau, Q. F., van den Bergh, D., & Wagenmakers, E. J. (<a href="https://psyarxiv.com/wgb64/">2019</a>). A conceptual introduction to Bayesian Model Averaging. doi: 10.31234/osf.io/wgb64.</li>
<li>Robert, C., & Casella, G. (<a href="https://www.jstor.org/stable/23059158">2011</a>). A short history of Markov chain Monte Carlo: Subjective recollections from incomplete data. <em>Statistical Science, 26</em>(1), 102-115.</li>
<li>McElreath, R. (<a href="https://xcelab.net/rm/statistical-rethinking/">2015</a>). <em>Statistical Rethinking: A Bayesian course with examples in R and Stan</em>. London, UK: Chapman and Hall/CRC.</li>
<li>Matthews, R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9639.00013">2001</a>). Storks deliver babies (p = 0.008). <em>Teaching Statistics, 22</em>(2), 36-38.</li>
<li>Dawid, A. P. (<a href="http://www.jmlr.org/proceedings/papers/v6/dawid10a/dawid10a.pdf">2010</a>). Beware of the DAG! In <em>Proceedings of the NIPS 2008 Workshop on Causality. Journal of Machine Learning Research Workshop and Conference Proceedings, (6)</em> 59–86.</li>
<li>Dawid, A. P. (<a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1979.tb01052.x">1979</a>). Conditional independence in statistical theory. <em>Journal of the Royal Statistical Society: Series B (Methodological), 41</em>(1), 1-15.</li>
<li>Casella, G., & George, E. I. (<a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1992.10475878">1992</a>). Explaining the Gibbs sampler. <em>The American Statistician, 46</em>(3), 167-174.</li>
<li>George, E. I., & McCulloch, R. E. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1993.10476353">1993</a>). Variable selection via Gibbs Sampling. <em>Journal of the American Statistical Association, 88</em>(423), 881-889.</li>
<li>O’Hara, R. B., & Sillanpää, M. J. (<a href="https://projecteuclid.org/euclid.ba/1340370391">2009</a>). A review of Bayesian variable selection methods: what, how and which. <em>Bayesian Analysis, 4</em>(1), 85-117.</li>
<li>George, E. I., & McCulloch, R. E. (<a href="https://www.jstor.org/stable/24306083">1997</a>). Approaches for Bayesian variable selection. <em>Statistica Sinica, 7</em>(2), 339-373.</li>
<li>Geweke, J. (<a href="https://bit.ly/2Oy5wIV">1994</a>). Variable selection and model comparison in regression. In <em>Bayesian Statistics 5: Proceedings of the 5<sup>th</sup> Valencia International Meeting</em>, 1-30.</li>
<li>Zellner, A. (1986). On Assessing Prior Distributions and Bayesian Regression Analysis With <em>g</em>-Prior Distributions. In <em>Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti</em>, 233-243. The Netherlands, Amsterdam: Elsevier.</li>
<li>Liang, F., Paulo, R., Molina, G., Clyde, M. A., & Berger, J. O. (<a href="https://amstat.tandfonline.com/doi/abs/10.1198/016214507000001337">2008</a>). Mixtures of <em>g</em>-priors for Bayesian variable selection. <em>Journal of the American Statistical Association, 103</em>(481), 410-423.</li>
<li>Som, A., Hans, C. M., & MacEachern, S. N. (<a href="https://academic.oup.com/biomet/article-abstract/103/4/993/2659028">2016</a>). A conditional Lindley paradox in Bayesian linear models. <em>Biometrika, 103</em>(4), 993-999.</li>
<li>Li, Y., & Clyde, M. A. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2018.1469992">2018</a>). Mixtures of <em>g</em>-priors in generalized linear models. <em>Journal of the American Statistical Association, 113</em>(524), 1828-1845.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<hr />
<div class="footnotes">
<ol>
<li id="fn:1">
<p>For a very concise overview of variable selection, see George (<a href="https://amstat.tandfonline.com/doi/abs/10.1080/01621459.2000.10474336">2000</a>). For a good overview of model uncertainty, see Clyde & George (<a href="https://bit.ly/2uzS91Q">2004</a>). For a conceptual introduction to model-averaging, see Hinne, Gronau, van den Bergh, & Wagenmakers (2019). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>For mathematical details, see for example Casella & George (<a href="https://www.tandfonline.com/doi/abs/10.1080/00031305.1992.10475878">1992</a>). <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>Although I usually try to provide some historical context, this blog post is already quite long. To keep it short, and if you are interested, I recommend you read Robert & Casella (<a href="https://www.jstor.org/stable/23059158?casa_token=vsTr22q7O4sAAAAA:Z-8SrJZeH-pGcKO0uiNArdtQyQhIKLK8BzO4KQ5dDkeuqlR_oBZ5fRVbwpuBwA_SQJ5XANs5NRugrB1QnsMYpMaHovzzvYhoXOsLF7q8qxYrHnIJ7TQ&seq=1#metadata_info_tab_contents">2011</a>). <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>The actual symbol for conditional independence, introduced by Dawid (<a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1979.tb01052.x">1979</a>), differs from $\perp$ in that it has two vertical lines. However, MathJax does not have the correct symbol in its library. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>Dennis Lindley also has a second paradox named after him, see <a href="https://www.bayesianspectacles.org/dennis-lindleys-second-paradox/">here</a> — which is a little tongue in cheek. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Two things. First, you really shouldn’t read these blog posts on your phone! Second, margins that are too small might remind you of a remark by Fermat, who used exactly this justification for not giving a proof of his famous last theorem. I recently read an absolutely captivating book about Fermat’s last theorem which might interest you; see <a href="https://www.goodreads.com/book/show/38412.Fermat_s_Enigma">here</a>. <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>It is called <em>regressing $\mathbf{y}$ on $\mathbf{X}$</em> <a href="https://stats.stackexchange.com/questions/207425/why-do-we-say-the-outcome-variable-is-regressed-on-the-predictors">because we project the response on the predictors</a>. <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>The regression implementation in the <a href="https://richarddmorey.github.io/BayesFactor/">BayesFactor</a> R package is based on the model selection approach discussed in Liang et al. (2008), while the <a href="https://merliseclyde.github.io/BAS/">BAS</a> R package and <a href="https://jasp-stats.org/">JASP</a> use the framework described in Li & Clyde (2018). You might find it insightful to compare the analysis results we have gotten here with the results when using these packages. See <a href="https://gist.github.com/fdabl/58e9a7d27623ec545cc3d1d5fc3dc600">this</a> gist for a comparison. <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>It is generally unlikely that there are many large effects; Gelman uses what he calls the <a href="https://statmodeling.stat.columbia.edu/2017/12/15/piranha-problem-social-psychology-behavioral-economics-button-pushing-model-science-eats/"><em>Piranha argument</em></a> to justify this claim: if there were many large effects, then they would interfere with each other. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian Dablander“Which variables are important?” is a key question in science and statistics. In this blog post, I focus on linear models and discuss a Bayesian solution to this problem using spike-and-slab priors and the Gibbs sampler, a computational method to sample from a joint distribution using only conditional distributions.Two properties of the Gaussian distribution2019-02-28T10:30:00+00:002019-02-28T10:30:00+00:00https://fabiandablander.com/statistics/Two-Properties<!-- In a previous blog post, we talked about the method of least squares, a development in statistics Stigler deems as important as calculus for mathematics. -->
<p>In a <a href="https://fdabl.github.io/statistics/Curve-Fitting-Gaussian.html">previous</a> blog post, we looked at the history of least squares, how Gauss justified it using the Gaussian distribution, and how Laplace justified the Gaussian distribution using the central limit theorem. The Gaussian distribution has a number of special properties which distinguish it from other distributions and which make it easy to work with mathematically. In this blog post, I will focus on two of these properties: being closed under (a) <em>marginalization</em> and (b) <em>conditioning</em>. This means that, if one starts with a $p$-dimensional Gaussian distribution and marginalizes out or conditions on one or more of its components, the resulting distribution will still be Gaussian.</p>
<p>This blog post has two parts. First, I will introduce the joint, marginal, and conditional Gaussian distributions for the case of two random variables; an interactive Shiny app illustrates the differences between them. Second, I will show mathematically that the marginal and conditional distribution do indeed have the form I presented in the first part. I will extend this to the $p$-dimensional case, demonstrating that the Gaussian distribution is closed under marginalization and conditioning. This second part is a little heavier on the mathematics, so if you just want to get an intuition you may focus on the first part and simply skip the second part. Let’s get started!</p>
<!-- The figure below shows the *contour lines* of a bivariate Gaussian distribution in blue. This distribution assigns each configuration of the two random variables $X_1$ and $X_2$, i.e., $(x_1, x_2)$, a density. We see that it is somewhat elliptic, which indicates a positive correlation between the variables $X_1$ and $X_2$; therefore, knowing $X_1$ tells us something about $X_2$. If we ignore this information and look at the *marginal* distribution of $X_1$ (the purple line), it looks like a perfectly normal distribution. If we incorporate or *condition* on the information that, in this case, $X_1 = 1.8$, however, we get the conditional distribution (black line). -->
<h1 id="the-land-of-the-gaussians">The Land of the Gaussians</h1>
<p>In the linear regression case discussed <a href="https://fdabl.github.io/statistics/Curve-Fitting-Gaussian.html">previously</a>, we have modeled each individual data point $y_i$ as coming from a <em>univariate conditional</em> Gaussian distribution with mean $\mu = x_i^Tb$ and variance $\sigma^2$. In this blog post, we introduce the random variables $X_1$ and $X_2$ and assume that both are <em>jointly</em> normally distributed; we are going from $p = 1$ to $p = 2$ dimensions. The probability density function changes accordingly — it becomes a function mapping from two to one dimension, i.e., $f: \mathbb{R}^2 \rightarrow \mathbb{R}^+$.</p>
<p>To simplify notation, let $\mathbf{x} = (x_1, x_2)^T$ and $\mathbf{\mu} = (\mu_1, \mu_2)^T$ be two 2-dimensional vectors denoting one observation and the population means, respectively. For simplicity, we set the population means to zero, i.e. $\mathbf{\mu} = (0, 0)$. In one dimension, we had just one parameter for the variance $\sigma^2$; in two dimensions, this becomes a symmetric $2 \times 2$ covariance matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma = \begin{pmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\
\rho \sigma_1\sigma_2 & \sigma_2^2
\end{pmatrix} \enspace , %]]></script>
<p>where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the random variables $X_1$ and $X_2$, respectively, and $\rho$ is the population correlation between the two. The general form of the density function of a $p$-dimensional Gaussian distribution is</p>
<script type="math/tex; mode=display">f(\mathbf{x} \mid \mathbf{\mu}, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp \left(-\frac{1}{2} (\mathbf{x} - \mathbf{\mu})^T \Sigma^{-1}(\mathbf{x} - \mathbf{\mu}) \right) \enspace ,</script>
<p>where $\mathbf{x}$ and $\mathbf{\mu}$ are $p$-dimensional vectors, $\Sigma^{-1}$ is the $(p \times p)$-dimensional inverse covariance matrix and $|\Sigma|$ is its determinant.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> We focus on the simpler 2-dimensional, zero-mean case. Observe that</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma^{-1} = \frac{1}{|\Sigma|} \begin{pmatrix}\sigma^2_2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma^2_1\end{pmatrix} = \frac{1}{\sigma^2_1 \sigma^2_2(1 - \rho^2)} \begin{pmatrix}\sigma^2_2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma^2_1\end{pmatrix} \enspace , %]]></script>
<p>which we use to expand the bivariate Gaussian density function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) &= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^T \begin{pmatrix}\sigma^2_2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma^2_1\end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) \\[1em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \begin{pmatrix} x_1 \sigma^2_2 -x_2\rho \sigma_1 \sigma_2 \\ x_2 \sigma^2_1 -x_1\rho \sigma_1 \sigma_2 \end{pmatrix}^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \right) \\[1em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] \right) \enspace .
\end{aligned} %]]></script>
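<p>As a quick numerical sanity check of this expansion, we can evaluate the density both via the inverse covariance matrix and via the expanded quadratic form, and confirm the two agree. A minimal sketch (here in Python; the parameter values are arbitrary illustrations):</p>

```python
import math

def density_matrix_form(x1, x2, s1, s2, rho):
    # f(x) = (2*pi)^(-1) * |Sigma|^(-1/2) * exp(-0.5 * x^T Sigma^{-1} x)
    det = s1**2 * s2**2 * (1 - rho**2)
    # entries of Sigma^{-1}, written out for the 2x2 case
    a, b, d = s2**2 / det, -rho * s1 * s2 / det, s1**2 / det
    quad = a * x1**2 + 2 * b * x1 * x2 + d * x2**2
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def density_expanded(x1, x2, s1, s2, rho):
    # the expanded form derived above
    det = s1**2 * s2**2 * (1 - rho**2)
    quad = (s2**2 * x1**2 - 2 * rho * s1 * s2 * x1 * x2 + s1**2 * x2**2) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))
```

<p>The two functions agree up to floating-point rounding for any inputs, mirroring the algebra above.</p>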
<p>The figure below plots the <em>contour lines</em> of nine different bivariate normal distributions with mean zero, correlations $\rho \in \{0, -0.3, 0.7\}$, and standard deviations $\sigma_1, \sigma_2 \in \{1, 2\}$.</p>
<p><img src="/assets/img/2019-02-28-Two-Properties.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In the top row, all bivariate Gaussian distributions have $\rho = 0$ and look like a circle for standard deviations of equal size. The top middle plot is stretched along $X_2$, giving it an elliptical shape. The middle and last row show how the distribution changes for negative ($\rho = -0.3$) and positive ($\rho = 0.7$) correlations.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<p>In the remainder of this blog post, we will take a closer look at two operations: marginalization and conditioning. Marginalizing means ignoring, and conditioning means incorporating information. In the zero-mean bivariate case, marginalizing out $X_2$ results in</p>
<script type="math/tex; mode=display">f(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \text{exp} \left(-\frac{1}{2\sigma_1^2} x_1^2\right) \enspace ,</script>
<p>which is a simple univariate Gaussian distribution with mean $0$ and variance $\sigma_1^2$. On the other hand, incorporating the information that $X_2 = x_2$ results in</p>
<script type="math/tex; mode=display">f(x_1 \mid x_2) = \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \text{exp} \left(-\frac{1}{2\sigma_1^2(1 - \rho^2)} \left(x_1 - \rho \frac{\sigma_1}{\sigma_2} x_2\right)^2\right) \enspace ,</script>
<p>which has mean $\rho \frac{\sigma_1}{\sigma_2} x_2$ and variance $\sigma_1^2 (1 - \rho^2)$. The next section provides two simple examples illustrating the difference between these two types of distributions, as well as a simple Shiny app that allows you to build an intuition for conditioning in the bivariate case.</p>
<h1 id="two-examples-and-a-shiny-app">Two examples and a Shiny app</h1>
<p>Let’s illustrate the difference between marginalization and conditioning with two simple examples. First, assume that the correlation is very high, $\rho = 0.8$, and that $\sigma_1^2 = \sigma_2^2 = 1$. Then, observing for example $X_2 = 2$, our belief about $X_1$ changes such that its mean is shifted toward the observed value: $\rho \, x_2 = 0.8 \cdot 2 = 1.6$ (indicated by the dotted line in the Figure below). The variance of $X_1$ is substantially reduced, from $1$ to $(1 - 0.8^2) = 0.36$. This is what the left part of the Figure below illustrates. If, on the other hand, $\rho = 0$ such that $X_1$ and $X_2$ are unrelated, then observing $X_2 = 2$ changes neither the mean of $X_1$ (it stays at zero) nor its variance (it stays at $1$); see the right part of the Figure below. Note that the marginal and conditional densities are multiplied by a constant to make them more visible.</p>
<p><img src="/assets/img/2019-02-28-Two-Properties.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
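<p>The numbers in this example follow directly from the conditional mean and variance formulas given above; a minimal helper (sketched in Python) reproduces them:</p>

```python
def conditional_params(rho, s1, s2, x2):
    """Mean and variance of X1 given X2 = x2 for a zero-mean bivariate Gaussian."""
    mean = rho * (s1 / s2) * x2
    var = s1**2 * (1 - rho**2)
    return mean, var

# The example from the text: rho = 0.8, unit variances, observing X2 = 2
# gives a conditional mean of 1.6 and a conditional variance of 0.36.
print(conditional_params(0.8, 1.0, 1.0, 2.0))
# With rho = 0, observing X2 changes nothing: mean 0, variance 1.
print(conditional_params(0.0, 1.0, 1.0, 2.0))
```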
<p>To explore the relation between joint, marginal, and conditional Gaussian distributions, you can play around with a Shiny app following <a href="https://fdabl.shinyapps.io/two-properties/">this</a> link. In the remainder of the blog post, we will prove that the two distributions given above are in fact the marginal and conditional distributions in the two-dimensional case. We will also generalize these results to $p$-dimensional Gaussian distributions.</p>
<h1 id="the-two-rules-of-probability">The two rules of probability</h1>
<p>In the second part of this blog post, we need the two fundamental ‘rules’ of probability: the sum and the product rule.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup> The sum rule states that</p>
<script type="math/tex; mode=display">p(x) = \int p(x, y) \, \mathrm{d}y \enspace ,</script>
<p>and the product rule states that</p>
<script type="math/tex; mode=display">p(x, y) = p(x \mid y) \, p(y) = p(y \mid x) \, p(x) \enspace .</script>
<p>In the remainder, we will see that a joint Gaussian distribution can be factorized into a conditional Gaussian and a marginal Gaussian distribution.</p>
<h1 id="property-i-closed-under-marginalization">Property I: Closed under Marginalization</h1>
<p>The first property states that if we <em>marginalize out</em> variables in a multivariate Gaussian distribution, the result is still a Gaussian distribution. The Gaussian distribution is thus <em>closed under marginalization</em>. Below, I will show this for a bivariate Gaussian distribution directly, and for Gaussian distributions of arbitrary dimension by thinking rather than computing. This illustrates that knowing your definitions can help avoid tedious calculations.</p>
<h2 id="2-dimensional-case">2-dimensional case</h2>
<p>To show that the marginalization property holds for the bivariate Gaussian distribution, we need to solve the following integration problem</p>
<script type="math/tex; mode=display">\int_{X_2} f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) \, \mathrm{d} x_2 \enspace ,</script>
<p>and check whether the result is a univariate Gaussian distribution. We tackle the problem head on and expand</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\int_{X_2} \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] \right) \mathrm{d} x_2 \\[0.5em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[\sigma_1^2 x_2^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2\bigg] \right) \mathrm{d} x_2 \\[1em]
&= \frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \bigg[x_2^2 - 2\rho \frac{\sigma_2}{\sigma_1} x_1 x_2\bigg] \right) \mathrm{d} x_2 \enspace .
\end{aligned} %]]></script>
<p>Putting everything that does not involve $x_2$ outside the integral, we’ve come quite far! Note that we can “complete the square”, that is, write</p>
<script type="math/tex; mode=display">x_2^2 - 2\rho\frac{\sigma_2}{\sigma_1} x_1 x_2 = \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2 \enspace .</script>
<p>This leads to</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \bigg[\left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2\bigg] \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \sigma_2^2 x_1^2 + \frac{1}{2\sigma^2_2(1 - \rho^2)} \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2 \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1(1 - \rho^2)} x_1^2 + \frac{1}{2\sigma^2_1(1 - \rho^2)} \rho^2 x_1^2 \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{x_1^2 - \rho^2 x_1^2}{2\sigma^2_1(1 - \rho^2)} \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{x_1^2 (1 - \rho^2)}{2\sigma^2_1(1 - \rho^2)} \right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \\[1em]
=&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1} x_1^2\right) \int_{X_2} \exp \left(-\frac{1}{2\sigma^2_2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 \right) \mathrm{d} x_2 \enspace .
\end{aligned} %]]></script>
<p>We are nearly done! What’s left is to realize that the integrand is the <em>kernel</em> of a univariate Gaussian distribution with mean $\rho \frac{\sigma_2}{\sigma_1} x_1$ and variance $\sigma_2^2 (1 - \rho^2)$ — it’s an unnormalized <em>conditional</em> Gaussian distribution! The thing that makes a Gaussian distribution integrate to 1, as all distributions must, is the normalizing constant in front, the strange term involving $\pi$. For this particular distribution, the normalizing constant is $\sqrt{2\pi \sigma_2^2 (1 - \rho^2)}$.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
<p>Continuing, we arrive at the solution</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1} x_1^2\right) \sqrt{2\pi \sigma_2^2 (1 - \rho^2)} \\[0.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp \left(-\frac{1}{2\sigma^2_1} x_1^2\right) \enspace ,
\end{aligned} %]]></script>
<p>which is the density function of the univariate Gaussian distribution (with mean zero). With some work, we have shown that marginalizing out a variable in a bivariate Gaussian distribution leads to a univariate Gaussian distribution. This process ‘removes’ any occurrences of the correlation $\rho$ and the other variable $x_2$.</p>
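<p>We can also corroborate this result numerically: integrating the bivariate density over $x_2$ by brute force (a midpoint Riemann sum) should recover the univariate density of $X_1$. A minimal sketch (here in Python, with arbitrary parameter values):</p>

```python
import math

def bvn_density(x1, x2, s1, s2, rho):
    # zero-mean bivariate Gaussian density in the expanded form used above
    det = s1**2 * s2**2 * (1 - rho**2)
    quad = (s2**2 * x1**2 - 2 * rho * s1 * s2 * x1 * x2 + s1**2 * x2**2) / det
    return math.exp(-0.5 * quad) / (2 * math.pi * math.sqrt(det))

def marginal_by_integration(x1, s1, s2, rho, lim=10.0, n=20000):
    # midpoint Riemann sum over x2 on [-lim, lim]
    h = 2 * lim / n
    return h * sum(bvn_density(x1, -lim + (i + 0.5) * h, s1, s2, rho)
                   for i in range(n))

def univariate_density(x1, s1):
    # the closed-form marginal N(0, s1^2) density
    return math.exp(-x1**2 / (2 * s1**2)) / math.sqrt(2 * math.pi * s1**2)
```

<p>For, say, $x_1 = 0.7$, $\sigma_1 = 1$, $\sigma_2 = 2$, and $\rho = 0.5$, the numerical integral matches the closed-form univariate density up to quadrature error.</p>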
<p>Granted, this process was rather tedious and not at all general (but good practice!) — does it also work when going from 3 to 2 dimensions? Will the remaining bivariate distribution be Gaussian? What if we go from 200 dimensions to 97 dimensions?</p>
<h2 id="p-dimensional-case">$p$-dimensional case</h2>
<p>A more elegant way to see that a $p$-dimensional Gaussian distribution is closed under marginalization is the following. First, we note the requirement that a random variable needs to fulfill in order to have a (multivariate) Gaussian distribution.</p>
<p><em>Definition.</em> $\mathbf{X} = (X_1, \ldots, X_p)^T$ has a multivariate Gaussian distribution if every linear combination of its components has a (multivariate) Gaussian distribution. Formally,</p>
<script type="math/tex; mode=display">\mathbf{X} \sim \mathcal{N}(\mu, \Sigma) \,\,\,\, \text{if and only if} \,\,\,\, A\mathbf{X} \sim \mathcal{N}(A\mu, A\Sigma A^T) \enspace ,</script>
<p>see for example Blitzstein & Hwang (<a href="https://projects.iq.harvard.edu/stat110">2014</a>, pp. 309-310).</p>
<p>Second, from this it immediately follows that any subset of the random variables $H \subset \mathbf{X}$ is itself normally distributed, with mean and covariance given by simply ignoring all elements that are not in $H$; this is called the <em>marginalization</em> property. In particular, we choose a linear transformation that simply ignores the components we want to marginalize out. As an example, let’s take the <em>trivariate</em> Gaussian distribution</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
X_1 \\
X_2 \\
X_3 \\
\end{pmatrix} \sim \mathcal{N}\left(
\begin{pmatrix}
\mu_1 \\
\mu_2 \\
\mu_3 \\
\end{pmatrix},
\begin{pmatrix}
\sigma_1^2 & & \\
\rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \\
\rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \\
\end{pmatrix}
\right) \enspace , %]]></script>
<p>which has a three-dimensional mean vector and adds a variance and two correlations to the (symmetric) covariance matrix. Define</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \enspace , %]]></script>
<p>which picks out the components $X_1$ and $X_2$ and ignores $X_3$. Putting this into the equality from the definition, we arrive at</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix}
X_1 \\
X_2 \\
X_3
\end{pmatrix} &\sim \mathcal{N}\left(
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix}
\mu_1 \\
\mu_2 \\
\mu_3
\end{pmatrix},
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix}
\sigma_1^2 & & \\
\rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \\
\rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \\
\end{pmatrix}
\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}
\right) \\[1em]
\begin{pmatrix}
X_1 \\
X_2
\end{pmatrix} &\sim \mathcal{N}\left(
\begin{pmatrix}
\mu_1 \\
\mu_2
\end{pmatrix},
\begin{pmatrix}
\sigma_1^2 & \\
\rho_{12}\sigma_1\sigma_2 & \sigma_2^2
\end{pmatrix}
\right) \enspace .
\end{aligned} %]]></script>
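<p>This computation is easy to check directly: multiplying out $A\Sigma A^T$ with a selection matrix that keeps the first two coordinates leaves exactly the upper-left $2 \times 2$ block of $\Sigma$. A small sketch (here in Python, with arbitrary illustrative values for the variances and correlations):</p>

```python
def matmul(A, B):
    # plain nested-list matrix multiplication
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

# trivariate covariance matrix (illustrative values)
s1, s2, s3 = 1.0, 2.0, 3.0
r12, r13, r23 = 0.3, 0.5, -0.2
Sigma = [
    [s1 * s1,       r12 * s1 * s2, r13 * s1 * s3],
    [r12 * s1 * s2, s2 * s2,       r23 * s2 * s3],
    [r13 * s1 * s3, r23 * s2 * s3, s3 * s3],
]

# selection matrix that keeps X1 and X2 and drops X3
A = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]

# the covariance of (X1, X2): the upper-left 2x2 block of Sigma
sub = matmul(matmul(A, Sigma), transpose(A))
print(sub)  # [[1.0, 0.6], [0.6, 4.0]]
```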
<p>Two points to wrap up. First, it helps to know your definitions. Second, in the Gaussian case, computing marginal distributions is trivial. Conditional distributions are a bit harder, unfortunately. But not by much.</p>
<!-- # Property II: Conditionals are Gaussians -->
<h1 id="property-ii-closed-under-conditioning">Property II: Closed under Conditioning</h1>
<p>Conditioning means incorporating information. The fact that Gaussian distributions are closed under conditioning means that, if we start with a Gaussian distribution and update our knowledge given the observed value of one of its components, then the resulting distribution is still Gaussian — we never have to leave the wonderful land of the Gaussians! In the following, we prove this first for the simple bivariate case, which should also give some intuition as to how conditioning differs from marginalizing, and then provide the more general expression for $p$ dimensions.</p>
<p>Instead of ignoring information, as we did when computing marginal distributions above, we now want to incorporate information we have about the other random variable $X_2$. Conditioning implies <em>learning</em>: how does our knowledge that $X_2 = x_2$ change our knowledge about $X_1$?</p>
<h2 id="2-dimensional-case-1">2-dimensional case</h2>
<p>Let’s say we observe $X_2 = x_2$. How does that change our beliefs about $X_1$? The product rule above leads to Bayes’ rule (via simple division), which is exactly what we need:</p>
<script type="math/tex; mode=display">f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)} \enspace ,</script>
<p>where we have suppressed conditioning on the parameters $\rho, \sigma_1^2, \sigma_2^2$ to avoid cluttered notation. Let’s do some algebra! We write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(x_1 \mid x_2) &= \frac{\frac{1}{2\pi \sqrt{\sigma_1^2 \sigma_2^2(1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] \right)}{\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp \left( -\frac{1}{2\sigma_2^2} x_2^2\right)} \\[1em]
&= \frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\bigg] + \frac{1}{2\sigma_2^2} x_2^2 \right) \enspace ,
\end{aligned} %]]></script>
<p>which already looks promising. Putting the $x_2^2$ term into the angular brackets, we should see a nice quadratic formula:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \frac{2\sigma^2_1 \sigma^2_2(1 - \rho^2)}{2\sigma_2^2} x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \sigma^2_1 (1 - \rho^2) x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \sigma_1^2 x_2^2 + \sigma_1^2 \rho^2 x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 \sigma^2_2(1 - \rho^2)} \bigg[ \sigma_2^2 x_1^2 - 2\rho \sigma_1 \sigma_2 x_1 x_2 + \sigma^2_1 \rho^2 x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 (1 - \rho^2)} \bigg[x_1^2 - 2\rho \frac{\sigma_1}{\sigma_2} x_1 x_2 + \frac{\sigma^2_1}{\sigma^2_2 } \rho^2 x_2^2 \bigg] \right) \\[1em]
&=\frac{1}{\sqrt{2\pi \sigma_1^2 (1 - \rho^2)}} \exp \left(-\frac{1}{2\sigma^2_1 (1 - \rho^2)} \left(x_1 - \rho \frac{\sigma_1}{\sigma_2} x_2 \right)^2 \right) \enspace .
\end{aligned} %]]></script>
<p>Done! The conditional distribution has a mean of $\rho \frac{\sigma_1}{\sigma_2}x_2$ and a variance of $\sigma_1^2(1 - \rho^2)$. What does this look like in $p$ dimensions?</p>
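<p>For readers who like to double-check algebra with code, here is a small NumPy sketch (not part of the original post; the parameter values are arbitrary) that compares the ratio $f(x_1, x_2) / f(x_2)$ against the derived conditional density:</p>

```python
import numpy as np

def joint_pdf(x1, x2, rho, s1, s2):
    # zero-mean bivariate Gaussian density, in the expanded form used above
    c = 1.0 / (2 * np.pi * np.sqrt(s1**2 * s2**2 * (1 - rho**2)))
    quad = s2**2 * x1**2 - 2 * rho * s1 * s2 * x1 * x2 + s1**2 * x2**2
    return c * np.exp(-quad / (2 * s1**2 * s2**2 * (1 - rho**2)))

def norm_pdf(x, mu, var):
    # univariate Gaussian density with mean mu and variance var
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rho, s1, s2 = 0.7, 1.0, 2.0
x2 = 1.5                              # observed value of X2
cond_mean = rho * s1 / s2 * x2
cond_var = s1**2 * (1 - rho**2)

# f(x1 | x2) = f(x1, x2) / f(x2) should match the derived Gaussian
for x1 in (-1.0, 0.0, 2.0):
    lhs = joint_pdf(x1, x2, rho, s1, s2) / norm_pdf(x2, 0.0, s2**2)
    rhs = norm_pdf(x1, cond_mean, cond_var)
    assert np.isclose(lhs, rhs)
```

<p>The agreement is exact up to floating-point error, since the identity holds analytically for any $\rho$, $\sigma_1$, and $\sigma_2$.</p>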
<h2 id="p-dimensional-case-1">$p$-dimensional case</h2>
<p>We need a little bit more notation for the crazy ride we’re about to embark on. Let $\mathbf{x} = (x_1, \ldots, x_n)^T$ be an $n$-dimensional vector and $\mathbf{y} = (y_1, \ldots, y_m)^T$ an $m$-dimensional vector, and assume that the two are jointly Gaussian distributed with mean zero (as before) and covariance matrix $\Sigma \in \mathbb{R}^{(n + m) \times (n + m)}$. Note that we can write $\Sigma$ as a block matrix, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma = \begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{yx} & \Sigma_{yy}
\end{pmatrix} \enspace , %]]></script>
<p>where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariance matrices of $\mathbf{x}$ and $\mathbf{y}$, respectively, and $\Sigma_{xy} = (\Sigma_{yx})^T$ gives the covariance between $\mathbf{x}$ and $\mathbf{y}$. We remember the density function of a multivariate Gaussian distribution from above, and take a first stab:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(\mathbf{x} \mid \mathbf{y}) &= \frac{f(\mathbf{x}, \mathbf{y})}{f(\mathbf{y})}\\[.5em]
&= \frac{(2\pi)^{-(n + m) / 2} |\Sigma|^{-1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \Sigma^{-1} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}\right]\right)}{(2\pi)^{-m/2}|\Sigma_{yy}|^{-1/2}\text{exp} \left(-\frac{1}{2} \left[\mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right)} \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{yx} & \Sigma_{yy}
\end{pmatrix}^{-1} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}\right] + \frac{1}{2}\mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{yx} & \Sigma_{yy}
\end{pmatrix}^{-1} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \enspace .
\end{aligned} %]]></script>
<p>There’s only a slight problem. The inverse of the block matrix is pretty <a href="https://en.wikipedia.org/wiki/Invertible_matrix#Blockwise_inversion">ugly</a>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Sigma^{-1} = \begin{pmatrix}
\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} \right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\
-\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}
\end{pmatrix} \enspace , %]]></script>
<p>where $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ is the <a href="https://en.wikipedia.org/wiki/Schur_complement">Schur complement</a> of $\Sigma_{yy}$ in the block matrix above. Let’s be lazy and delay computation by simply renaming the relevant parts, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
\Omega = \begin{pmatrix}
\Omega_{xx} & \Omega_{xy} \\
\Omega_{yx} & \Omega_{yy}
\end{pmatrix} = \Sigma^{-1} \enspace . %]]></script>
<p>Proceeding bravely, we write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(\mathbf{x} \mid \mathbf{y}) &= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix}^T \begin{pmatrix}
\Omega_{xx} & \Omega_{xy} \\
\Omega_{yx} & \Omega_{yy}
\end{pmatrix} \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[\begin{pmatrix}
\mathbf{x}^T \Omega_{xx} + \mathbf{y}^T \Omega_{yx} \\ \mathbf{x}^T \Omega_{xy} + \mathbf{y}^T \Omega_{yy}
\end{pmatrix}^T \begin{pmatrix}\mathbf{x} \\ \mathbf{y}\end{pmatrix} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[
\mathbf{x}^T \Omega_{xx} \mathbf{x} + \mathbf{y}^T \Omega_{yx} \mathbf{x} + \mathbf{x}^T \Omega_{xy} \mathbf{y} + \mathbf{y}^T \Omega_{yy} \mathbf{y} - \mathbf{y}^T \Sigma_{yy}^{-1} \mathbf{y}\right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2} \left[
\mathbf{x}^T \Omega_{xx} \mathbf{x} + 2\mathbf{x}^T \Omega_{xy} \mathbf{y} + \mathbf{y}^T \left(\Omega_{yy} - \Sigma_{yy}^{-1}\right) \mathbf{y}\right]\right) \enspace ,
\end{aligned} %]]></script>
<p>where we get the last line by noting that $\mathbf{y}^T \Omega_{yx} \mathbf{x} = \left(\mathbf{x}^T \Omega_{xy} \mathbf{y}\right)^T$, i.e. they give the same scalar. It is also important to keep in mind that $ \Omega_{yy} \neq \Sigma_{yy}^{-1}$.</p>
<p>There is hope: we are in a situation analogous to the two-dimensional case described above. Somehow we must be able to ‘‘complete the square’’ in the more general $p$-dimensional case, too.</p>
<p>Scribbling on paper for a bit, we dare to conjecture that the conditional distribution is</p>
<script type="math/tex; mode=display">f(\mathbf{x} \mid \mathbf{y}) = (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y}\right)^T\Omega_{xx}\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right)\right) \enspace ,</script>
<p>which expands into</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left[\mathbf{x}^T\Omega_{xx}\mathbf{x} + \mathbf{x}^T \Omega_{xx} \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} + \mathbf{y}^T \Omega_{xy}^{T} \Omega_{xx}^{-T} \Omega_{xx}\mathbf{x} + \mathbf{y}^T \Omega_{xy}^T \Omega_{xx}^{-T} \Omega_{xx} \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right]\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left[\mathbf{x}^T\Omega_{xx}\mathbf{x} + 2\mathbf{x}^T \Omega_{xy} \mathbf{y} + \mathbf{y}^T \Omega_{yx} \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right]\right) \enspace .
\end{aligned} %]]></script>
<p>For our conjecture to be true, it must hold that</p>
<script type="math/tex; mode=display">\Omega_{yy} - \Sigma_{yy}^{-1} = \Omega_{yx} \Omega_{xx}^{-1} \Omega_{xy} \enspace .</script>
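<p>This identity is easy to check numerically before proving it. A short NumPy sketch (my addition, using an arbitrary random covariance matrix) compares both sides:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2
A = rng.normal(size=(n + m, n + m))
Sigma = A @ A.T + (n + m) * np.eye(n + m)   # random positive-definite covariance
Omega = np.linalg.inv(Sigma)

Oxx, Oxy = Omega[:n, :n], Omega[:n, n:]
Oyx, Oyy = Omega[n:, :n], Omega[n:, n:]
Syy = Sigma[n:, n:]

lhs = Oyy - np.linalg.inv(Syy)              # Omega_yy - Sigma_yy^{-1}
rhs = Oyx @ np.linalg.inv(Oxx) @ Oxy        # Omega_yx Omega_xx^{-1} Omega_xy
assert np.allclose(lhs, rhs)
```
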
<p>Indeed, remember that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\Sigma^{-1} &= \begin{pmatrix}
\Omega_{xx} & \Omega_{xy} \\
\Omega_{yx} & \Omega_{yy}
\end{pmatrix} \\[1em]
&= \begin{pmatrix}
\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx} \right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\
-\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}
\end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>and therefore</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\Omega_{yx} \Omega_{xx}^{-1} \Omega_{xy} &= -\Sigma_{yy}^{-1}\Sigma_{yx}\overbrace{\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}
\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)}^{I} \left(
-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\[1em]
&= -\Sigma_{yy}^{-1}\Sigma_{yx} \left(-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\[1em]
&= \Sigma_{yy}^{-1}\Sigma_{yx} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\[1em]
&= \underbrace{\Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}}_{\Omega_{yy}} - \Sigma_{yy}^{-1} \enspace .
\end{aligned} %]]></script>
<p>This means that we have correctly completed the square! To clean up the business of the determinants, note that the determinant of a block matrix <a href="https://en.wikipedia.org/wiki/Determinant#Block_matrices">factors</a> such that</p>
<script type="math/tex; mode=display">|\Sigma| = |\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}| \times |\Sigma_{yy}| \enspace .</script>
<p>Substituting this into our equation for the conditional density, as well as substituting all the $\Omega$’s with $\Sigma$’s, results in</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f(\mathbf{x} \mid \mathbf{y}) &= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \text{exp} \left(-\frac{1}{2}
\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y}\right)^T\Omega_{xx}\left(\mathbf{x} + \Omega_{xx}^{-1} \Omega_{xy} \mathbf{y} \right)\right) \\[1em]
&= (2\pi)^{-n/2} |\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}|^{-1/2} \text{exp} \left(-\frac{1}{2}
\left(\mathbf{x} - \Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y}\right)^T \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\left(\mathbf{x} - \Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y}\right)\right) \enspace .
\end{aligned} %]]></script>
<p>Thus, if $(\mathbf{x}, \mathbf{y})$ are jointly normally distributed, then incorporating the information that $\mathbf{Y} = \mathbf{y}$ leads to a conditional distribution $f(\mathbf{x} \mid \mathbf{y})$ that is Gaussian with conditional mean $\Sigma_{xy} \Sigma_{yy}^{-1} \mathbf{y}$ and conditional covariance $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$.</p>
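<p>The full result can also be verified numerically. The following Python sketch (not from the post; the covariance matrix and evaluation points are arbitrary) checks the closed-form conditional density against the ratio of joint and marginal densities:</p>

```python
import numpy as np

def mvn_pdf(z, cov):
    # zero-mean multivariate Gaussian density evaluated at z
    k = len(z)
    return np.exp(-0.5 * z @ np.linalg.inv(cov) @ z) / np.sqrt((2 * np.pi) ** k * np.linalg.det(cov))

rng = np.random.default_rng(3)
n, m = 2, 2
A = rng.normal(size=(n + m, n + m))
Sigma = A @ A.T + (n + m) * np.eye(n + m)   # random positive-definite joint covariance
Sxx, Sxy = Sigma[:n, :n], Sigma[:n, n:]
Syx, Syy = Sigma[n:, :n], Sigma[n:, n:]

y = np.array([0.5, -1.0])                   # observed value of Y
cond_mean = Sxy @ np.linalg.inv(Syy) @ y
cond_cov = Sxx - Sxy @ np.linalg.inv(Syy) @ Syx

x = np.array([0.3, 1.2])
lhs = mvn_pdf(np.concatenate([x, y]), Sigma) / mvn_pdf(y, Syy)   # f(x | y) via Bayes' rule
rhs = mvn_pdf(x - cond_mean, cond_cov)                           # the closed-form result
assert np.isclose(lhs, rhs)
```
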
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen that the Gaussian distribution has two important properties: it is closed under (a) <em>marginalization</em> and (b) <em>conditioning</em>. For the bivariate case, an accompanying <a href="https://fdabl.shinyapps.io/two-properties/">Shiny app</a> hopefully helped to build some intuition about the difference between these two operations.</p>
<p>For the general $p$-dimensional case, we noted that a random variable <em>per definition</em> follows a (multivariate) Gaussian distribution if and only if every linear combination of its components follows a Gaussian distribution. This made it obvious that the Gaussian distribution is closed under marginalization — we simply ignore the components we want to marginalize over in the linear combination.</p>
<p>To show that an arbitrary dimensional Gaussian distribution is closed under conditioning, we had to rely on a mathematical trick called ‘‘completing the square’’, as well as certain properties of matrices few mortals can remember. In conclusion, I think we should celebrate the fact that frequent operations such as marginalizing and conditioning do not expel us from the wonderful land of the Gaussians.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
<hr />
<p><em>I would like to thank Don van den Bergh and Sophia Crüwell for helpful comments on this blogpost.</em></p>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>$\Sigma^{-1}$ is the main object of interest in Gaussian graphical models. This is because of another special property of the Gaussian: if the off-diagonal element $(i, j)$ in $\Sigma^{-1}$ is zero, then the variables $X_i$ and $X_j$ are <em>conditionally independent</em> given all the other variables — there is no edge between those two variables in the graph. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You might enjoy training your intuitions about correlations on <a href="http://guessthecorrelation.com/">http://guessthecorrelation.com/</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>See also Dennis Lindley’s paper <a href="https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9884.00238"><em>The philosophy of statistics</em></a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>This follows from the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2} = \sqrt{\pi}$, see <a href="https://en.wikipedia.org/wiki/Gaussian_integral">here</a>. For more on why $\pi$ and $e$ feature in the Gaussian density, see <a href="https://math.stackexchange.com/questions/28558/what-do-pi-and-e-stand-for-in-the-normal-distribution-formula">this</a>. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The land of the Gaussians is vast: its inhabitants — all Gaussian distributions — are also closed under multiplication and convolution. This might make for a future blog post. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderCurve fitting and the Gaussian distribution2019-01-11T16:30:00+00:002019-01-11T16:30:00+00:00https://fabiandablander.com/r/Curve-Fitting-Gaussian<p>Judea Pearl said that much of machine learning is just curve fitting<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> — but it is quite impressive how far you can get with that, isn’t it? In this blog post, we will look at the mother of all curve fitting problems: fitting a straight line to a number of points. In doing so, we will engage in some statistical detective work and discover the method of least squares as well as the Gaussian distribution.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup></p>
<h2 id="fitting-a-line">Fitting a line</h2>
<p>A straight line in the Euclidean plane is described by an intercept ($b_0$) and a slope ($b_1$), i.e.,</p>
<script type="math/tex; mode=display">y = b_0 + b_1x \enspace .</script>
<p>We are interested in finding the values for $(b_0, b_1)$, and so we must collect data points $d = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Data collection is often tedious, so let’s do it one point at a time. The first point is $P_1 = (x_1, y_1) = (1, 2)$, and if we plug the values into the equation for the line (i.e., set $x_1 = 1$ and $y_1 = 2$), we get</p>
<script type="math/tex; mode=display">2 = b_0 + 1b_1 \enspace ,</script>
<div style="float: left; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-1-1.png" title="Fitting a line with only one point (underdetermined)." alt="Fitting a line with only one point (underdetermined)." style="display: block; margin: auto auto auto 0;" />
</div>
<p>an equation with two unknowns. We call this system of equations <em>underdetermined</em>: we cannot uniquely solve for $b_0$ and $b_1$, but instead get infinitely many solutions, all satisfying $b_1 = 2 - b_0$; see Figure 1 on the left. However, if we add another point $P_2 = (3, 1)$, the resulting system of equations becomes</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1 \\
1 &= b_0 + 3b_1 \enspace .
\end{aligned} %]]></script>
<p>We now have two equations in two unknowns, and the system is <em>determined</em> or <em>identified</em>: there is a unique solution for $b_0$ and $b_1$. After some rearranging, we find $b_1 = -0.5$ and $b_0 = 2.5$. This specifies exactly one line, as you can see in Figure 2 on the right.</p>
<div style="float: right; padding: 10px 0px 10px 10px">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-2-1.png" title="Fitting a line with two points (determined)." alt="Fitting a line with two points (determined)." style="display: block; margin: auto 0 auto auto;" />
</div>
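<p>As a quick sanity check, we can let R solve this two-equation system directly (a minimal sketch; <code>solve</code> returns the unique solution of a determined linear system):</p>

```r
# rows are the two equations: 2 = b0 + 1*b1 and 1 = b0 + 3*b1
A <- matrix(c(1, 1,
              1, 3), nrow = 2, byrow = TRUE)
y <- c(2, 1)
solve(A, y)  # b0 = 2.5, b1 = -0.5
```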
<p>We could end the blog post here, but that would not be particularly insightful for data analysis problems in which we have more data points. Thus, let’s see where it takes us when we add another point, $P_3 = (2, 2)$. The resulting system of equations becomes</p>
<p><script type="math/tex">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1\\
1 &= b_0 + 3b_1 \\
2 &= b_0 + 2b_1 \enspace ,
\end{aligned} %]]></script></p>
<div style="float: left;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-3-1.png" title="Fitting lines with three points (overdetermined)." alt="Fitting lines with three points (overdetermined)." style="display: block; margin: auto 0 auto auto;" />
</div>
<p>which is <em>overdetermined</em> — we cannot fit a line that passes through all three of these points. We can, for example, fit three <em>separate</em> lines, given by two out of three of the equations; see Figure 3. But which of these lines, if any, is the “best” line?</p>
<h2 id="overfitting-a-curve">(Over)Fitting a curve</h2>
<p>Lacking justification to choose between any of these three, we could reduce the case to one that we have solved already, which is usually a good strategy in mathematics. In particular, we could try to reduce the <em>overdetermined</em> to the <em>determined</em> case. Above, we noticed that we can exactly solve for the two parameters $(b_0, b_1)$ using two data points $\{P_1, P_2\}$. This generalizes: we can exactly solve for $p$ parameters using $p$ data points. In the problem above, we have three data points but only two parameters. Let’s bend the notion of a <em>line</em> a bit — call it a <em>curve</em> — and introduce a third parameter $b_2$. But what multiplies this parameter $b_2$ in our equations? It seems we are missing a dimension. To remedy this, let’s add a dimension by simply squaring the $x$ coordinate, so that a new point becomes $P_1’ = (y_1, x_1, x_1^2)$. The resulting system of equations is</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1 + 1b_2\\
1 &= b_0 + 3b_1 + 9b_2\\
2 &= b_0 + 2b_1 + 4b_2 \enspace .
\end{aligned} %]]></script>
<p>To simplify notation, we can write these equations in matrix algebra. Specifically, we write</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{y} &= \mathbf{X}\mathbf{b} \\[1em]
\begin{pmatrix}
2 \\ 1 \\ 2
\end{pmatrix} &=
\begin{pmatrix}
1 & 1 & 1\\
1 & 3 & 9\\
1 & 2 & 4
\end{pmatrix} \cdot
\begin{pmatrix}
b_0 \\ b_1 \\ b_2
\end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>where we are again interested in solving for the unknown $\mathbf{b}$. Because this system is <em>determined</em>, we can arrive at the solution by inverting the matrix $\mathbf{X}$, such that $\mathbf{b} = \mathbf{X}^{-1}\mathbf{y}$, where $\mathbf{X}^{-1}$ is the inverse of $\mathbf{X}$. The resulting “line” is shown in Figure 4 on the left.</p>
<div style="float: left;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-4-1.png" title="Fitting a curve with three points" alt="Fitting a curve with three points" style="display: block; margin: auto 0 auto auto;" />
</div>
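<p>The same computation carries over to the three-parameter system; inverting $\mathbf{X}$ as described above reproduces the exactly interpolating curve (a minimal sketch of the matrix inversion):</p>

```r
# design matrix with columns (1, x, x^2) for the x values 1, 3, 2
X <- matrix(c(1, 1, 1,
              1, 3, 9,
              1, 2, 4), nrow = 3, byrow = TRUE)
y <- c(2, 1, 2)
solve(X) %*% y  # b0 = 1, b1 = 1.5, b2 = -0.5
```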
<p>There are two issues with this approach. First, it leads to overfitting: while we explain the data at hand well (in fact, we do so perfectly), the curve is likely to generalize poorly to new data. For example, the curve is so peculiar (and it would get much more peculiar if we had fitted it to more data in the same way) that new points will probably lie far away from it. Second, we haven’t really explained anything. In the words of the great R.A. Fisher:</p>
<blockquote>
<p>“[T]he objective of statistical methods is the reduction of data. A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent […] the relevant information contained in the original data.” (Fisher, 1922, p. 311)</p>
</blockquote>
<p>By introducing as many parameters as we have data points, no such reduction has taken place. So let’s go back to our original problem: which line should we draw through three, or more generally, any number of points $n$?</p>
<h2 id="legendre-and-the-best-fit">Legendre and the “best fit”</h2>
<p>Reducing the overdetermined to the determined case did not really work. But there is still one option: reducing it to the underdetermined case. To achieve that, we make the reasonable assumption that each observation is corrupted by noise, such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
2 &= b_0 + 1b_1 + \epsilon_1 \\
1 &= b_0 + 3b_1 + \epsilon_2 \\
2 &= b_0 + 2b_1 + \epsilon_3
\end{aligned} %]]></script>
<p>where ($\epsilon_1$, $\epsilon_2$, $\epsilon_3$) are unobserved quantities<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. This introduces another $n$ unknowns! Therefore, we have too few equations for too many parameters — our previously overdetermined system becomes underdetermined.</p>
<p>However, as we saw above, we cannot uniquely solve an underdetermined system of equations. We have to add more constraints. Adrien-Marie Legendre, of whom <a href="https://en.wikipedia.org/wiki/Adrien-Marie_Legendre#/media/File:Legendre.jpg">favourable pictures are difficult to find</a>, proposed what has become known as the <em>method of least squares</em> to solve this problem:</p>
<blockquote>
<p>“Of all the principles which can be proposed for that purpose, I think there is none more general, more exact, and more easy of application, than that of which we made use in the preceding researches, and which consists of rendering the sum of squares of the errors a <em>minimum</em>. By this means, there is established among the errors a sort of equilibrium which, preventing the extremes from exerting an undue influence, is very well fitted to reveal that state of the system which most nearly approaches the truth.” (Legendre, 1805, p. 72-73)</p>
</blockquote>
<div style="float: right; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-5-1.png" title="Fitting the line that minimizes the sum of squared errors." alt="Fitting the line that minimizes the sum of squared errors." style="display: block; margin: auto 0 auto auto;" />
</div>
<p>There is only one line that minimizes the sum of squared errors. Thus, by adding this constraint we can uniquely solve the underdetermined system; see the Figure on the right. The development of least squares was a watershed moment in mathematical statistics — Stephen Stigler likens its importance to the development of calculus in mathematics (Stigler, 1986, p. 11).</p>
<p>We have now seen, conceptually, how to find the “best” fitting line. But how do we do it mathematically? How do we arrive at the Figure on the right? I will illustrate two approaches: the one proposed by Legendre using optimization, and another one using a geometrical insight.</p>
<h2 id="least-squares-i-optimization">Least squares I: Optimization</h2>
<p>Our goal is to find the line that minimizes the <em>sum of squared errors</em>. To simplify, we center the data by subtracting the mean from $y$ and $x$, respectively; i.e., $y’ = y - \frac{1}{n} \sum_{i=1}^n y_i$ and $x’ = x - \frac{1}{n} \sum_{i=1}^n x_i$. This makes the intercept zero, $b_0 = 0$, so we avoid the need to estimate it. In the following, to avoid cluttering notation, I will omit the prime and assume both $y$ and $x$ are mean-centered.</p>
<p>For a particular observation $y_i$, our line predicts it to be $x_i b_1$. This implies that the error is $\epsilon_i = y_i - x_i b_1$, and the sum of all squared errors is</p>
<script type="math/tex; mode=display">\sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (y_i - x_i b_1)^2 \enspace .</script>
<p>We want to find the value for $b_1$, call it $\hat{b}_1$, that minimizes this quantity; that is, we must solve</p>
<script type="math/tex; mode=display">\hat{b}_1 = \underset{b_1}{\text{argmin}} \left (\sum_{i=1}^n (y_i - x_i b_1)^2 \right) \enspace .</script>
<p>We could use fancy algorithms like <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a>, but we can also engage in some good old high school mathematics and minimize the expression analytically. Note that the expression is quadratic in $b_1$ with a positive leading coefficient, so it attains its unique minimum where the derivative is zero. So, to work!</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= \frac{\partial}{\partial b_1} \left (\sum_{i=1}^n (y_i - x_i b_1)^2 \right) \\[0.5em]
0 &= \frac{\partial}{\partial b_1} \left(\sum_{i=1}^n y_i^2 - 2\sum_{i=1}^n y_i x_i b_1 + \sum_{i=1}^n x_i^2 b_1^2 \right) \\[0.5em]
0 &= 0 - 2 \sum_{i=1}^n y_i x_i + 2 \sum_{i=1}^n x_i^2 b_1 \\[0.5em]
2 \sum_{i=1}^n x_i^2 b_1 &= 2 \sum_{i=1}^n y_i x_i \\[0.5em]
\hat{b}_1 &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2} \enspace ,
\end{aligned} %]]></script>
<p>where $\sum_{i=1}^n y_i x_i$ is the (scaled by $n$) covariance between $x$ and $y$, and $\sum_{i=1}^n x_i^2$ is the (scaled by $n$) variance of $x$.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
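<p>We can check this closed-form solution against R’s built-in <code>lm</code> on our three points (the intercept-free formula applies after mean-centering; the slope is unaffected by centering):</p>

```r
x <- c(1, 3, 2)
y <- c(2, 1, 2)
xc <- x - mean(x)
yc <- y - mean(y)
b1 <- sum(yc * xc) / sum(xc^2)
b1                  # -0.5
coef(lm(y ~ x))[2]  # same slope
```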
<h2 id="least-squares-ii-projection">Least squares II: Projection</h2>
<p>Another way to think about this problem is <em>geometrically</em>. This requires some linear algebra, and so we had better write the system of equations in matrix form. For ease of exposition, we again mean-center the data. First, note that the errors in matrix form yield</p>
<div style="float: left; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-6-1.png" title="Figure illustrating the geometric insight." alt="Figure illustrating the geometric insight." style="display: block; margin: auto 0 auto auto;" />
</div>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{pmatrix} &=
\begin{pmatrix}
y_1 \\ y_2 \\ \vdots \\ y_n
\end{pmatrix} -
\begin{pmatrix}
x_1 \\ x_2 \\ \vdots \\ x_n
\end{pmatrix} b_1 \\[1em]
\mathbf{\epsilon} &= \mathbf{y} - \mathbf{x}b_1 \enspace .
\end{aligned} %]]></script>
<p>and that, at the least squares solution, the vector of errors is <em>orthogonal</em> to the vector $\mathbf{x}$, that is, the two vectors are at a 90 degree angle to each other; see the Figure on the left. This means that the <em>dot product</em> of the vector of errors and the vector of $x$ values is zero, i.e.,</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
x_1 & x_2 & \ldots & x_n
\end{pmatrix}
\begin{pmatrix}
\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n
\end{pmatrix} = 0 \enspace , %]]></script>
<p>or $\mathbf{x}^T \mathbf{\epsilon} = 0$, in short. Using this geometric insight, we can derive the least squares solution as follows</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x}^T \mathbf{\epsilon} &= 0 \\[.5em]
\mathbf{x}^T \left(\mathbf{y} - \mathbf{x} b_1 \right) &= 0 \\[.5em]
\mathbf{x}^T \mathbf{x} b_1 &= \mathbf{x}^T \mathbf{y} \\[.5em]
b_1 &= \frac{\mathbf{x}^T \mathbf{y}}{\mathbf{x}^T \mathbf{x}} \\[.5em]
b_1 &= \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n x_i^2} \enspace ,
\end{aligned} %]]></script>
<p>which yields the same result as above.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup> As an important special case, note that the least squares solution to a system of equations with only the intercept $b_0$ as unknown, i.e., $y_i = b_0$, is the mean of $y$. It is this fact that Gauss used to justify the Gaussian distribution as an error distribution; see below.</p>
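<p>The orthogonality condition is easy to verify numerically: the residuals from the least squares fit have zero dot product with the mean-centered predictor. A minimal sketch using our three points:</p>

```r
x <- c(1, 3, 2) - mean(c(1, 3, 2))  # mean-centered predictor
y <- c(2, 1, 2) - mean(c(2, 1, 2))  # mean-centered response
b1  <- as.numeric(t(x) %*% y / (t(x) %*% x))  # projection formula
eps <- y - x * b1                             # residuals
sum(x * eps)  # zero up to floating point error
```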
<h2 id="gauss-laplace-and-how-good-is-best">Gauss, Laplace, and “how good is best?”</h2>
<p>The method of least squares yields the “best” fitting line in the sense that it minimizes the sum of squared errors. But without any statements about the stochastic nature of the errors $\mathbf{\epsilon}$, the question of “how good is best?” remains unanswered.</p>
<p>It was Carl Friedrich Gauss who in 1809 couched the least squares problem in <em>probabilistic terms</em>. Specifically, he assumed that each error term $\epsilon_i$ comes from some distribution $\phi$. Under this distribution, the probability (density) of a particular $\epsilon_i$ is large when $\epsilon_i$ is small, that is, when the observed and predicted values are close together. Further assuming that the errors are <em>independent and identically</em> distributed, he wanted to find the parameter values which <em>maximize</em></p>
<script type="math/tex; mode=display">\Omega = \phi(\epsilon_1) \cdot \phi(\epsilon_2) \cdot \ldots \cdot \phi(\epsilon_n) = \prod_{i=1}^n \phi(\epsilon_i) \enspace ,</script>
<p>that is, maximize the probability of the errors being small (see also Stigler, 1986, p. 141).<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup></p>
<p><img src="../assets/img/Gauss.jpg" align="left" style="padding: 10px 10px 10px 0px;" /></p>
<p>All that is left now is to find the distribution $\phi$. Gauss noticed that he could make some general statements about $\phi$, namely that it should be symmetric and have its maximum at 0. He then <em>assumed</em> that the mean should be the best value for summarizing $n$ measurements $(y_1, \ldots, y_n)$; that is, he assumed that maximizing $\Omega$ should lead to the same solution as minimizing the sum of squared errors when we have one unknown.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup></p>
<p>With this circularity — to justify least squares, I assume least squares — he proved that the distribution must be of the form</p>
<script type="math/tex; mode=display">\phi(\epsilon_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left (-\frac{1}{2\sigma^2} \epsilon_i^2 \right) \enspace ,</script>
<p>where $\sigma^2$ is the variance<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup>; see the Figure below for three examples. The distribution has become known as the Gaussian distribution, although — in the spirit of Stigler’s law of eponymy<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup> — de Moivre and Laplace had discovered it before Gauss (see also Stahl, 2006). Karl Pearson popularized the term <em>normal distribution</em>, an act for which he seems to have shown some regret:</p>
<blockquote>
<p>“Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ‘abnormal’.” (Pearson, 1920, p. 25)</p>
</blockquote>
<!-- <div style= "float: left; padding: 10px 10px 10px 0px;"> -->
<p><img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-7-1.png" title="Shows three Normal distributions with different variance." alt="Shows three Normal distributions with different variance." style="display: block; margin: auto;" />
<!-- </div> --></p>
<p>Using the Gaussian distribution, the maximization problem becomes</p>
<script type="math/tex; mode=display">\begin{aligned}
\Omega = \prod_{i=1}^n \phi(\epsilon_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} \exp \left (-\frac{1}{2\sigma^2} \epsilon_i^2 \right) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \exp \left (-\frac{1}{2\sigma^2} \sum_{i=1}^n \epsilon_i^2 \right) \enspace .
\end{aligned}</script>
<p>Note that the value at which $\Omega$ attains its maximum does not change when we drop the constants and take logarithms. This leaves $-\sum_{i=1}^n \epsilon_i^2$ as the expression to be maximized, which is the same as minimizing its negation, that is, minimizing the sum of squared errors.</p>
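<p>We can confirm this equivalence numerically: maximizing the Gaussian likelihood of the errors recovers the least squares slope. A minimal sketch using our three mean-centered points, with $\sigma$ fixed at 1 (its value does not affect the location of the maximum):</p>

```r
x <- c(1, 3, 2) - 2
y <- c(2, 1, 2) - 5/3
# negative Gaussian log-likelihood of the errors, as a function of the slope
neg_loglik <- function(b1) -sum(dnorm(y - x * b1, mean = 0, sd = 1, log = TRUE))
optimize(neg_loglik, interval = c(-10, 10))$minimum  # approximately -0.5
```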
<p><img src="../assets/img/Laplace.jpg" align="right" style="padding: 10px 0px 10px 10px" /></p>
<p>The “Newton of France”, Pierre-Simon de Laplace, took notice of Gauss’ argument in 1810 and rushed to give the Gaussian error curve a much more beautiful justification. If we take the errors to be themselves aggregates of many (tiny) perturbing influences, then they will be normally distributed by the <em>central limit theorem</em>. So what is this central limit theorem, anyway?</p>
<h2 id="the-central-limit-theorem">The central limit theorem</h2>
<p>The central limit theorem is one of the most stunning theorems of statistics. In the poetic words of Francis Galton</p>
<blockquote>
<p>“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66)</p>
</blockquote>
<p>The theorem basically says that if you have a sequence of independent and identically distributed random variables — what Galton calls “the mob” — with finite variance, then the distribution of the mean of this sequence, as $n$ grows larger and larger — “the greater the apparent anarchy” — gets closer and closer to a normal distribution. More precisely, as $n \rightarrow \infty$, the suitably standardized mean converges in distribution to a standard normal distribution.<sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup></p>
<p>Laplace realized that, if one takes the errors in the least squares problem to be themselves aggregates (i.e., means) of small influences, then they will be normally distributed. This provides an elegant justification for the least squares solution.</p>
<p>To illustrate, assume that a particular error $\epsilon_i$ is in fact the average of $m = 500$ small irregularities that are independent and identically distributed; for instance, assume these influences follow a uniform distribution. Let’s say we have $n = 200$ observations, and thus 200 individual errors. The R code and Figure below illustrate that the error distribution will tend to be Gaussian.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1776</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">200</span><span class="w"> </span><span class="c1"># number of errors</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">500</span><span class="w"> </span><span class="c1"># number of influences one error is made of</span><span class="w">
</span><span class="c1"># compute errors which are themselves aggregates of smaller influences</span><span class="w">
</span><span class="n">errors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">replicate</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">runif</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="m">-10</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">)))</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">errors</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Central Limit Theorem'</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">expression</span><span class="p">(</span><span class="n">epsilon</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'grey76'</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Frequency'</span><span class="p">,</span><span class="w">
</span><span class="n">yaxt</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># plot approximate Gaussian density line</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.001</span><span class="p">)</span><span class="w">
</span><span class="n">se</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">20</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">12</span><span class="w"> </span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="n">m</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">mean</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">se</span><span class="p">))</span></code></pre></figure>
<p><img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-8-1.png" title="Illustrates the Central Limit Theorem." alt="Illustrates the Central Limit Theorem." style="display: block; margin: auto;" /></p>
<p>I don’t know about you, but I think this is really cool. We started out with small irregularities that are uniformly distributed. Then we took an average of a bulk ($m = 500$) of those which constitute an error $\epsilon_i$; thus, the error itself is an aggregate. Now, by some fundamental fact about how our world works, the distribution of these errors (here, $n = 200$) can be well approximated by a Gaussian distribution. I can see why, as Galton conjectures, the Greeks would have deified such a law, if only they had known of it.</p>
<h2 id="linear-regression">Linear regression</h2>
<p>One neat feature of the Gaussian distribution is that any <em>linear combination</em> of normally distributed random variables is itself normally distributed. We may write the linear regression problem in matrix form, which makes apparent that $\mathbf{y}$ is a weighted linear combination of $\mathbf{x}$. Specifically, if we have $n$ data points, we have a system of $n$ equations which we can write in matrix notation more concisely</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix}
y_1 \\
y_2 \\
\vdots \\
y_n
\end{pmatrix} &=
\begin{pmatrix}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_n
\end{pmatrix} \cdot
\begin{pmatrix}
b_0 \\
b_1
\end{pmatrix} +
\begin{pmatrix}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_n \\
\end{pmatrix} \\[1em]
\mathbf{y} &= \mathbf{X}\mathbf{b} + \mathbf{\epsilon}
\end{aligned} %]]></script>
<div style="float: left; padding: 10px 10px 10px 0px;">
<img src="/assets/img/2019-01-11-Curve-Fitting-Gaussian.Rmd/unnamed-chunk-9-1.png" title="Illustrates that linear regression assumes that the conditional distribution of the response y given the features x is a Gaussian distribution." alt="Illustrates that linear regression assumes that the conditional distribution of the response y given the features x is a Gaussian distribution." style="display: block; margin: auto;" />
</div>
<p>Due to this linearity, the assumption of normally distributed errors propagates and results in a <em>conditional normal distribution</em> of $\mathbf{y}$, that is,</p>
<script type="math/tex; mode=display">y_i \mid \mathbf{x}_i \sim \mathcal{N}(\mathbf{x}_i^T \mathbf{b}, \sigma^2) \enspace .</script>
<p>In other words, the probability density of a particular point $y_i$ is given by</p>
<script type="math/tex; mode=display">\frac{1}{\sqrt{2\pi\sigma^2}} \exp \left( -\frac{1}{2\sigma^2} (y_i - \mathbf{x}_i^T \mathbf{b})^2\right) \enspace ,</script>
<p>which is visualized in the Figure on the left. Intuitively, the smaller the error variance, the better the fit.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, we have discussed the mother of all curve fitting problems — fitting a straight line to data points — in great detail. On this journey, we have met the method of least squares, a pillar of statistical thinking. We have seen how Gauss arrived at “his” distribution, and how Laplace gave it a beautiful justification in terms of the central limit theorem. With this, it was only a small step towards linear regression, one of the most important tools in statistics and machine learning.</p>
<hr />
<p><em>I would like to thank Don van den Bergh and Jonas Haslbeck for helpful comments on this blogpost.</em></p>
<h2 id="post-scriptum-i-linking-correlation-to-regression">Post Scriptum I: Linking correlation to regression</h2>
<p>It is a <em>truism</em> that correlation does not imply causation. <a href="https://stats.stackexchange.com/questions/376920/the-book-of-why-by-judea-pearl-why-is-he-bashing-statistics">Some</a> believe that linear regression is a causal model. It is not. To see this, we can relate the regression coefficient in simple linear regression to the correlation — they differ only in standardization.</p>
<p>Assuming mean-centered data, the sample Pearson correlation is defined as</p>
<script type="math/tex; mode=display">r_{xy} = \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i^2}} \enspace .</script>
<p>Note that correlation is symmetric — it does not matter whether we correlate $x$ with $y$ or $y$ with $x$. In contrast, regression is not symmetric. In the main text, we used $x$ to predict $y$, which yielded</p>
<script type="math/tex; mode=display">b_{xy} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2} \enspace .</script>
<p>If we were to use $y$ to predict $x$, the coefficient would be</p>
<script type="math/tex; mode=display">b_{yx} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n y_i^2} \neq b_{xy} \neq r_{xy} \enspace .</script>
<p>However, by <em>standardizing</em> the data, that is, by dividing the variables by their respective standard deviations, the regression coefficient becomes the sample correlation. Writing $L$ for the sum of squared errors of the standardized variables, we have</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\partial L}{\partial b_{xy}} &= \frac{\partial}{\partial b_{xy}} \sum_{i=1}^n\left(\frac{y_i}{\sqrt{\sum_{i=1}^n y_i^2}} - b_{xy} \frac{x_i}{\sqrt{\sum_{i=1}^n x_i^2}} \right)^2 \\[0.5em]
&= \frac{\partial}{\partial b_{xy}} \left( \frac{\sum_{i=1}^n y_i^2}{\sum_{i=1}^n y_i^2} - 2 b_{xy} \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} + b_{xy}^2 \frac{\sum_{i=1}^n x_i^2}{\sum_{i=1}^n x_i^2}\right)\\[0.5em]
&= 0 - 2 \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} + 2 b_{xy} \\[0.5em]
2 b_{xy} &= 2 \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} \\[0.5em]
b_{xy} &= \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} \enspace,
\end{aligned} %]]></script>
<p>which is equal to $r_{xy}$. This <em>standardized</em> regression coefficient can also be achieved by multiplying the <em>raw</em> regression coefficient, i.e.,</p>
<script type="math/tex; mode=display">b_s = b_{xy} \times \frac{\sqrt{\sum_{i=1}^n x_i^2}}{\sqrt{\sum_{i=1}^n y_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2} \times \frac{\sqrt{\sum_{i=1}^n x_i^2}}{\sqrt{\sum_{i=1}^n y_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}}.</script>
<p>$b_{yx}$ can be standardized in a similar way, such that</p>
<script type="math/tex; mode=display">b_s = b_{yx} \times \frac{\sqrt{\sum_{i=1}^n y_i^2}}{\sqrt{\sum_{i=1}^n x_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n y_i^2} \times \frac{\sqrt{\sum_{i=1}^n y_i^2}}{\sqrt{\sum_{i=1}^n x_i^2}} = \frac{\sum_{i=1}^n y_i x_i}{\sqrt{\sum_{i=1}^n y_i^2 \sum_{i=1}^n x_i^2}} \enspace .</script>
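<p>A quick numerical check of this identity (standardizing with the sample standard deviation; the scaling constants cancel in the ratio):</p>

```r
set.seed(2019)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
xs <- (x - mean(x)) / sd(x)      # standardized variables
ys <- (y - mean(y)) / sd(y)
b_s <- sum(ys * xs) / sum(xs^2)  # regression slope of ys on xs
all.equal(b_s, cor(x, y))        # TRUE
```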
<hr />
<h2 id="references">References</h2>
<ul>
<li>
<p>Blitzstein, J. K., & Hwang, J. (2014). <em>Introduction to Probability</em>. Chapman and Hall/CRC. [<a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573">Link</a>]</p>
</li>
<li>
<p>Ford, M. (2018). <em>Architects of Intelligence: The truth about AI from the people building it</em>. Packt Publishing. [<a href="https://www.amazon.com/Architects-Intelligence-truth-people-building/dp/1789131510/ref=sr_1_1?s=books&ie=UTF8&qid=1546765292&sr=1-1&keywords=martin+ford">Link</a>]</p>
</li>
<li>
<p>Fisher, R. A. (1922). On the mathematical foundations of theoretical statistics. <em>Phil. Trans. R. Soc. Lond. A</em>, <em>222(594-604)</em>, 309-368. [<a href="https://royalsocietypublishing.org/doi/abs/10.1098/rsta.1922.0009">Link</a>]</p>
</li>
<li>
<p>Galton, F. (1889). <em>Natural Inheritance</em>. London, UK: Richard Clay and Sons. [<a href="http://galton.org/books/natural-inheritance/pdf/galton-nat-inh-1up-clean.pdf">Link</a>]</p>
</li>
<li>
<p>Pearson, K. (1920). Notes on the history of correlation. <em>Biometrika</em>, <em>13(1)</em>, 25-45. [<a href="https://www.jstor.org/stable/2331722?seq=1#metadata_info_tab_contents">Link</a>]</p>
</li>
<li>
<p>Stahl, S. (2006). The evolution of the normal distribution. <em>Mathematics magazine</em>, <em>79(2)</em>, 96-113. [<a href="https://www.tandfonline.com/doi/abs/10.1080/0025570X.2006.11953386?journalCode=umma20">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (1980). Stigler’s Law of Eponymy. <em>Transactions of the New York Academy of Sciences</em>, <em>39(1 Series II)</em>, 147-157. [<a href="https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/j.2164-0947.1980.tb02775.x">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (1981). Gauss and the invention of least squares. <em>The Annals of Statistics</em>, <em>9(3)</em>, 465-474. [<a href="https://projecteuclid.org/download/pdf_1/euclid.aos/1176345451">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (1986). <em>The history of statistics: The measurement of uncertainty before 1900</em>. Harvard University Press. [<a href="http://www.hup.harvard.edu/catalog.php?isbn=9780674403413">Link</a>]</p>
</li>
<li>
<p>Stigler, S. M. (2007). The epic story of maximum likelihood. <em>Statistical Science</em>, <em>22(4)</em>, 598-620. [<a href="https://www.jstor.org/stable/27645865?casa_token=QqFTvsgYX0MAAAAA:VfdvDgUOdMH95y5V-d9YQ4P1SemlxCU7Xrx-9OIEG4EN69iIU3L7yU5q4XIewzYjPhpDzKFh-LbJk6X6RiogDo_2fw4kI0Q_Tl5GSgBvaTdzwGGHTj_xQQ&seq=1#metadata_info_tab_contents">Link</a>]</p>
</li>
<li>
<p>Yarkoni, T., & Westfall, J. (2017). Choosing prediction over explanation in psychology: Lessons from machine learning. <em>Perspectives on Psychological Science</em>, <em>12(6)</em>, 1100-1122. [<a href="https://journals.sagepub.com/doi/abs/10.1177/1745691617693393?casa_token=FaEkfz8xxLMAAAAA%3AxO7ygcT8h8GVYPqizcJ8Mt3spZ8vinhA4yGQ_j1w_-HwjqZ04-yphCnCsC0j0S2xghh5DR69ppb3od4">Link</a>]</p>
</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>See <a href="https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/">this</a> interview and Martin Ford’s <a href="https://www.amazon.com/Architects-Intelligence-truth-people-building/dp/1789131510/ref=sr_1_1?s=books&ie=UTF8&qid=1546765292&sr=1-1&keywords=martin+ford">new book</a> in which he interviews 23 leading voices in AI, one of them being Pearl. (The comment about curve fitting is on p. 366). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This blog post was in part inspired by Neil Lawrence’s talk at the Gaussian Process summer school last year. You can view it <a href="http://gpss.cc/gpss18/program">here</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I think it is fair to say that this ‘error’ is a catch-all term, expressing our ignorance or lack of knowledge. <a href="https://www.bayesianspectacles.org/a-galton-board-demonstration-of-why-all-statistical-models-are-misspecified/">This</a> blog post argues that, by virtue of introducing error, all statistical models are misspecified. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>The generalization to higher dimensions is straightforward. Instead of simple derivatives, we have to take partial derivatives, one for each parameter $b$. When expressed in matrix form, the solution yields $\mathbf{b} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T y$, where $\mathbf{x}$ is a $n \times p$ matrix and $\mathbf{b}$ is a $p \times 1$ vector. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The generalization to higher dimensions is straightforward. Simply let $\mathbf{x}$ be an $n \times p$ matrix and $\mathbf{b}$ a $p \times 1$ vector, where $p$ is the number of dimensions. The derivation is exactly the same, except that because $\mathbf{x}^T \mathbf{x}$ is no longer a scalar but a matrix, we have to left-multiply by its inverse. This yields $\mathbf{b} = (\mathbf{x}^T \mathbf{x})^{-1} \mathbf{x}^T y$. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>To arrive at this maximization criterion, Gauss used Bayes’ rule. Inspired by Laplace, he assumed a uniform prior over the parameters and chose the mode of the posterior distribution over the parameters as his estimator; this is equivalent to maximum likelihood estimation, see also Stigler (2007). <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>Note that while I have mostly talked about “fitting lines” or “predicting values”, the words used during the discovery of least squares were “summarizing” or “aggregating” observations. For instance, they would say that, absent other information, the mean summarizes the data best, while I would, using the language of this blog post, be forced to say that the mean predicts the data best. I think that “summarizing” is more adequate than “predicting”, especially since we are not predicting out of sample (see also Yarkoni & Westfall, <a href="https://journals.sagepub.com/doi/abs/10.1177/1745691617693393?casa_token=HBqivCFyDcUAAAAA%3ABMKLq2EDzASwBuP5yNRBXk45iblKe1RJ9-lBSI3sR70ATw28R7gilW1s30iDIgW8QYonpDqxs14J9w">2017</a>). <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Gauss introduced a <a href="https://en.wikipedia.org/wiki/Normal_distribution#History">different parameterization</a> using <em>precision</em> instead of variance. The parameterization using variance was introduced by Fisher. (Karl Pearson used the standard deviation before.) <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>The law states that “no scientific discovery is named after its original discoverer” (Stigler, <a href="https://nyaspubs.onlinelibrary.wiley.com/doi/abs/10.1111/j.2164-0947.1980.tb02775.x">1980</a>, p. 147). <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
<li id="fn:10">
<p>The proof is not difficult, and can be found in Blitzstein & Hwang (<a href="https://www.amazon.com/Introduction-Probability-Chapman-Statistical-Science/dp/1466575573">2014</a>, p. 436), which is an amazing book on probability theory full of <a href="https://twitter.com/fdabl/status/981106227143954432">gems</a>. I wholeheartedly recommend working through Blitzstein’s <a href="https://projects.iq.harvard.edu/stat110/youtube">Stat 110</a> class — it’s one of the best classes I ever took (online, and in general). <a href="#fnref:10" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian DablanderJudea Pearl said that much of machine learning is just curve fitting1 — but it is quite impressive how far you can get with that, isn’t it? In this blog post, we will look at the mother of all curve fitting problems: fitting a straight line to a number of points. In doing so, we will engage in some statistical detective work and discover the method of least squares as well as the Gaussian distribution.2 Fitting a line A straight line in the Euclidean plane is described by an intercept ($b_0$) and a slope ($b_1$), i.e., We are interested in finding the values for $(b_0, b_1)$, and so we must collect data points $d = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Data collection is often tedious, so let’s do it one point at a time. The first point is $P_1 = (x_1, y_1) = (1, 2)$, and if we plug the values into the equation for the line (i.e., set $x_1 = 1$ and $y_1 = 2$), we get an equation with two unknowns. We call this system of equations underdetermined because we cannot uniquely solve for $b_0$ and $b_1$, but we will have a number of solutions, all of which satisfy $b_1 = 2 - b_0$; see Figure 1 on the left. However, if we add another point $P_2 = (3, 1)$, the resulting system of equations becomes We have two equations in two unknowns, and this is determined or identified: there is a unique solution for $b_0$ and $b_1$. After some rearranging, we find $b_1 = -0.5$ and $b_0 = 2.5$. This specifies exactly one line, as you can see in Figure 2 on the right. We could end the blog post here, but that would not be particularly insightful for data analysis problems in which we have more data points. Thus, let’s see where it takes us when we add another point, $P_3 = (2, 2)$. The resulting system of equations becomes which is overdetermined — we cannot fit a line that passes through all three of these points. We can, for example, fit three separate lines, given by two out of three of the equations; see Figure 3.
But which of these lines, if any, is the “best” line? (Over)Fitting a curve Lacking justification to choose between any of these three, we could reduce the case to one that we have solved already, which is usually a good strategy in mathematics. In particular, we could try to reduce the overdetermined to the determined case. Above, we noticed that we can exactly solve for the two parameters $(b_0, b_1)$ using two data points ${P_1, P_2}$. This generalizes such that we can exactly solve for $p$ parameters using $p$ data points. In the problem above, we have three data points, but only two parameters. Let’s bend the notion of a line a bit — call it curve — and introduce a third parameter $b_2$. But what multiplies this parameter $b_2$ in our equations? It seems we are missing a dimension. To amend this, let’s add a dimension by simply squaring the $x$ coordinate such that a new point becomes $P_1’ = (y_1, x_1, x_1^2)$. The resulting system of equations is To simplify notation, we can write these equations in matrix algebra. Specifically, we write where we are again interested in solving for the unknown $\mathbf{b}$. Because this system is determined, we can arrive at the solution by inverting the matrix $\mathbf{X}$, such that $\mathbf{b} = \mathbf{X}^{-1}\mathbf{y}$, where $\mathbf{X}^{-1}$ is the inverse of $\mathbf{X}$. The resulting “line” is shown in Figure 4 on the left. There are two issues with this approach. First, it leads to overfitting, that is, while we explain the data at hand well (in fact, we do so perfectly), it might poorly generalize to new data. For example, this curve is so peculiar (and it would get much more peculiar if we had fitted it to more data in the same way) that it is likely that new points lie far away from it. Second, we haven’t really explained anything. In the words of the great R.A. Fisher: “[T]he objective of statistical methods is the reduction of data. 
A quantity of data, which usually by its mere bulk is incapable of entering the mind, is to be replaced by relatively few quantities which shall adequately represent […] the relevant information contained in the original data.” (Fisher, 1922, p. 311) By introducing as many parameters as we have data points, no such reduction has taken place. So let’s go back to our original problem: which line should we draw through three, or more generally, any number of points $n$? Legendre and the “best fit” Reducing the overdetermined to the determined case did not really work. But there is still one option: reducing it to the underdetermined case. To achieve that, we make the reasonable assumption that each observation is corrupted by noise, such that where ($\epsilon_1$, $\epsilon_2$, $\epsilon_3$) are unobserved quantities3. This introduces another $n$ unknowns! Therefore, we have too few equations for too many parameters — our previously overdetermined system becomes underdetermined. However, as we saw above, we cannot uniquely solve an underdetermined system of equations. We have to add more constraints. Adrien-Marie Legendre, of whom favourable pictures are difficult to find, proposed what has become known as the methods of least squares to solve this problem: “Of all the principles which can be proposed for that purpose, I think there is none more general, more exact, and more easy of application, that of which we made use in the preceding researches, and which consists of rendering the sum of squares of the errors a minimum. By this means, there is established among the errors a sort of equilibrium which, preventing the extremes from exerting an undue influence, is very well fitted to reveal that state of the system which most nearly approaches the truth. (Legendre, 1805, p. 72-73) There is only one line that minimizes the sum of squared errors. Thus, by adding this constraint we can uniquely solve the underdetermined system; see the Figure on the right. 
The development of least squares was a watershed moment in mathematical statistics — Stephen Stigler likens its importance to the development of calculus in mathematics (Stigler, 1986, p. 11). We have now seen, conceptually, how to find the “best” fitting line. But how do we do it mathematically? How do we arrive at the Figure on the right? I will illustrate two approaches: the one proposed by Legendre using optimization, and another one using a geometrical insight. Least squares I: Optimization Our goal is to find the line that minimizes the sum of squared errors. To simplify, we center the data by subtracting the mean from $y$ and $x$, respectively; i.e., $y’ = y - \frac{1}{n} \sum_{i=1}^n y_i$ and $x’ = x - \frac{1}{n} \sum_{i=1}^n x_i$. This makes it such that the intercept is zero, $b_0 = 0$, and we avoid the need to estimate it. In the following, to avoid cluttering notation, I will omit the apostrophe and assume both $y$ and $x$ are mean-centered. For a particular observation $y_i$, our line predicts it to be $x_i b_1$. This implies that the error is $\epsilon_i = y_i - x_i b_1$, and the sum of all squared errors is We want to find the value for $b_1$, call it $\hat{b}_1$, that minimizes this quantity; that is, we must solve We could use fancy algorithms like gradient descent, but we can also engage in some good old high school mathematics and minimize the expression analytically. We note that the expression is quadratic, and thus has a single minimum, and this happens when the derivative is zero. Alas, to work! where $\sum_{i=1}^n y_i x_i$ is the (scaled by $n$) covariance between x and y, and $\sum_{i=1}^n x_i^2$ is the (scaled by $n$) variance of x.4 Least squares II: Projection Another way to think about this problem is geometrically. This requires some linear algebra, and so we better write the system of equations in matrix form. For ease of exposure, we again mean-center the data. 
First, note that the errors in matrix form yield and that the errors are perpendicular to the x-axis, that is, they are at a 90 degree angle of each other; see the Figure on the left. This means that the dot product of the vector of errors and the x-axis points is zero, i.e., or $\mathbf{x}^T \mathbf{\epsilon} = 0$, in short. Using this geometric insight, we can derive the least squares solution as follows which yields the same result as above.5 As an important special case, note the least square solution to a system of equations with only the intercept $b_0$ as unknown, i.e., $y_i = b_0$, yields the mean of $y$. It is this fact that Gauss used to justify the Gaussian distribution as an error distribution, see below. Gauss, Laplace, and “how good is best?” The method of least squares yields the “best” fitting line in the sense that it minimizes the sum of squared errors. But without any statements about the stochastic nature of the errors $\mathbf{\epsilon}$, the question of “how good is best?” remains unanswered. It was Carl Friedrich Gauss who in 1809 couched the least squares problem in probabilistic terms. Specifically, he assumed that each error term $\epsilon_i$ comes from some distribution $\phi$. Using this distribution, the probability (density) of a particular $\epsilon_i$ is large when $\epsilon_i$ is small, that is, when observed and predicted value are close together. Further assuming that the errors are independent and identically distributed, he wanted to find the parameter values which maximize that is, maximize the probability of the errors being small (see also Stigler, 1986, p. 141).6 All that is left now is to find the distribution $\phi$. Gauss noticed that he could make some general statements about $\phi$, namely that it should be symmetric and have its maximum at 0. 
He then assumed that the mean should be the best value for summarizing $n$ measurements $(y_1, \ldots, y_n)$; that is, he assumed that maximizing $\Omega$ should lead to the same solution as minimizing the sum of squared errors when we have one unknown.7 With this circularity — to justify least squares, I assume least squares — he proved that the distribution must be of the form where $\sigma^2$ is the variance8; see the Figure below for three examples. The distribution has become known as the Gaussian distribution, although — in the spirit of Stigler’s law of eponomy9 — de Moivre and Laplace have discovered it before Gauss (see also Stahl, 2006). Karl Pearson popularized the term normal distribution, an act for which he seems to have shown some regret: “Many years ago I called the Laplace–Gaussian curve the normal curve, which name, while it avoids an international question of priority, has the disadvantage of leading people to believe that all other distributions of frequency are in one sense or another ‘abnormal’.” (Pearson, 1920, p. 25) Using the Gaussian distribution, the maximization problem becomes Note that the value at which $\Omega$ takes its maximum does not change when we drop the constants and take logarithms. This results in $-\sum_{i=1}^n \epsilon_i^2$ as being the expression to be maximized, which is the same as minimizing its negation, that is, minimizing the sum of squared errors. The “Newton of France”, Pierre Simone de Laplace, took notice of Gauss’ argument in 1810 and rushed to give the Gaussian error curve a much more beautiful justification. If we take the errors to be themselves aggregates of many (tiny) perturbing influences, then they will be normally distributed by the central limit theorem. So what is this central limit theorem, anyway? The central limit theorem The central limit theorem is one of the most stunning theorems of statistics. 
In the poetic words of Francis Galton: “I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66) The theorem basically says that if you have a sequence of independent and identically distributed random variables — what Galton calls “the mob” — and if that sequence has finite variance, then the mean of this sequence, as $n$ grows larger and larger — “the greater the apparent anarchy” — will get closer and closer to a normal distribution. As $n \rightarrow \infty$, the mean in fact converges in distribution to the normal distribution.10 Laplace realized that, if one takes the errors in the least squares problem to be themselves aggregates (i.e., means) of small influences, then they will be normally distributed. This provides an elegant justification for the least squares solution. To illustrate, assume that a particular error $\epsilon_i$ is in fact the average of $m = 500$ small irregularities that are independent and identically distributed; for instance, assume these influences follow a uniform distribution. Let’s say we have $n = 200$ observations, thus 200 individual errors. The R code and Figure below illustrate that the error distribution will tend to be Gaussian.
In Review: Ten Great Ideas About Chance2019-01-11T15:30:00+00:002019-01-11T15:30:00+00:00https://fabiandablander.com/statistics/Ten-Great-Ideas<p><em>The blog post reviews and summarizes the book “Ten Great Ideas about Chance” by Diaconis and Skyrms. A much shorter version of this review has been published in Significance, see <a href="https://rss.onlinelibrary.wiley.com/doi/10.1111/j.1740-9713.2018.01217.x">here</a>.</em></p>
<p><img src="../assets/img/book-great-ideas.png" align="left" style="padding: 10px 10px 10px 0px;" width="250" height="250" /></p>
<p>In ten short chapters, Persi Diaconis and Brian Skyrms provide a bird’s eye perspective on probability theory and its connection to other disciplines. The book grew out of a course which the authors taught at Stanford, and is intended as a “history book, a statistics book, and a philosophy book”.</p>
<p>The first great idea is that chance can be measured. Here, the authors take us back into the 17th and 18th centuries, to scholars such as Cardano, Pascal, Fermat, Newton, and the Bernoullis, who arrived at what one would call today the <em>naive</em> definition of probability by counting equiprobable cases. As gambling was a common hobby back then, the discussion focused on the notion of fairness. Say we engage in a gamble which requires throwing a coin a hundred times, but get interrupted mid-game by a war or a revolution. How should we split the money? With Newtonian physics in hand, John Arbuthnot — who also conducted the first significance test — ponders the role of chance in a deterministic universe. (But wait until chapter nine.) The authors’ fascination for probability is contagious, and it’s hard to suppress a smile when they discuss a deterministic coin-tossing machine that the physics department built for them. The coins would always land on the same side, “viscerally disturbing” the authors (one of whom, by the way, is a former professional magician). The appendix also shatters the idealization of introductory probability textbooks, concluding that ordinary coin tosses are biased: the coin lands the same way it starts with a probability of .51.</p>
<p>With 26 pages, the second great idea receives the most coverage. It concerns judgement, and the authors make it clear that if your judgements, that is, betting behaviour, do not accord with the rules of probability, then bookies can systematically make money off of you. But money is not the end-all be-all, as Gabriel Cramer and Daniel Bernoulli conclude, independently, in discussing the St. Petersburg paradox: Say I throw an unbiased coin until it lands tails. If it lands tails on the $n^{th}$ trial, you receive $2^n$ dollars. While the expected value of this proposal is infinite, you may hesitate even to pay the equivalent of a nice dinner to enter this game. These considerations led to utility theory, and the author’s summary of how John von Neumann, Oskar Morgenstern, and Frank Ramsey pioneered the measurement of utility is both accessible and fascinating.</p>
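<p>The divergence of the expected value in the St. Petersburg game is easy to see empirically: in simulation, the running sample mean of the payoffs never settles down but keeps drifting upwards. A minimal sketch (in Python; the seed and trial counts are arbitrary):</p>

```python
import random

def st_petersburg_payoff():
    """One play: toss a fair coin until it lands tails;
    if tails occurs on toss n, the payoff is 2^n dollars."""
    n = 1
    while random.random() < 0.5:  # heads, keep tossing
        n += 1
    return 2 ** n

random.seed(1)
for trials in (100, 10_000, 1_000_000):
    mean = sum(st_petersburg_payoff() for _ in range(trials)) / trials
    print(trials, mean)  # the sample mean does not converge as trials grow
```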
<p>The third great idea is that the logic of chance is different from the psychology of chance. Here we meet the messiness of the mind, and puzzle over several paradoxes that seemingly violate the axioms of utility theory introduced in the previous chapter. Diaconis and Skyrms are quick to point out that these violations are superficial: with a complete analysis that accounts for how a particular decision would make you feel, the paradoxes introduced by Allais and Ellsberg do not violate the axioms — they cease to test them. The authors go on to discuss the “heuristics and biases” program of Kahneman and Tversky, and this is where the discrepancy between utility theory as a prescriptive theory of choice and actual human behaviour cannot be explained away so easily.</p>
<p>The fourth great idea is the connection between frequency and chance via the law of large numbers; the second part of the chapter discusses the failure of frequentism to provide an adequate treatment of probability. We meet Jacob Bernoulli’s weak law of large numbers and his “swindle” — claiming to have solved the “inverse problem” of going from frequencies to chances without the use of prior probabilities. John Venn tried to formalize probability as the limit of a relative frequency. Within four pages, the authors discuss why this fails, and move on to von Mises’ treatment. Instead of assuming the existence of limiting relative frequencies, von Mises postulated this as a defining attribute of his “Kollektiv” — an infinite series which exhibits global order (i.e., converges to a limiting relative frequency) and local disorder (which is captured by randomness). What is the nature of a random sequence? Von Mises suggested that the limiting relative frequency of a series should be invariant under selection of subsequences chosen by a particular place-selection function. Difficulties arose in finding appropriate place-selection functions, and the authors pick up this idea again in chapter eight. Overall, the idealizations inherent in using frequency to define probability — as put forward by Venn and von Mises — lead to problems that ultimately render the account unfeasible. Although referred to in a future chapter, the appendix of chapter four does not seem to exist.</p>
<p>The fifth chapter represents a slight shift in treatment of the ideas; it concerns the integration of probability theory into mathematics proper, based on set-theoretic principles and measure theory. Naturally, this requires heavier mathematical notation and the introduction of some abstract concepts. The main character here is Andrey Kolmogorov, who provided a unifying treatment of probability in his 1933 book. Regardless of the interpretation of probability — frequentist or otherwise — its mathematical basis has become clear.</p>
<p>In chapter six, we finally meet the Presbyterian minister and hobby mathematician Thomas Bayes. His famous essay is viewed as an answer to Hume’s inductive skepticism. In contrast to Jacob Bernoulli’s “swindle”, Bayes truly solves the “inverse problem” of going from frequencies to chances which the authors illustrate with the problem Bayes starts his essay with: estimating the bias of a coin with a uniform prior. Bayes’ interpretation of probability concerns degrees of belief instead of limiting relative frequencies. Working on the same problem in France, Laplace derived the identical result as Bayes, but further discussed predictions, leading to his (in)famous “rule of succession”: the probability of a success on the next trial given m previous successes in n trials is (m + 1)/(n + 2). Diaconis and Skyrms extend the analysis of Laplace from the uniform to a Beta prior. They also tip their hat to what has been dubbed the “replicability crisis” in science: researchers — guided by Bayes — should report negative studies, pool evidence across studies, and consider prior probabilities. It is good to see such reflections on practical scientific matters in a book about probability.</p>
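<p>Laplace’s rule of succession can be checked numerically: under a uniform prior, the probability of a success on the next trial is the posterior mean of the chance parameter, which a simple grid integration of the binomial kernel recovers. A sketch (in Python; the function name and grid size are illustrative, not from the book):</p>

```python
def rule_of_succession(m, n, grid=100_000):
    """P(success on trial n + 1 | m successes in n trials), uniform prior.
    Computed as the posterior mean of theta by integrating on a grid."""
    num = den = 0.0
    for i in range(1, grid):
        theta = i / grid
        kernel = theta ** m * (1 - theta) ** (n - m)  # likelihood, flat prior
        num += theta * kernel
        den += kernel
    return num / den

# Laplace's closed form: (m + 1) / (n + 2)
print(rule_of_succession(3, 10))  # close to 4 / 12 = 0.333...
```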
<p>Throughout the chapters, we read about probability, frequencies, and chances. But what are they? And how are they related? It takes some detective work to answer these questions. Probability, put on a solid mathematical basis by Kolmogorov, is interpreted by Bayes, Laplace, as well as Diaconis and Skyrms as degree of belief. Frequentists such as Venn and von Mises treat it as the limit of a relative frequency. (Chapter four discussed various defects of this perspective.) A relative frequency is something familiar; for instance, the ratio of $m$ successes in $n$ trials. But what is chance? Curiously for a book on the topic, the authors never give a definition. Throughout the first chapters, they conflate “chance” with “probability”, for example when they define expectation as “weighting the costs and benefits of various outcomes by their chances” (p. 13), or explain that “humans often make mistakes in reasoning about chance” (p. 58). It becomes clear that these concepts are distinct only when the authors discuss how Bayes used a prior probability over chances to solve the “inverse problem”. In this example, chance is the rate parameter $\theta$ in a Binomial model (see also Lindley, 2013, p. 115). Chapter 7 describes how Bruno de Finetti, an Italian probabilist who spearheaded subjective Bayesianism in the 20th century, tied together these different concepts with his representation theorem: if your beliefs about certain outcomes are exchangeable, that is, their order does not influence your probability assignment, this is equivalent to conditional independence in the outcomes and uncertainty about the parameters (i.e., “chances”). More formally, a sequence of random quantities is said to be exchangeable if</p>
<p>\begin{equation}
p(x_1, x_2, \ldots , x_n) = p(x_{\pi(1)}, x_{\pi(2)}, \ldots, x_{\pi(n)})
\end{equation}</p>
<p>holds for all permutations $\pi$ defined on the set $\{1, \ldots, n\}$, and this for every finite $n$. De Finetti’s theorem shows that if this condition holds, then there exists a parametric model $p(x \mid \theta)$ governed by some parameter $\theta$ which, as $n$ tends to infinity, is the limit of some function of the $x_i$’s (in the Binomial case, a relative frequency!), and there exists a probability distribution over $\theta$, $p(\theta)$, such that</p>
<p>\begin{equation}
p(x_1, \ldots, x_n) = \int_{\Theta} \prod_{i=1}^n p(x_i \mid \theta) p(\theta) \mathrm{d}\theta
\end{equation}</p>
<p>(see also Bernardo, 1996). With this, de Finetti “showed how chance, frequency, and degree of belief all interact to give statistical inference” (p. 133).</p>
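<p>The right-hand side of the representation suggests a two-stage recipe that is easy to simulate: draw a chance $\theta$ from the prior, then flip conditionally i.i.d. coins with that chance. The sketch below is my own illustration (the function name and the Beta(2, 2) prior are arbitrary choices); it shows the relative frequency of an exchangeable sequence recovering the latent chance, which is exactly the interaction of chance, frequency, and degree of belief that de Finetti’s theorem describes.</p>

```python
import random

def exchangeable_flips(n, a=2.0, b=2.0, seed=0):
    """Sample a Bernoulli sequence from the de Finetti mixture:
    draw theta ~ Beta(a, b), then flip n conditionally i.i.d.
    coins with success chance theta."""
    rng = random.Random(seed)
    theta = rng.betavariate(a, b)
    return theta, [rng.random() < theta for _ in range(n)]

theta, xs = exchangeable_flips(n=50_000)
freq = sum(xs) / len(xs)
# the relative frequency recovers the latent chance theta
print(round(theta, 3), round(freq, 3))
```

<p>Reordering the resulting sequence changes nothing about its probability, since each flip is an i.i.d. draw given $\theta$; marginalizing over $\theta$ yields exactly the exchangeable distribution in the integral above.</p>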
<p>From the securities of banks to statistical simulations, much of modern technology depends on randomness. But what makes a sequence of numbers random? Von Mises viewed the impossibility of a successful gambling system — this notion is made precise by <em>martingales</em> — as an essential feature of a random sequence. Chapter eight shows how von Mises’ formalism of randomness proved inadequate, and introduces Martin-Löf’s modern theory of algorithmic randomness, which depends on the notion of computability, as the great idea of this chapter. Equipped with these ideas, a random sequence is simply a sequence which passes (increasingly stringent) Martin-Löf tests of randomness. Validating von Mises’ intuition, a successful gambling system is impossible if and only if it is Martin-Löf random.</p>
<p>If Laplace’s demon existed — if, given the current state of all atoms, it could predict all future states — what room is left for probability? What is the role of chance in such a deterministic universe? Already in the 17th century, and discussed in the first chapter, Arbuthnot ponders the role of chance in a deterministic universe. Equipped with Newtonian mechanics, he concludes that chance is merely an artefact of our ignorance. In chapter nine, Diaconis and Skyrms argue that our world is fundamentally a world of chance. They make this point in two rounds. First, the authors discuss why the second law of thermodynamics required a probabilistic reformulation and how the sensitive dependence on initial conditions in dynamical systems makes clear-cut predictions impossible; on this level, Diaconis and Skyrms might well grant Laplace his demon, and agree with Arbuthnot: probability enters classical physics due to the impossibility of exact predictions (which is due to our ignorance). Quantum mechanics — according to the orthodox view — however, “sets the world at chance” (p. 180). Here, as the authors note, predictions appear to be uncertain at a deeper level.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>Each chapter includes images of important historical figures. In chapter ten we see Karl Popper, as if in despair, putting both hands on his head. The last great idea concerns the problem of induction, and Diaconis and Skyrms argue that inductive skepticism can be constrained. “Trying to answer a thoroughgoing skeptic is a fool’s game,” the authors state.</p>
<blockquote>
<p>“But it is possible, and sometimes quite reasonable, to be skeptical about some things but not others. Thus, there are <em>grades of inductive skepticism</em>, which differ in what the skeptic calls into question and what he is willing to accept. For each grade, a discussion of whether such a skeptic’s doubts are justified <em>in his own terms</em> might actually be worthwhile.”</p>
</blockquote>
<p>Take a simple coin-tossing example. After observing a few tosses, why should you believe that the coin’s future behaviour will resemble its past behaviour? This is Hume’s struggle, and de Finetti provides an answer: if you believe that all outcome patterns with the same number of successes are equally likely, then you must believe that the frequency of successes in the past will resemble the frequency of successes in the future, and that this frequency converges. Thus, a skeptic with exchangeable beliefs is incoherent; or, as the authors put it: “<em>If your degrees of belief are exchangeable, you cannot be an inductive skeptic</em>” (p. 200). Diaconis and Skyrms go through the usual objections (e.g., how to specify the prior, what if exchangeability does not hold), ending the book by tying together the various notions explicated in the previous chapters.</p>
<p>The authors are luminaries in their respective fields (statistics and philosophy), and it is exciting to see them take the time to share their thoughts with a more popular audience. They also know how to condense their vast knowledge. This is apparent when, for example, they review the early history of probability in a mere eight pages, Martin-Löf tests of randomness in two, or Bell’s theorem in three. While they suggest that one course in statistics or probability suffices as a prerequisite, I think this is overly optimistic: it may hold for the first four chapters, but readers might puzzle over expressions such as “unique smallest Borel field”, dense explanations of exchangeability, or Lyapunov exponents. An example is their treatment of Church’s notion of <em>computability</em>, which is condensed into four <em>paragraphs</em>. I doubt that these paragraphs allow the naive reader to reduce her uncertainty concerning computability. In fact, they may well have increased it.</p>
<p>But then this is a book which aims to provide an overview of the vast web spanned by probability theory in a mere two hundred pages. It is clear that, with such an ambitious goal, some confusion must remain. The authors remedy this to some degree by means of extensive referencing in footnotes<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> and a selected bibliography (most of which, however, is incomprehensible to somebody without a solid understanding of probability) — but this fuels the critic’s suspicion that the book is a polished version of the authors’ lecture notes from their Stanford class. This particular critic may long for the release of the video lectures in which Diaconis and Skyrms would explicate the ideas discussed in the book in more depth.</p>
<p>The ten great ideas are well selected and may complement a reading of Stigler’s book on the seven pillars of statistical wisdom (Stigler, 2016). The ordering of the ideas — from “easier” ones concerning history and psychology to “harder” ones concerning theoretical computer science and quantum physics — makes sense, too, and it is good to see later chapters revisit earlier ideas. Have they missed any big ideas? I don’t think so. But they may have omitted minor ones: for example, that Laplace’s rule of succession is inadequate to establish a general law, as it expresses a strong prior belief that no such law holds. A remedy, which pushed Harold Jeffreys to develop the Bayes factor, is to assign a prior probability to the general law (Zabell, 2005, p. 58).</p>
<p><em>Ten Great Ideas about Chance</em> is well illustrated, and the cover looks great. If you find
yourself discovering it in a bookstore, browsing through its pages, you will certainly be drawn
to the enthusiasm that shines through the concise writing. You decide to buy the book. There
are two ways in which this story can continue. A “pessimist” will puzzle over the book; she will find key concepts only hinted at, their deeper meaning lost in the rapid succession of ideas; she will struggle through most of the chapters, and while each individual sentence makes sense to her, she will have difficulties grasping the bigger picture; lured in by the layout, she expected a popular treatment of probability theory, but is quickly forced to adjust her expectations; she is frustrated and not willing to invest the extra work into understanding the ten great ideas. An “optimist” will puzzle over the book; she will marvel at the great ideas and discuss them with her peers (some of whom have a better grasp on probability); she will occasionally be annoyed by the brevity of exposition, but consult the references to gain deeper insight; she soon realizes that this book is not light reading, and treats it as such.</p>
<p>As illustrated by the above story, I cannot unreservedly recommend the book. In fact, I
have found myself alternating between the position of the pessimist and of the optimist, and
your own transition probability will vary depending on your background and expectations. I
cannot influence the former, but hope that, with this review, I have given you an impression
that can either heighten or curb your enthusiasm for “Ten Great Ideas about Chance”.</p>
<h2 id="references">References</h2>
<ul>
<li>
<p>Bernardo, J. M. (1996). The Concept of Exchangeability and Its Applications. <em>Far East Journal of Mathematical Sciences</em>, 111–122.</p>
</li>
<li>
<p>Diaconis, P. & Skyrms, B. (2017). <em>Ten Great Ideas About Chance</em>. Princeton University Press.</p>
</li>
<li>
<p>Lindley, D. V. (2013). <em>Understanding Uncertainty</em>. John Wiley & Sons.</p>
</li>
<li>
<p>Stigler, S. M. (2006). Isaac Newton as a Probabilist. <em>Statistical Science</em>, <em>21(3)</em>, 400–403.</p>
</li>
<li>
<p>Stigler, S. M. (2016). <em>The Seven Pillars of Statistical Wisdom</em>. Harvard University Press.</p>
</li>
<li>
<p>Zabell, S. L. (2005). <em>Symmetry and Its Discontents: Essays on the History of Inductive Probability</em>. Cambridge University Press.</p>
</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>However, not all physicists agree with the <em>Copenhagen interpretation</em>, i.e., the orthodox view. A non-negligible percentage espouses the <em>Many-worlds interpretation</em>, which is deterministic. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Awkwardly, one footnote claims that Newton was a poor probabilist, referencing Stigler (2004). The correct reference is Stigler (2006), in which Stigler claims the exact opposite, stating that “Newton was thinking like a great probabilist.” <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p><em>Fabian Dablander. This post reviews and summarizes the book “Ten Great Ideas about Chance” by Diaconis and Skyrms. A much shorter version of this review has been published in Significance.</em></p>