<h1>Causal effect of Elon Musk tweets on Dogecoin price</h1>
<p><em>Fabian Dablander, 2021-02-07. <a href="https://fabiandablander.com/r/Causal-Doge">https://fabiandablander.com/r/Causal-Doge</a></em></p>
<!-- Two weeks ago I made a couple of hundred bucks in the Dogecoin mania. It was strangely exhilarating to follow, in real-time, how a pump-and-dump scheme evolves. I feel bad because (a) I could have made more, and (b) it's a zero-sum game --- somebody lost money, potentially somebody who needs it more than me. (Just kidding. If you're "investing" in Dogecoin, all bets are off.) -->
<p>If you think of Dogecoin — the cryptocurrency based on a meme — you can’t help but also think of Elon Musk. That guy loves the doge, and every time he tweets about it, the price goes up. While we all know that <a href="https://fabiandablander.com/r/Causal-Inference.html">correlation is not causation</a>, we might still be able to quantify the causal effect of Elon Musk’s tweets on the price of Dogecoin. Sounds adventurous? That’s because it is! So buckle up before scrolling down.</p>
<h2 id="tanking-tesla">Tanking Tesla</h2>
<p>Elon Musk is notorious for being able to swing markets. In a <a href="https://www.alexpghayes.com/blog/elon-musk-send-tweet/">great blog post</a> from last year, Alex Hayes used the S&P500 as a control to estimate the causal effect of the tweet below on Tesla’s stock price. He used the excellent <a href="https://google.github.io/CausalImpact/"><em>CausalImpact</em></a> R package developed by Brodersen et al. (<a href="https://projecteuclid.org/euclid.aoas/1430226092">2015</a>). I quickly reproduced his analysis, see below and the <em>Post Scriptum</em>.</p>
<p><img src="../assets/img/Tesla-Tweet.png" /></p>
<p>The vertical dashed line indicates the timing of Elon’s tweet, around 15:11 UTC, which corresponds to 16:11 CET (central European winter time) or 17:11 CEST (central European summer time). The black line gives Tesla’s stock price. The blue dashed line gives the model’s prediction of Tesla’s stock price using the S&P500 as a control (see Brodersen et al., <a href="https://projecteuclid.org/euclid.aoas/1430226092">2015</a>, for details on the model). We see that, prior to the tweet, the predictions align well with Tesla’s actual stock price. The time zone throughout the remainder of this blog post, by the way, is CET.</p>
<p>Using the S&P500, Alex predicted what Tesla’s share price would have been had Elon not tweeted. The difference between that prediction and the actual trajectory of Tesla’s stock price is an estimate of the causal effect. This assumes that there were no other events besides Elon’s tweet that influenced Tesla’s stock price but did not influence the S&P500 at the time; that the tweet did not influence the S&P500 itself (Tesla was not in the S&P500 back then); that the relationship between Tesla and the S&P500 continues to hold in the post-tweet period; and that there is no hidden variable that caused both Elon to tweet and Tesla to tank. (And, of course, that <a href="https://twitter.com/fdabl/status/1110944752571158528">counterfactuals make sense</a>.)</p>
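<p>To make the counterfactual logic concrete, here is a minimal sketch on simulated data. It is not the <em>CausalImpact</em> model: it swaps in a plain linear regression of the treated series on the control, fitted on pre-intervention data only, and all numbers (series, intervention time, effect size) are made up for illustration.</p>

```r
# Simulated data; every number here is an assumption for illustration
set.seed(1)
n <- 200          # number of time points
tweet_ix <- 150   # hypothetical intervention time
pre <- 1:tweet_ix
post <- (tweet_ix + 1):n

control <- 100 + cumsum(rnorm(n))                   # control series (think: S&P500)
treated <- 20 + 0.5 * control + rnorm(n, sd = 0.5)  # treated series (think: Tesla)
treated[post] <- treated[post] - 5                  # true effect: a drop of 5

dat <- data.frame(treated = treated, control = control)

# Learn the treated ~ control relationship from pre-intervention data only
m <- lm(treated ~ control, data = dat[pre, ])

# Counterfactual: what the treated series would have been without the intervention
counterfactual <- predict(m, newdata = dat[post, ])

# Estimated effect: actual minus counterfactual, averaged over the post period
effect <- mean(dat$treated[post] - counterfactual)
effect
```

<p>The estimate lands close to the simulated effect only because the simulation satisfies the assumptions listed above: nothing else hits the treated series at the intervention, and the control is unaffected.</p>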
<h2 id="moonshooting-dogecoin-part-i">Moonshooting Dogecoin: Part I</h2>
<p>Let’s turn to the recent Dogecoin mania. The figure below shows the price of Dogecoin and Bitcoin for a selected period of time (see the <em>Post Scriptum</em> for how to get the data).</p>
<p><img src="/assets/img/2021-02-07-Causal-Doge.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>Dogecoin exploded that week, largely because Redditors <a href="https://www.forbes.com/sites/roberthart/2021/01/28/its-doge-time-dogecoin-surges-as-reddit-traders-push-to-make-it-the-crypto-gamestop/?sh=5dad096b217e">rallied around</a> it after shooting GameStop to the moon. There are currently about <a href="https://en.wikipedia.org/wiki/Bitcoin">18 million Bitcoins</a> in circulation, and there is a maximum supply of 21 million. There are about <a href="https://en.wikipedia.org/wiki/Bitcoin">127 billion Dogecoins</a> in circulation, and in contrast to Bitcoin, there is no upper limit to what that number can be.</p>
<p>To better compare the two time-series, we standardize them (with respect to themselves) in the figure below.</p>
<p><img src="/assets/img/2021-02-07-Causal-Doge.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
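<p>Standardizing each series with respect to itself can be sketched as follows, assuming it means subtracting the series’ own mean and dividing by its own standard deviation. The prices and column names below are made up for illustration.</p>

```r
# Made-up prices on very different scales
prices <- data.frame(
  crypto = rep(c('Dogecoin', 'Bitcoin'), each = 5),
  price = c(0.007, 0.008, 0.010, 0.030, 0.040,
            25000, 26000, 25500, 27000, 28000)
)

# z-score within each series: subtract its own mean, divide by its own sd
prices$price_std <- ave(
  prices$price, prices$crypto,
  FUN = function(x) (x - mean(x)) / sd(x)
)

prices
```

<p>After this step both series have mean zero and unit variance, so they can share one axis even though the raw prices differ by many orders of magnitude.</p>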
<p>We see that Bitcoin is more volatile at the beginning, but that both cryptocurrencies increase starting at around 28th January 12:00. The vertical black line indicates the time Elon Musk fired off a tweet. What did he share with the world?</p>
<p><img src="../assets/img/Doge-1.png" /></p>
<p>Haha, that’s great stuff … what’s the causal effect of this tweet? Since the S&P500 is in quite a different class than cryptocurrencies, I use Bitcoin to predict the counterfactual Dogecoin price. I use a subset of the above data, starting from 12:00 on the 28th of January, as Bitcoin does not track Dogecoin particularly well before that. Similarly, I only look at a subset of the data after the tweet. This is because cryptocurrencies are extremely volatile, and the causal effect of Elon Musk’s tweet may thus wash out rather quickly.</p>
<p>Using the wonderful <em>CausalImpact</em> R package, we get the following result (see also the <em>Post Scriptum</em>).</p>
<p><img src="/assets/img/2021-02-07-Causal-Doge.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>We see that the model predicts the price of Dogecoin reasonably well prior to Elon’s tweet. The counterfactual Dogecoin price (that is, the price of Dogecoin had Elon not tweeted) is predicted to stay rather flat, while the actual price rises. Yet it does not rise immediately, but with a delay — maybe because he tweeted in the middle of the night? In any event, Dogecoin showed an average increase of 33% (with a 95% credible interval ranging from 23% to 42%), but note that this estimate naturally depends on the post-tweet time frame we consider. In particular, the previous figure showed that the Dogecoin price dips after the initial increase. Overall, however, it does seem that Elon’s tweet had a substantial causal effect on the price of Dogecoin.</p>
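<p>The dependence on the post-tweet time frame is easy to see with a toy calculation. The sketch below uses made-up numbers and a simplified point version of an average relative effect (mean actual minus mean predicted, divided by mean predicted); it ignores the posterior uncertainty that the real analysis reports.</p>

```r
# Made-up post-tweet prices: a delayed rise followed by a dip
actual <- c(10, 10, 12, 14, 15, 13, 11, 10)
# Made-up flat counterfactual prediction
predicted <- rep(10, 8)

# Average relative effect over the first k post-tweet time points
rel_effect <- function(k) {
  (mean(actual[1:k]) - mean(predicted[1:k])) / mean(predicted[1:k])
}

effects <- sapply(c(2, 5, 8), rel_effect)
effects
```

<p>A very short window misses the delayed rise, while a long window averages in the later dip, so the same tweet yields different effect estimates depending on where the window ends.</p>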
<p>Recall that the analysis assumes that there were no other events at the time that selectively influenced Dogecoin but not Bitcoin. However, Redditors rallied around the cryptocurrency at the same time, very likely confounding the tweet’s causal effect. Luckily for us, Elon struck twice.</p>
<h2 id="moonshooting-dogecoin-part-ii">Moonshooting Dogecoin: Part II</h2>
<p>A week after the initial frenzy, Musk fired off a series of tweets about Dogecoin. Let’s zoom in on the data.</p>
<p><img src="/assets/img/2021-02-07-Causal-Doge.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>The vertical black line indicates the time of the first of Elon’s tweets, after which several others followed. What insights can we glean from them?</p>
<p><img src="../assets/img/Doge-2.png" /></p>
<p>Cool, cool. Dogecoin rose substantially after this avalanche of tweets. But again, this does not mean Elon’s tweets caused the price to rise. To assess whether these tweets had a causal effect, I employ the same analysis as above. Since Musk tweeted several times, I take the first tweet as the reference point. Similar to above, I only select a subset of the data, this time starting from 3rd February 12:00.</p>
<p><img src="/assets/img/2021-02-07-Causal-Doge.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>The average causal effect estimate is a price increase of 23%, with a 95% credible interval between 19% and 28% (but note again that this is sensitive to the extent of the post-tweet time period we consider). There is little delay between the first tweet and the price rise, and Redditors rallying around Dogecoin is not as big of a concern as it was previously. But the counterfactual predictions seem somewhat less convincing than before, reflecting the rather poor correlation between Dogecoin and Bitcoin pre-tweet. The method naturally accounts for uncertainty (for details, see Brodersen et al., <a href="https://projecteuclid.org/euclid.aoas/1430226092">2015</a>).</p>
<h2 id="conclusion">Conclusion</h2>
<p>Causal inference always comes with assumptions. Here, we assumed that there was no other event that influenced the price of Dogecoin but not the price of Bitcoin at the time of Elon Musk’s tweets, and that there was no third variable that caused both Musk to tweet and Dogecoin to rise. These assumptions seem more plausible in the second analysis than in the first.</p>
<p>We also assumed that Bitcoin prices track Dogecoin prices reasonably well, and that the relation persists after the tweets. One could sanity-check how suitable Bitcoin is as a control by running the analysis on various subsets of the data, and comparing the predicted Dogecoin price with the actual Dogecoin price. But since there is only so much time I want to spend thinking about Dogecoin on a Sunday afternoon, I leave this validation to others.</p>
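<p>Such a sanity check could look roughly like the placebo sketch below. It uses simulated data and a plain linear regression as a stand-in for the actual model; the fake intervention times, the horizon, and the helper name <code>placebo_error</code> are all hypothetical.</p>

```r
# Simulated pre-tweet data in which nothing happens (no intervention at all)
set.seed(2)
n <- 300
control <- 100 + cumsum(rnorm(n))                  # think: Bitcoin
treated <- 5 + 0.3 * control + rnorm(n, sd = 0.3)  # think: Dogecoin
dat <- data.frame(treated, control)

# Pretend an intervention happened at fake_ix, fit on the data before it,
# and measure how far the prediction is from the actual series afterwards
placebo_error <- function(fake_ix, horizon = 30) {
  pre <- 1:fake_ix
  post <- (fake_ix + 1):(fake_ix + horizon)
  m <- lm(treated ~ control, data = dat[pre, ])
  pred <- predict(m, newdata = dat[post, ])
  mean(abs(dat$treated[post] - pred) / abs(dat$treated[post]))
}

# Mean absolute percentage error at several fake intervention times
errors <- sapply(c(100, 150, 200, 250), placebo_error)
errors
```

<p>If the control is suitable, these placebo errors should be small relative to the effect size one hopes to detect; large errors at times when nothing happened would cast doubt on the counterfactual predictions.</p>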
<p>One could probably come up with a better control by combining several different cryptocurrencies instead of relying only on Bitcoin — or drop the whole control spiel and slap a Gaussian process on the doge in an interrupted time-series manner (e.g., Leeftink & Hinne, <a href="http://proceedings.mlr.press/v136/leeftink20a.html">2020</a>). On a more philosophical note, the analysis assumes that counterfactual statements make sense, which is not uncontroversial (e.g., Dawid, <a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2000.10474210">2000</a>; Peters, Janzing, & Schölkopf, <a href="https://mitpress.mit.edu/books/elements-causal-inference">2017</a>, p. 106).</p>
<p>The analysis further assumes that Bitcoin prices are not influenced by Musk’s tweets. If they were influenced by them — say they cause a rise in Bitcoin prices — then the causal effect on Dogecoin would be downward biased. It seems likely that Musk’s tweets, if they were to influence Dogecoin, would also influence Bitcoin (e.g., simply by drawing attention to cryptocurrencies), and so if one were really interested in an unbiased — or rather, <em>less biased</em> — estimate, one would have to think harder.</p>
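<p>A small simulation illustrates the direction of this bias. It is a sketch under made-up assumptions: a linear regression replaces the actual model, the true effect on the treated series is +5, and the hypothetical <code>spillover</code> argument is the amount by which the tweet also lifts the control.</p>

```r
set.seed(3)
n <- 200
tweet_ix <- 150
post <- (tweet_ix + 1):n

base <- 100 + cumsum(rnorm(n))  # control series without any tweet effect
noise <- rnorm(n, sd = 0.5)     # noise in the treated series, drawn once

estimate_effect <- function(spillover) {
  control <- base
  control[post] <- control[post] + spillover  # tweet also lifts the control
  treated <- 10 + 0.5 * base + noise
  treated[post] <- treated[post] + 5          # true effect on the treated series
  dat <- data.frame(treated, control)
  m <- lm(treated ~ control, data = dat[1:tweet_ix, ])
  mean(dat$treated[post] - predict(m, newdata = dat[post, ]))
}

unbiased <- estimate_effect(spillover = 0)  # control unaffected by the tweet
biased <- estimate_effect(spillover = 4)    # control rises because of the tweet
c(unbiased = unbiased, biased = biased)
```

<p>When the control rises because of the tweet, the predicted counterfactual rises with it, and the gap between actual and predicted price shrinks. The estimate with spillover is therefore smaller than the one without.</p>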
<p>Elon Musk has 46 million Twitter followers, and while I would not trust the precise causal effect estimates we arrived at in this blog post, it seems pretty plausible to me that he could influence the price of Dogecoin by mere keystrokes. I don’t think, however, that this is a good thing.</p>
<hr />
<p><em>I would like to thank <a href="https://twitter.com/abacilieri">Andrea Bacilieri</a> for very helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<p>The code below gets the relevant data sets from <a href="https://www.tiingo.com/">Tiingo</a> using the <em>riingo</em> R package. This requires an API key, but you can download the data from <a href="https://fabiandablander.com/assets/data/tesla-data.csv">here</a> (for the Tesla re-analysis) and <a href="https://fabiandablander.com/assets/data/doge-data-1.csv">here</a> and <a href="https://fabiandablander.com/assets/data/doge-data-2.csv">here</a> (for the two Dogecoin analyses) in case you do not want to create an account.</p>
<h3 id="tesla-analysis">Tesla Analysis</h3>
<p>The code below reproduces the analysis by Alex Hayes. Note that Musk tweeted in May, in which central Europe is in summer time (CEST), which is UTC+02:00 and not UTC+01:00 … don’t get me started.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'riingo'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'ggplot2'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'CausalImpact'</span><span class="p">)</span><span class="w">
</span><span class="c1"># riingo uses UTC</span><span class="w">
</span><span class="c1"># CET is UTC+01:00</span><span class="w">
</span><span class="c1"># CEST is UTC+02:00</span><span class="w">
</span><span class="c1"># Musk tweeted during summer time</span><span class="w">
</span><span class="n">start</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="s1">'2020-05-01 11:00:00 UTC'</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="n">end</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="s1">'2020-05-01 19:00:00 UTC'</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
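</span><span class="n">tweet</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="s1">'2020-05-01 15:11:00 UTC'</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Font sizes for the plot theme below (assumed values; the original sizes are not shown)</span><span class="w">
</span><span class="n">text_size</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">12</span><span class="w">
</span><span class="n">axis_size</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">14</span><span class="w">
</span><span class="n">title_size</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">16</span><span class="w">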
</span><span class="n">tesla</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">riingo_iex_prices</span><span class="p">(</span><span class="w">
</span><span class="s1">'TSLA'</span><span class="p">,</span><span class="w"> </span><span class="n">start_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'2020-05-01'</span><span class="p">,</span><span class="w">
</span><span class="n">end_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'2020-05-01'</span><span class="p">,</span><span class="w"> </span><span class="n">resample_frequency</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'1min'</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">end</span><span class="p">)</span><span class="w">
</span><span class="n">sp500</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">riingo_iex_prices</span><span class="p">(</span><span class="w">
</span><span class="s1">'SPY'</span><span class="p">,</span><span class="w"> </span><span class="n">start_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'2020-05-01'</span><span class="p">,</span><span class="w">
</span><span class="n">end_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'2020-05-01'</span><span class="p">,</span><span class="w"> </span><span class="n">resample_frequency</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'1min'</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">date</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="n">end</span><span class="p">)</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">tesla</span><span class="o">$</span><span class="n">date</span><span class="w">
</span><span class="n">tweet_ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">tweet</span><span class="p">)</span><span class="w">
</span><span class="n">tofit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">zoo</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">tesla</span><span class="o">$</span><span class="n">close</span><span class="p">,</span><span class="w"> </span><span class="n">sp500</span><span class="o">$</span><span class="n">close</span><span class="p">),</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">CausalImpact</span><span class="p">(</span><span class="n">tofit</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">tweet_ix</span><span class="p">)],</span><span class="w"> </span><span class="n">times</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="n">tweet_ix</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">times</span><span class="p">))])</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">fit</span><span class="p">,</span><span class="w"> </span><span class="s1">'original'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Time'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Price ($)'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Tweeting about Tesla'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="w">
</span><span class="n">axis.text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">text_size</span><span class="p">),</span><span class="w">
</span><span class="n">axis.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">axis_size</span><span class="p">),</span><span class="w">
</span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title_size</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
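<p>The time zone bookkeeping can be checked in base R rather than by hand. The sketch below converts the tweet times from UTC to central European time; using <code>'Europe/Berlin'</code> as the zone name is my assumption, any central European zone would do.</p>

```r
# Tesla tweet: May, so central Europe is on summer time (CEST, UTC+02:00)
tesla_tweet <- as.POSIXct('2020-05-01 15:11:00', tz = 'UTC')
format(tesla_tweet, tz = 'Europe/Berlin', usetz = TRUE)

# First Dogecoin tweet: January, so winter time applies (CET, UTC+01:00)
doge_tweet <- as.POSIXct('2021-01-28 22:47:00', tz = 'UTC')
format(doge_tweet, tz = 'Europe/Berlin', usetz = TRUE)
```

<p>Keeping all timestamps in UTC for the analysis and converting only for display avoids the summer and winter time confusion altogether.</p>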
<h3 id="dogecoin-analysis">Dogecoin Analysis</h3>
<p>The code below gets the data set.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">get_data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">start_date</span><span class="p">,</span><span class="w"> </span><span class="n">end_date</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># Get Bitcoin in euros</span><span class="w">
</span><span class="n">bit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">riingo_crypto_prices</span><span class="p">(</span><span class="w">
</span><span class="s1">'btceur'</span><span class="p">,</span><span class="w"> </span><span class="n">start_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">start_date</span><span class="p">,</span><span class="w">
</span><span class="n">end_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end_date</span><span class="p">,</span><span class="w"> </span><span class="n">resample_frequency</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'1min'</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">crypto</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Bitcoin'</span><span class="p">)</span><span class="w">
</span><span class="c1"># We get Dogecoin in Bitcoin, then convert it to euros</span><span class="w">
</span><span class="n">doge</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">riingo_crypto_prices</span><span class="p">(</span><span class="w">
</span><span class="s1">'dogebtc'</span><span class="p">,</span><span class="w"> </span><span class="n">start_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">start_date</span><span class="p">,</span><span class="w">
</span><span class="n">end_date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">end_date</span><span class="p">,</span><span class="w"> </span><span class="n">resample_frequency</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'1min'</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">mutate</span><span class="p">(</span><span class="n">crypto</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Dogecoin'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Join data frames (and keep only rows where we have dogecoin and bitcoin data)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">full_join</span><span class="p">(</span><span class="n">doge</span><span class="p">,</span><span class="w"> </span><span class="n">bit</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">date</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">(),</span><span class="w">
</span><span class="n">price</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">close</span><span class="p">,</span><span class="w">
</span><span class="n">crypto</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">factor</span><span class="p">(</span><span class="n">crypto</span><span class="p">,</span><span class="w"> </span><span class="n">levels</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Dogecoin'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Bitcoin'</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="c1"># Convert dogecoin price to be relative to the euro, not relative to bitcoin</span><span class="w">
</span><span class="n">dat</span><span class="p">[</span><span class="n">dat</span><span class="o">$</span><span class="n">crypto</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'Dogecoin'</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="o">$</span><span class="n">price</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="n">dat</span><span class="p">[</span><span class="n">dat</span><span class="o">$</span><span class="n">crypto</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'Dogecoin'</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="o">$</span><span class="n">close</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dat</span><span class="p">[</span><span class="n">dat</span><span class="o">$</span><span class="n">crypto</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'Bitcoin'</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="o">$</span><span class="n">close</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># dat <- get_data(start_date = '2021-01-27', end_date = '2021-01-30')</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'http://fabiandablander.com/assets/data/doge-data-1.csv'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="w">
</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
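<p>The conversion from a Dogecoin price in Bitcoin to a price in euros is a single multiplication per time point. The sketch below shows the same idea in base R with made-up numbers, joining the two series by timestamp so that each Dogecoin price is multiplied by the Bitcoin price from the same minute. The data frames and column names are hypothetical, not what <em>riingo</em> returns.</p>

```r
# Made-up Dogecoin prices quoted in Bitcoin
doge_btc <- data.frame(
  date = as.POSIXct(
    c('2021-01-28 11:00:00', '2021-01-28 11:01:00', '2021-01-28 11:02:00'),
    tz = 'UTC'
  ),
  close = c(2e-7, 2.2e-7, 2.5e-7)
)

# Made-up Bitcoin prices quoted in euros (11:01 is missing on purpose)
btc_eur <- data.frame(
  date = as.POSIXct(
    c('2021-01-28 11:00:00', '2021-01-28 11:02:00', '2021-01-28 11:03:00'),
    tz = 'UTC'
  ),
  close = c(30000, 31000, 31500)
)

# Inner join on the timestamp keeps only minutes present in both series
wide <- merge(doge_btc, btc_eur, by = 'date', suffixes = c('_doge_btc', '_btc_eur'))

# (BTC per DOGE) * (EUR per BTC) = EUR per DOGE
wide$doge_eur <- wide$close_doge_btc * wide$close_btc_eur
wide
```

<p>Joining on the timestamp makes the pairing explicit, so the multiplication does not depend on the two series happening to be in the same row order.</p>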
<p>The analysis code for the causal effect of the first tweet is shown below.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">tweets</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="w">
</span><span class="s1">'2021-01-28 22:47:00 UTC'</span><span class="p">,</span><span class="w">
</span><span class="s1">'2021-02-04 07:29:00 UTC'</span><span class="p">,</span><span class="w">
</span><span class="s1">'2021-02-04 08:15:00 UTC'</span><span class="p">,</span><span class="w">
</span><span class="s1">'2021-02-04 07:57:00 UTC'</span><span class="p">,</span><span class="w">
</span><span class="s1">'2021-02-04 08:27:00 UTC'</span><span class="w">
</span><span class="p">),</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit_model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">datsel</span><span class="p">,</span><span class="w"> </span><span class="n">tweet_time</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">doge</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">datsel</span><span class="p">,</span><span class="w"> </span><span class="n">crypto</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'Dogecoin'</span><span class="p">)</span><span class="w">
</span><span class="n">bit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">datsel</span><span class="p">,</span><span class="w"> </span><span class="n">crypto</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'Bitcoin'</span><span class="p">)</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">doge</span><span class="o">$</span><span class="n">date</span><span class="w">
</span><span class="n">tofit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">zoo</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">doge</span><span class="o">$</span><span class="n">price</span><span class="p">,</span><span class="w"> </span><span class="n">bit</span><span class="o">$</span><span class="n">price</span><span class="p">),</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w">
</span><span class="n">tweet_ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">which</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="n">tweet_time</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">CausalImpact</span><span class="p">(</span><span class="w">
</span><span class="n">tofit</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">tweet_ix</span><span class="p">)],</span><span class="w"> </span><span class="n">times</span><span class="p">[</span><span class="nf">c</span><span class="p">(</span><span class="n">tweet_ix</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">times</span><span class="p">))]</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Select subset of data for analysis</span><span class="w">
</span><span class="n">start_analysis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="s1">'2021-01-28 11:00:00 UTC'</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="n">end_analysis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="s1">'2021-01-29 01:00:00 UTC'</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="n">datsel</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">dat</span><span class="p">,</span><span class="w"> </span><span class="n">between</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">start_analysis</span><span class="p">,</span><span class="w"> </span><span class="n">end_analysis</span><span class="p">))</span><span class="w">
</span><span class="n">fit1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fit_model</span><span class="p">(</span><span class="n">datsel</span><span class="p">,</span><span class="w"> </span><span class="n">tweets</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">fit1</span><span class="p">,</span><span class="w"> </span><span class="s1">'original'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Time'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Price (€)'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Tweeting about Dogecoin (28th January)'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="w">
</span><span class="n">axis.text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">text_size</span><span class="p">),</span><span class="w">
</span><span class="n">axis.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">axis_size</span><span class="p">),</span><span class="w">
</span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title_size</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>The analysis code for the causal effect of the later avalanche of tweets is shown below. For some reason, <em>riingo</em> has lots of missing data during that time period. Thus I downloaded the cryptocurrency data from <a href="https://www.cryptoarchive.com.au/">here</a>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">dat2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'https://fabiandablander.com/assets/data/doge-data-2.csv'</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="w">
</span><span class="n">date</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># Select subset of data for analysis</span><span class="w">
</span><span class="n">start_analysis2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="s1">'2021-02-03 12:00:00 UTC'</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="n">end_analysis2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.POSIXct</span><span class="p">(</span><span class="s1">'2021-02-04 10:00:00 UTC'</span><span class="p">,</span><span class="w"> </span><span class="n">tz</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'UTC'</span><span class="p">)</span><span class="w">
</span><span class="n">datsel2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">dat2</span><span class="p">,</span><span class="w"> </span><span class="n">between</span><span class="p">(</span><span class="n">date</span><span class="p">,</span><span class="w"> </span><span class="n">start_analysis2</span><span class="p">,</span><span class="w"> </span><span class="n">end_analysis2</span><span class="p">))</span><span class="w">
</span><span class="n">fit2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">fit_model</span><span class="p">(</span><span class="n">datsel2</span><span class="p">,</span><span class="w"> </span><span class="n">tweets</span><span class="p">[</span><span class="m">2</span><span class="p">])</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="n">fit2</span><span class="p">,</span><span class="w"> </span><span class="s1">'original'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">xlab</span><span class="p">(</span><span class="s1">'Time'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ylab</span><span class="p">(</span><span class="s1">'Price (€)'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">ggtitle</span><span class="p">(</span><span class="s1">'Tweeting about Dogecoin (4th February)'</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="n">theme</span><span class="p">(</span><span class="w">
</span><span class="n">axis.text</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">text_size</span><span class="p">),</span><span class="w">
</span><span class="n">axis.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">axis_size</span><span class="p">),</span><span class="w">
</span><span class="n">plot.title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">element_text</span><span class="p">(</span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title_size</span><span class="p">,</span><span class="w"> </span><span class="n">hjust</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
Fabian Dablander
A gentle introduction to dynamical systems theory 2020-12-17T12:30:00+00:00 https://fabiandablander.com/r/Dynamical-Systems
<p>Dynamical systems theory provides a unifying framework for studying how systems as disparate as the climate and the behaviour of humans change over time. In this blog post, I provide an introduction to some of its core concepts. Since the study of dynamical systems is vast, I will barely scratch the surface, focusing on low-dimensional systems that, while rather simple, nonetheless show interesting properties such as multiple stable states, critical transitions, hysteresis, and critical slowing down.</p>
<p>While I have previously written about linear differential equations (in the context of <a href="https://fabiandablander.com/r/Linear-Love.html">love affairs</a>) and nonlinear differential equations (in the context of <a href="https://fabiandablander.com/r/Nonlinear-Infection.html">infectious diseases</a>), this post provides a gentler introduction. If you have not been exposed to dynamical systems theory before, you may find this blog post more accessible than the other two.</p>
<p>The bulk of this blog post may be read as a preamble to Dablander, Pichler, Cika, & Bacilieri (<a href="https://psyarxiv.com/5wc28">2020</a>), who provide an in-depth discussion of early warning signals and critical transitions. I recently gave a talk on this work and had the chutzpah to have it be <a href="https://www.youtube.com/watch?v=055Ou_aqKUQ">recorded</a> (with slides available from <a href="https://fabiandablander.com/assets/talks/Early-Warning.html">here</a>). The first thirty minutes or so cover part of what is explained here, in case you prefer frantic hand movements to the calming written word. But without any further ado, let’s dive in!</p>
<h1 id="differential-equations">Differential equations</h1>
<p>Dynamical systems are systems that change over time. The dominant way of modeling how such systems change is by means of differential equations. Differential equations relate the rate of change of a quantity $x$ — which is given by the time derivative $\frac{\mathrm{d}x}{\mathrm{d}t}$ — to the quantity itself:</p>
\[\frac{\mathrm{d}x}{\mathrm{d}t} = f(x) \enspace .\]
<p>If we knew the function $f$, then this differential equation would give us the rate of change for any value of $x$. We are not particularly interested in this rate of change per se, however, but in the value of $x$ as a function of time $t$. We call the function $x(t)$ the <em>solution</em> of the differential equation. Most differential equations cannot be solved analytically, that is, we cannot get a closed-form expression of $x(t)$. Instead, differential equations are frequently solved numerically.</p>
<p>How the system changes as a function of time, given by $x(t)$, is implicitly encoded in the differential equation. This is because, given any particular value of $x$, $f(x)$ tells us in which direction $x$ will change, and how quickly. It is this fact that we exploit when numerically solving differential equations. Specifically, given an initial condition $x_0 \equiv x(t = 0)$, $f(x_0)$ tells us in which direction and how quickly the system is changing. This suggests the following approximation method:</p>
\[x_{n + 1} = x_n + \Delta_t \cdot f(x_n) \enspace ,\]
<p>where $n$ indexes the set {$x_0, x_1, \ldots$} and $\Delta_t$ is the time that passes between two iterations. This is the most primitive way of numerically solving differential equations, known as <a href="https://en.wikipedia.org/wiki/Euler_method">Euler’s method</a>, but it will do for this blog post. The derivative $\frac{\mathrm{d}x}{\mathrm{d}t}$ tells us how $x$ changes in an <em>infinitesimally</em> small time interval $\Delta_t$, and so for sufficiently small $\Delta_t$ we can get a good approximation of $x(t)$. In computer code, Euler’s method looks something like this:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">f</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">xt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">xt</span><span class="p">[</span><span class="n">n</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xt</span><span class="p">[</span><span class="n">n</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">f</span><span class="p">(</span><span class="n">xt</span><span class="p">[</span><span class="n">n</span><span class="m">-1</span><span class="p">])</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">delta_t</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The equation above is a deterministic differential equation — randomness does not enter the picture. If one knows the initial condition, then one can perfectly predict the state of the system at any time point $t$.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup> In the next few sections, we use simple differential equations to model how a population grows over time.</p>
<h1 id="modeling-population-growth">Modeling population growth</h1>
<p>Relatively simple differential equations can lead to surprisingly intricate behaviour. Over the next few sections, we will discover this by slowly extending a simple model for population growth.</p>
<h2 id="exponential-growth">Exponential growth</h2>
<p>In his 1798 <em>Essay on the Principle of Population</em>, Thomas Malthus noted the problems that may come about when the growth of a population is proportional to its size.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup> Letting $N$ be the number of individuals in a population, we may formalize such a growth process as:</p>
\[\frac{\mathrm{d}N}{\mathrm{d}t} = r N \enspace ,\]
<p>which states that the rate of change of the population is proportional to the population size itself, with $r > 0$ being a parameter indicating the growth rate. Using $r = 1$, the left panel in the figure below visualizes this <em>linear</em> differential equation. The right panel visualizes its solutions, that is, the number of individuals as a function of time, $N(t)$, for three different initial conditions.</p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>As can be seen, the population grows exponentially, without limits. While differential equations cannot be solved analytically in general, linear differential equations can. In our case, the solution is given by $N(t) = N_0 e^{t}$, as derived in a <a href="https://fabiandablander.com/r/Linear-Love.html">previous</a> blog post.</p>
<p>You can reproduce the trajectories shown in the right panel by using our solver from above:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">malthus</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="w">
</span><span class="n">solution_malthus</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">malthus</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span></code></pre></figure>
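<p>Because the exponential model has a closed-form solution, it offers a convenient check on how the step size affects Euler's method. The following is a minimal sketch rather than part of the original analysis; the names <code>euler_solve</code> and <code>max_rel_error</code> are hypothetical, and the solver is restated so that the block runs on its own.</p>

```r
# Forward Euler solver, restated so this block is self-contained.
# `euler_solve` is a hypothetical name chosen to avoid masking base::solve().
euler_solve <- function(f, x0, delta_t = 0.01, nr_iter = 1000) {
  xt <- numeric(nr_iter + 1)
  xt[1] <- x0
  for (n in seq(2, nr_iter + 1)) {
    xt[n] <- xt[n - 1] + delta_t * f(xt[n - 1])
  }
  time <- (0:nr_iter) * delta_t
  cbind(time, xt)
}

malthus <- function(x) x  # dN/dt = r * N with r = 1

# Largest relative deviation from N(t) = N0 * exp(t) on the interval [0, t_end]
max_rel_error <- function(delta_t, x0 = 2, t_end = 5) {
  res <- euler_solve(malthus, x0, delta_t, nr_iter = round(t_end / delta_t))
  exact <- x0 * exp(res[, 'time'])
  max(abs(res[, 'xt'] - exact) / exact)
}

err_coarse <- max_rel_error(delta_t = 0.1)   # large steps
err_fine   <- max_rel_error(delta_t = 0.01)  # the step size used above
c(coarse = err_coarse, fine = err_fine)
```

<p>With the smaller step the Euler trajectory stays within a few percent of the analytic solution over this interval, while the larger step underestimates the population noticeably, which is why a sufficiently small $\Delta_t$ matters.</p>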
<p>We can inquire about qualitative features of dynamical systems models. One key feature is the set of <em>equilibrium points</em>, that is, points at which the system does not change. Denote such points as $N^{\star}$, then formally:</p>
\[\frac{\mathrm{d}N^{\star}}{\mathrm{d}t} = f(N^{\star}) = 0 \enspace .\]
<p>In our model, the only equilibrium point is $N^{\star} = 0$. Equilibrium points (also called fixed points) can be <em>stable</em> or <em>unstable</em>. A system that is at a stable equilibrium point returns to it after a small, exogenous perturbation, but does not do so at an unstable equilibrium point. $N^{\star} = 0$ is an unstable equilibrium point, and this is indicated by the white circle in the left panel above. In other words, if the population size is zero, and if we were to add some individuals, then the population would grow exponentially rather than die out.</p>
<h2 id="units-and-time-scales">Units and time scales</h2>
<p>In a differential equation, the units of the left-hand side must match the units of the right-hand side. In our example above, the left-hand side is given in population per unit of time, and so the right-hand side must also be in population per unit of time. Since $N$ is given in population, $r$ must be a rate, that is, have units $1 / \text{time}$. This brings us to a key question when dealing with dynamical system models. What is the time scale of the system?</p>
<p>The model cannot by itself provide an appropriate time scale. In our case, it clearly depends on whether we are looking at, say, a population of bacteria or rabbits. We can provide the system with a time scale by appropriately changing $r$. Take the bacterium <em>Escherichia coli</em>, which can double every 20 minutes. We know from above that this means exponential growth:</p>
\[N(t) = N_0 e^{r t} \enspace.\]
<p>Supposing that we start with two bacteria, $N_0 = 2$, and measure time in units of twenty minutes, so that one doubling corresponds to $t = 1$, the value of $r$ that leads to a doubling every twenty minutes is given by:</p>
\[\begin{aligned}
4 &= 2 e^{r_{\text{coli}}} \\[0.5em]
r_{\text{coli}} &= \text{log }2 \enspace ,
\end{aligned}\]
<p>where this growth rate is with respect to twenty minutes. To get this per minute, we write $r_{\text{coli}} = \text{log }2 / 20$, resulting in the following differential equation for the population growth of <em>Escherichia coli</em>:</p>
\[\frac{\mathrm{d}N_{\text{coli}}}{\mathrm{d}t} = \frac{\text{log }2}{20} N_{\text{coli}} \enspace ,\]
<p>where the unit of time is now minutes. What about a population of rabbits? They grow much more slowly, of course. Suppose they take three months to double in population size (but see <a href="https://fabiandablander.com/r/Fibonacci.html">here</a>). This also yields a rate $r_{\text{rabbits}} = \text{log }2$, but this is with respect to three months. To get this in minutes, we assume that one month has 30 days and write</p>
\[r_{\text{rabbits}} = \text{log }2 / (3 \times 30 \times 24 \times 60) = \text{log }2 / 129600 \enspace.\]
<p>This yields the following differential equation for the growth of a population of rabbits:</p>
\[\frac{\mathrm{d}N_{\text{rabbits}}}{\mathrm{d}t} = \frac{\text{log }2}{129600} N_{\text{rabbits}} \enspace .\]
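<p>The unit conversion above can be checked with a few lines of R. This is only an illustrative sketch; the helper names <code>rate_from_doubling</code> and <code>population</code> are hypothetical and not part of the original code.</p>

```r
# Growth rate per minute implied by a doubling time given in minutes,
# from 2 * N0 = N0 * exp(r * T), which gives r = log(2) / T
rate_from_doubling <- function(doubling_minutes) log(2) / doubling_minutes

r_coli    <- rate_from_doubling(20)                # doubles every 20 minutes
r_rabbits <- rate_from_doubling(3 * 30 * 24 * 60)  # three 30-day months = 129600 minutes

# Analytic solution of dN/dt = r * N
population <- function(t, N0, r) N0 * exp(r * t)

# Starting from two individuals, each population should be
# (approximately) four after one doubling period
N_coli    <- population(20, N0 = 2, r = r_coli)
N_rabbits <- population(129600, N0 = 2, r = r_rabbits)
c(coli = N_coli, rabbits = N_rabbits)
```

<p>Both rates describe the same doubling process; only the number of minutes that one doubling takes differs, which is what sets the time scale.</p>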
<p>The figure below contrasts the growth of <em>Escherichia coli</em> (left panel) with the growth of rabbits (right panel). Unsurprisingly, we see that rabbits are much slower — compare the $x$-axes! — to increase in population than <em>Escherichia coli</em>.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup></p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>This exponential growth model assumes that populations grow indefinitely, without limits. This is arguably incorrect, as Pierre-François Verhulst, a Belgian number theorist, was quick to point out.</p>
<h2 id="sigmoidal-population-growth">Sigmoidal population growth</h2>
<p>A population cannot grow indefinitely because its growth depends on resources, which are finite. To account for this, Pierre-François Verhulst introduced a model with a <em>carrying capacity</em> $K$ in 1838, which gives the maximum size of the population that can be sustained given resource constraints (e.g., Bacaër, <a href="https://www.springer.com/gp/book/9780857291141">2011</a>, pp. 35-39). He wrote:</p>
\[\frac{\mathrm{d}N}{\mathrm{d}t} = r N \left(1 - \frac{N}{K}\right) \enspace .\]
<p>This equation is <em>nonlinear</em> in $N$ and is known as the <em>logistic equation</em>. If $0 < N < K$, then $0 < (1 - N / K) < 1$, which slows the growth of $N$ relative to the exponential model. If on the other hand $N > K$, then the population needs more resources than are available, and the growth rate becomes negative, resulting in population decrease.</p>
<p>The equation above has particular units. For example, $N$ gives the number of individuals in a population, be it bacteria or rabbits, and $K$ counts the maximum number of individuals that can be sustained given the available resources. Similarly, $r$ is a rate with respect to minutes or months, for example. For the purposes of this blog post, we are interested in general properties of this system and extensions thereof, rather than in modeling any particular real-world system. Therefore, we want to get rid of the parameters $K$ and $r$, which are specific to a particular population (say bacteria or rabbits).</p>
<p>We can eliminate $K$ by reformulating the differential equation in terms of $x = \frac{N}{K}$, which is $1$ if the population is at the carrying capacity. Implicit differentiation yields $K \cdot \mathrm{d}x = \mathrm{d}N$, which when plugged into the system gives:</p>
\[\begin{aligned}
K \cdot \frac{\mathrm{d}x}{\mathrm{d}t} &= r N \left(1 - \frac{N}{K}\right) \\[0.5em]
\frac{\mathrm{d}x}{\mathrm{d}t} &= r \frac{N}{K} \left(1 - \frac{N}{K}\right) \\[0.5em]
\frac{\mathrm{d}x}{\mathrm{d}t} &= rx \left(1 - x\right) \enspace .
\end{aligned}\]
<p>Both $N$ and $K$ count the number of individuals (e.g., bacteria or rabbits) in the population, and their ratio $x$ is unit- or dimensionless. For example, $x = 0.50$ means that the population is at half the size that can be sustained at carrying capacity, and we do not need to know the exact number of individuals $N$ and $K$ for this statement to make sense.</p>
<p>In other words, we have <em>non-dimensionalized</em> the differential equation, at least in terms of $x$. We can also remove the dimension of time (whether it is minutes or months, for example), by making the change of variables $\tau = t r$. Since $t$ is given in units of time, and $r$ is given in inverse units of time since it is a rate, $\tau$ is dimensionless. Implicit differentiation yields $\frac{1}{r}\mathrm{d}\tau = \mathrm{d}t$, which plugged in gives:</p>
\[\begin{aligned}
\frac{\mathrm{d}x}{\left(\frac{1}{r}\mathrm{d}\tau\right)} &= r x (1 - x) \\[0.5em]
r \frac{\mathrm{d}x}{\mathrm{d}\tau} & = r x (1 - x) \\[0.5em]
\frac{\mathrm{d}x}{\mathrm{d}\tau} & = x (1 - x) \enspace .
\end{aligned}\]
<p>This got rid of another parameter, $r$, and hence simplifies subsequent analysis. The differential equation now tells us how the population relative to carrying capacity ($x$) changes per unit of dimensionless time ($\tau$). The left panel below shows the dimensionless logistic equation, while the right panel shows its solution for three different initial conditions.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">4</a></sup></p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>You can again reproduce the solutions by running:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">verhulst</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="n">solution_verhulst</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">verhulst</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span></code></pre></figure>
<p>In contrast to the exponential population growth in the previous model, this model shows a sigmoidal growth that hits its ceiling at carrying capacity, $x = N / K = 1$.</p>
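<p>One way to convince oneself that the non-dimensionalization loses nothing is to simulate the logistic equation in its original units and compare $N / K$ with the dimensionless solution. The sketch below does this with Euler's method; the values of $r$ and $K$ and the helper <code>euler_path</code> are assumptions made purely for illustration.</p>

```r
# Minimal forward Euler helper returning only the trajectory (hypothetical name)
euler_path <- function(f, x0, delta_t, nr_iter) {
  xt <- numeric(nr_iter + 1)
  xt[1] <- x0
  for (n in seq_len(nr_iter)) {
    xt[n + 1] <- xt[n] + delta_t * f(xt[n])
  }
  xt
}

# Assumed, purely illustrative parameter values
r <- log(2) / 20  # growth rate per minute (doubling every 20 minutes)
K <- 1000         # carrying capacity in individuals

logistic_dim    <- function(N) r * N * (1 - N / K)  # dN/dt, in individuals per minute
logistic_nondim <- function(x) x * (1 - x)          # dx/dtau, dimensionless

delta_t <- 1    # one minute per step
nr_iter <- 600  # ten hours in total

N_path <- euler_path(logistic_dim, x0 = 10, delta_t = delta_t, nr_iter = nr_iter)
# Since tau = r * t, a step of delta_t minutes is a step of r * delta_t in tau
x_path <- euler_path(logistic_nondim, x0 = 10 / K, delta_t = r * delta_t, nr_iter = nr_iter)

# Largest discrepancy between the rescaled and the dimensionless trajectory
max_diff <- max(abs(N_path / K - x_path))
max_diff
```

<p>The two trajectories agree up to floating-point error, so any conclusion drawn from the dimensionless model carries over to a specific population once $r$ and $K$ are known.</p>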
<p>We can again analyze the equilibrium points of this system. In addition to the unstable fixed point at $x^{\star} = 0$, the model also has a stable fixed point at $x^{\star} = N / K = 1$, which is indicated by the gray circle in the left panel. Why is this point stable? Looking at the left panel above, we see that if we were to decrease the population size slightly below $x^{\star} = 1$, we would have $\frac{\mathrm{d}x}{\mathrm{d}\tau} > 0$, and hence the population would increase towards $x^{\star} = 1$. If, on the other hand, we were to increase the population size above the carrying capacity, $x > 1$, we would have $\frac{\mathrm{d}x}{\mathrm{d}\tau} < 0$ (not shown in the graph), and so the population size would decrease towards $x^{\star} = 1$.</p>
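<p>The stability argument can be mimicked numerically by checking the sign of $\frac{\mathrm{d}x}{\mathrm{d}\tau}$ slightly to either side of a fixed point. The helpers <code>flow_near</code> and <code>is_stable</code> below are hypothetical names, and the check is a rough sketch for one-dimensional systems rather than a general stability test.</p>

```r
verhulst <- function(x) x * (1 - x)

# Sign of dx/dtau just below and just above a candidate fixed point
flow_near <- function(f, x_star, eps = 0.01) {
  c(below = sign(f(x_star - eps)), above = sign(f(x_star + eps)))
}

# In one dimension, a fixed point is stable if the flow points up
# just below it and down just above it
is_stable <- function(f, x_star, eps = 0.01) {
  flow <- flow_near(f, x_star, eps)
  unname(flow['below'] > 0 && flow['above'] < 0)
}

flow_near(verhulst, 0)  # flow points away from 0 on both sides
flow_near(verhulst, 1)  # flow points towards 1 on both sides
stable_at_zero <- is_stable(verhulst, 0)
stable_at_one  <- is_stable(verhulst, 1)
```

<p>Note that the point just below $x^{\star} = 0$ corresponds to a negative population, which has no biological meaning; it is evaluated here only to apply the same check to both fixed points.</p>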
<p>Given any initial condition $x_0 > 0$, the system moves towards the stable equilibrium point $x^{\star} = 1$. This initial movement is a <em>transient phase</em>. Once this phase is over, the system stays at the stable fixed point forever (unless perturbations move it away). I will come back to transients in a later section.</p>
<p>While we have improved on the exponential growth model by encoding <a href="https://www.youtube.com/watch?v=kz9wjJjmkmc">limits to growth</a>, many populations are subject to another force that constrains their growth: predation. In the next section, we will extend the model to allow for predation.</p>
<h2 id="population-growth-under-predation">Population growth under predation</h2>
<!-- Most animals get eaten by other animals, and so the population size of a particular *prey* is influenced by *predators*. -->
<p>In a classic article, Robert May (<a href="https://www.nature.com/articles/269471a0">1977</a>) studied the following model:</p>
\[\frac{\mathrm{d}x}{\mathrm{d}\tau} = \underbrace{x \left(1 - x\right)}_{\text{Logistic term}} - \underbrace{\gamma \frac{x^2}{\alpha^2 + x^2}}_{\text{Predation term}} \enspace ,\]
<p>which includes a predation term that depends nonlinearly on the population size $x$.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote">5</a></sup> This term tells us, for any population size $x$, how strong the pressure on the population due to predation is. The parameter $\alpha$ gives the saturation point, that is, the population size at which predation slows down. If this value is low, the extent of predation rises rapidly with an increased population. To see this, the left panel in the figure below visualizes the predation term for different values of $\alpha$, fixing $\gamma = 0.50$. The parameter $\gamma$, on the other hand, influences the maximal extent of predation. Fixing $\alpha = 0.10$, the right panel shows how the extent of the predation increases with $\gamma$.</p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
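<p>As a rough numerical illustration of the two parameters, the predation term can also be evaluated directly. This is a minimal sketch; the helper name <code>predation</code> and the specific input values are hypothetical and chosen only for illustration. At $x = \alpha$ the term equals $\gamma / 2$, and for $x$ much larger than $\alpha$ it approaches its ceiling $\gamma$:</p>

```r
# Predation term of the model; `predation` is a hypothetical helper name
predation <- function(x, gamma, alpha) gamma * x^2 / (alpha^2 + x^2)

# At x = alpha the term is exactly half of its maximum gamma
half_sat <- predation(x = 0.10, gamma = 0.50, alpha = 0.10)

# A smaller alpha means that predation saturates at smaller population sizes
pred_small_alpha <- predation(x = 0.20, gamma = 0.50, alpha = 0.05)
pred_large_alpha <- predation(x = 0.20, gamma = 0.50, alpha = 0.50)

# For x large relative to alpha, the term approaches (but stays below) gamma
pred_near_max <- predation(x = 1, gamma = 0.50, alpha = 0.10)
```

<p>Comparing <code>pred_small_alpha</code> with <code>pred_large_alpha</code> shows the effect of $\alpha$ at a fixed population size: the lower $\alpha$, the closer predation already is to its maximum.</p>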
<p>As we have discussed before, the units of the left hand side need to be the same as the units of the right hand side. As our excursion in nondimensionalization has established earlier, the logistic equation in terms of $\frac{\mathrm{d}x}{\mathrm{d}\tau}$ is dimensionless — it tells us how the population relative to carrying capacity ($x$) changes per unit of dimensionless time ($\tau$). The predation term we added also needs to be dimensionless, because summing quantities of different units is meaningless. In order for $\alpha^2 + x^2$ to make sense, $\alpha$ must also be given in population relative to the carrying capacity, that is, it must be dimensionless. The parameter $\gamma$ must be given in population relative to carrying capacity per unit of dimensionless time. We can interpret it as the maximum proportion of individuals (relative to carrying capacity) that is killed by predation that is theoretically possible (if $\alpha = 0$ and $x = 1$), per unit of dimensionless time. What a mouthful! Maybe we should have kept the original dimensions? But that would have left us with more parameters! Fearless and undeterred, we move on. For simplicity of analysis, however, we fix $\alpha = 0.10$ for the remainder of this blog post.</p>
<p>Now that we have the units straight, let’s continue with the analysis of the system. We are interested in finding the equilibrium points $x^{\star}$, and we could do this algebraically by solving the following for $x$:</p>
\[\begin{aligned}
0 &= x \left(1 - x\right) - \gamma \frac{x^2}{0.10^2 + x^2} \\
x \left(1 - x\right) &= \gamma \frac{x^2}{0.10^2 + x^2} \enspace .
\end{aligned}\]
<p>However, we can also find the equilibrium points graphically, by visualizing both the left-hand and the right-hand side and seeing where the two lines intersect. Importantly, the intersections will depend on the parameter $\gamma$. The left panel in the figure below illustrates this for three different values of $\gamma$. For all values of $\gamma$, there exists an unstable equilibrium point at $x^{\star} = 0$. If $\gamma = 0$, then the predation term vanishes and we get back the logistic equation, which has a stable equilibrium point at $x^{\star} = 1$. For a low predation rate $\gamma = 0.10$, this stable equilibrium point gets shifted below the carrying capacity and settles at $x^{\star} = 0.89$. Astonishingly, for the intermediate value $\gamma = 0.22$, two stable equilibrium points emerge, one at $x^{\star} = 0.03$ and one at $x^{\star} = 0.68$, separated by an unstable equilibrium point at $x^{\star} = 0.29$. For $\gamma = 0.30$, the upper stable equilibrium point and the unstable equilibrium point that separated it from the lower one have vanished, and a single stable equilibrium point at a very low population size $x^{\star} = 0.04$ remains.</p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto;" /></p>
<p>While the left panel above shows three specific values of $\gamma$, the panel on the right visualizes the stable (solid lines) and unstable (dashed lines) equilibrium points as a function of $\gamma \in (0, 0.40)$. This is known as a <em>bifurcation diagram</em> — it tells us the equilibrium points of the system and their stability as a function of $\gamma$. The black dots indicate the <em>bifurcation points</em>, that is, values of $\gamma$ at which the (stability of) equilibrium points change. For this system, we have that a stable and unstable equilibrium point collide and vanish at $\gamma = 0.18$ and $\gamma = 0.26$; while there are many <a href="https://en.wikipedia.org/wiki/Bifurcation_theory#Bifurcation_types">types of bifurcations</a> a system can exhibit, this type of bifurcation is known as a <em>saddle-node bifurcation</em>. The coloured lines indicate the three specific values from the left panel.</p>
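<p>The two bifurcation points can also be located numerically. For $x > 0$, dividing the equilibrium condition by $x$ and rearranging gives $\gamma = (1 - x)(\alpha^2 + x^2) / x$, so every nonzero equilibrium lies on this curve, and the saddle-node bifurcations sit where the curve turns around, that is, at its local minimum and maximum. The following is a minimal sketch under the assumption $\alpha = 0.10$; the helper name <code>gamma_of_x</code> is hypothetical, and the search intervals are chosen by eye from the bifurcation diagram:</p>

```r
alpha <- 0.10

# For x > 0, equilibria satisfy gamma = (1 - x) * (alpha^2 + x^2) / x
gamma_of_x <- function(x) (1 - x) * (alpha^2 + x^2) / x

# Local minimum of the curve: the lower saddle-node bifurcation
lower <- optimize(gamma_of_x, interval = c(0.01, 0.30))

# Local maximum of the curve: the upper saddle-node bifurcation
upper <- optimize(gamma_of_x, interval = c(0.30, 1), maximum = TRUE)

gamma_lower <- lower$objective
gamma_upper <- upper$objective
round(c(gamma_lower, gamma_upper), 2)
```

<p>The two values should come out close to $\gamma = 0.18$ and $\gamma = 0.26$, in line with the black dots in the bifurcation diagram.</p>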
<p>This simple model has a number of interesting properties that we will explore in the next sections. Before we do this, however, we look at another way to visualize the behaviour of the system.</p>
<h2 id="potentials">Potentials</h2>
<p>For unidimensional systems one can visualize the dynamics of the system in an intuitive way by using so-called “ball-in-a-cup” diagrams. Such diagrams visualize the <em>potential function</em> $V(x)$, which is defined in the following manner:</p>
\[\frac{\mathrm{d}x}{\mathrm{d}\tau} = -\frac{\mathrm{d}V}{\mathrm{d}x} \enspace .\]
<p>To solve this, we can integrate both sides with respect to $x$, which yields</p>
\[V(x) = - \int \frac{\mathrm{d}x}{\mathrm{d}\tau} \mathrm{d}x + C \enspace ,\]
<p>where $C$ is the constant of integration, and the potential is defined only up to an additive constant. Notice that $V$ is a function of $x$, rather than a function of time $\tau$. As we will see shortly, $x$ will be the “ball” in the “cup” or landscape that is carved out by the potential $V$. Setting $C = 0$, the potential for the logistic equation with predation is given by:</p>
\[V(x) = \gamma x - \gamma\, \alpha \, \text{tan}^{-1} \left(\frac{x}{\alpha}\right) - \frac{1}{2} x^2 + \frac{1}{3} x^3 \enspace .\]
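<p>A quick way to double-check a potential is to differentiate it numerically and compare $-\frac{\mathrm{d}V}{\mathrm{d}x}$ with the right-hand side of the differential equation. Integrating term by term, $\int x(1 - x)\,\mathrm{d}x = \frac{1}{2}x^2 - \frac{1}{3}x^3$ and $\int \frac{x^2}{\alpha^2 + x^2}\,\mathrm{d}x = x - \alpha \tan^{-1}(x / \alpha)$, so with $C = 0$ we get $V(x) = \gamma x - \gamma \alpha \tan^{-1}(x / \alpha) - \frac{1}{2}x^2 + \frac{1}{3}x^3$. The sketch below assumes $\alpha = 0.10$ and $\gamma = 0.22$; the names <code>f_may</code> and <code>potential</code> are hypothetical:</p>

```r
alpha <- 0.10

# Right-hand side of the logistic equation with predation
f_may <- function(x, gamma) x * (1 - x) - gamma * x^2 / (alpha^2 + x^2)

# Potential with the constant of integration set to zero
potential <- function(x, gamma) {
  gamma * x - gamma * alpha * atan(x / alpha) - x^2 / 2 + x^3 / 3
}

# Central finite difference of V on a grid of population sizes
x <- seq(0.05, 0.95, 0.05)
h <- 1e-6
dV <- (potential(x + h, 0.22) - potential(x - h, 0.22)) / (2 * h)

# If V is correct, -dV/dx matches f up to numerical error
max_abs_diff <- max(abs(-dV - f_may(x, 0.22)))
```

<p>If <code>max_abs_diff</code> is not tiny, a sign or a term in the potential is off, which makes this a cheap sanity check before plotting.</p>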
<p>The figure below visualizes the potentials for three different values of $\gamma$; since the scaling of $V(x)$ is arbitrary, I removed the $y$-axis. The left panel shows the potential for $\gamma = 0.10$, and this corresponds to the case where one unstable fixed point at $x^{\star} = 0$ and one stable fixed point at $x^{\star} = 0.89$ exist. We can imagine the population $x$ as a ball in this landscape; if $x = 0$ and we add individuals to the population, then the ball rolls down into the valley whose lowest point is the stable state $x^{\star} = 0.89$.</p>
<p>The rightmost panel shows that under a high predation rate $\gamma = 0.30$, there are again two fixed points, an unstable one at $x^{\star} = 0$ and a stable one at a very low population size $x^{\star} = 0.04$. Wherever we start on this landscape, unless $x_0 = 0$, the population will move towards $x^{\star} = 0.04$.</p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
<p>The middle panel above is the most interesting one. It shows that the potential for $\gamma = 0.22$ exhibits two valleys, corresponding to the stable fixed points $x^{\star} = 0.03$ and $x^{\star} = 0.68$. These two points are separated by a hill, corresponding to an unstable fixed point at $x^{\star} = 0.29$. Depending on the initial condition, the population would either converge to a very low or moderately high stable size. For example, if $x_0 = 0.25$, then we would “roll” towards the left, into the valley whose lowest point corresponds to $x^{\star} = 0.03$. On the other hand, if $x_0 = 0.40$, then individuals can procreate unabated by predation and reach a stable point at $x^{\star} = 0.68$.</p>
<p>Visualizing potentials as “ball-in-a-cup” diagrams is a widespread way to communicate the ideas of stable and unstable equilibrium points. But they suffer from a number of limitations, and they are a little bit of a gimmick. Potentials generally do not exist for higher-dimensional systems (see Rodríguez-Sánchez, van Nes, & Scheffer, <a href="https://journals.plos.org/ploscompbiol/article?rev=2&id=10.1371/journal.pcbi.1007788">2020</a>, for an approximation).</p>
<p>A necessary requirement for a system to exhibit multiple stable states is the presence of positive feedback mechanisms (e.g., Kefi, Holmgren, & Scheffer, <a href="https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2435.12601">2016</a>). In our example, it may be that, at a sufficient size, individuals in the population can coordinate so as to fight predators better. This would allow them to grow towards the higher stable population size. Below that “coordination” point, however, they cannot help each other efficiently, and predators may have an easy time feasting on them, so the population converges to the smaller stable population size. This constitutes an <a href="https://en.wikipedia.org/wiki/Allee_effect">Allee effect</a>.</p>
<p>Systems that exhibit multiple stable states can show <em>critical transitions</em> between them. These transitions are not only notoriously hard to predict, but they can also be hard to reverse, as we will see in the next section.</p>
<!-- Two important features of a dynamical system are its *stability* and *resilience*. There is substantial heterogeneity in how these two terms are used (see e.g., Pimm, 1984; Scheffer et al., 2015). For our purposes here, we define stability as the time it takes the system to return to equilibrium after a small perturbation. Resilience, on the other hand, is defined as the size of the perturbation the system can withstand before going into another equilibrium. Stability is a multidimensional concept, however. -->
<!-- ## Multiple stable states -->
<!-- A necessary condition for the existence of multiple stable states are positive feedback mechanisms. -->
<h2 id="critical-transitions-and-hysteresis">Critical transitions and hysteresis</h2>
<p>What happens if we slowly increase the extent of predation in our toy model? To answer this, we allow for a slowly changing $\gamma$. Formally, we add a differential equation for $\gamma$ to our system:</p>
\[\begin{aligned}
\frac{\mathrm{d}x}{\mathrm{d}\tau} &= x \left(1 - x\right) - \gamma \frac{x^2}{\alpha^2 + x^2} \\[0.50em]
\frac{\mathrm{d}\gamma}{\mathrm{d}\tau} &= \beta \enspace ,
\end{aligned}\]
<p>where $\beta > 0$ is a constant. In contrast to the first model we studied, $\frac{\mathrm{d}\gamma}{\mathrm{d}\tau}$ does not itself depend on $\gamma$, and hence will not show exponential growth. Instead, its rate of change is constant at $\beta$, and so $\gamma(\tau)$ is a linear function with slope given by $\beta$.</p>
<p>Note further that the differential equation for $\gamma$ does not feature a term that depends on $x$, which means that it is not influenced by changes in the population size. The differential equation for $x$ obviously includes a term that depends on $\gamma$, and so the population size will be influenced by changes in $\gamma$, as we will see shortly. We can again numerically approximate the system reasonably well when choosing $\Delta_t$ to be small. We add small additive perturbations to $x$ at each time step, writing:</p>
\[\begin{aligned}
x_{n + 1} &= x_n + \Delta_t \cdot f(x_n, \gamma_n) + \varepsilon_n \\[0.50em]
\gamma_{n + 1} &= \gamma_n + \Delta_t \cdot \beta \enspace ,
\end{aligned}\]
<p>where $f$ is the logistic equation with predation and $\varepsilon_n \sim \mathcal{N}(0, \sigma)$.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote">6</a></sup> We implement this in R as follows:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_err</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">f</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="p">,</span><span class="w"> </span><span class="n">gamma0</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w">
</span><span class="n">nr_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.0001</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">xt</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">x0</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="p">)</span><span class="w">
</span><span class="n">gammat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">gamma0</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">gammat</span><span class="p">[</span><span class="n">n</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gammat</span><span class="p">[</span><span class="n">n</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">beta</span><span class="w">
</span><span class="n">xt</span><span class="p">[</span><span class="n">n</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">xt</span><span class="p">[</span><span class="n">n</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">f</span><span class="p">(</span><span class="n">xt</span><span class="p">[</span><span class="n">n</span><span class="m">-1</span><span class="p">],</span><span class="w"> </span><span class="n">gammat</span><span class="p">[</span><span class="n">n</span><span class="m">-1</span><span class="p">])</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nr_iter</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">delta_t</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">xt</span><span class="p">,</span><span class="w"> </span><span class="n">gammat</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>We treat $\gamma$ as a parameter that <em>slowly</em> increases from $\gamma = 0$ to $\gamma = 0.40$. We encode the fact that $\gamma$ changes slowly by setting $\beta$ to a small value, in this case $\beta = 0.004$. The average absolute rate of change of $x$ across population sizes and $\gamma \in [0, 0.40]$ is about $0.10$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">may</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">)</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">0.10</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span><span class="w">
</span><span class="n">gammas</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.40</span><span class="p">,</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span><span class="w">
</span><span class="n">mean</span><span class="p">(</span><span class="n">sapply</span><span class="p">(</span><span class="n">gammas</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">gamma</span><span class="p">)</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">may</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">)))))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 0.09725383</code></pre></figure>
<p>and so $x$ changes about $0.10 / 0.004 = 25$ times faster than $\gamma$ on average. Systems where one component changes quickly and the other more slowly are called <em>fast-slow</em> systems (e.g., Kuehn, <a href="https://link.springer.com/article/10.1007/s00332-012-9158-x">2013</a>). The code below simulates one trajectory that we will visualize afterwards.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">delta_t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0.01</span><span class="w">
</span><span class="n">nr_iter</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">10000</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve_err</span><span class="p">(</span><span class="w">
</span><span class="n">f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">may</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.004</span><span class="p">,</span><span class="w"> </span><span class="n">x0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma0</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w">
</span><span class="n">nr_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_iter</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">delta_t</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.001</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>As a reminder, the left panel in the figure below shows the bifurcation diagram for the logistic equation with predation. The right panel shows a critical transition. In particular, the solid black line shows the time-evolution of the population starting at carrying capacity $x_0 = 1$. We slowly increase the predation rate from $\gamma = 0$ up to $\gamma = 0.40$, as the solid blue line indicates. The population size decreases gradually as we increase $\gamma$, and this closely follows the bifurcation diagram on the left. At the bifurcation point $\gamma = 0.26$, however, the population size crashes down to very low levels.</p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>Can we recover the population size by decreasing $\gamma$ again? We can, but it requires substantially more effort! The solid gray line indicates the trajectory starting from a low population size and a high predation rate $\gamma = 0.40$. We reduce $\gamma$, but we have to reduce it all the way to $\gamma = 0.18$ for the population to then suddenly recover again. The phenomenon that a transition may be hard to reverse in this specific sense is known as <em>hysteresis</em>.</p>
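<p>The asymmetry between the two sweeps can be made concrete with a small deterministic simulation. The following is a minimal sketch rather than the code behind the figure: it drops the noise term, repeats the definition of <code>may</code> so that it is self-contained, uses a hypothetical helper <code>sweep_gamma</code>, and mimics the downward sweep by assuming a negative rate $\beta = -0.004$ with an assumed starting value $x_0 = 0.03$. It then compares the population size at the same predation rate, $\gamma = 0.22$, on the way up and on the way down:</p>

```r
# Logistic equation with predation, alpha fixed at 0.10
may <- function(x, gamma) x * (1 - x) - gamma * x^2 / (0.10^2 + x^2)

# Deterministic Euler sweep; a negative beta decreases gamma over time
sweep_gamma <- function(x0, gamma0, beta, delta_t = 0.01, nr_iter = 10000) {
  x <- numeric(nr_iter + 1)
  gamma <- numeric(nr_iter + 1)
  x[1] <- x0
  gamma[1] <- gamma0
  for (n in seq_len(nr_iter)) {
    x[n + 1] <- x[n] + delta_t * may(x[n], gamma[n])
    gamma[n + 1] <- gamma[n] + delta_t * beta
  }
  cbind(gamma = gamma, x = x)
}

# Upward sweep from carrying capacity, downward sweep from a low population
up <- sweep_gamma(x0 = 1, gamma0 = 0, beta = 0.004)
down <- sweep_gamma(x0 = 0.03, gamma0 = 0.40, beta = -0.004)

# Population size at (approximately) gamma = 0.22 in each direction
x_up_at_022 <- up[which.min(abs(up[, "gamma"] - 0.22)), "x"]
x_down_at_022 <- down[which.min(abs(down[, "gamma"] - 0.22)), "x"]
```

<p>At the same value of $\gamma$, the upward sweep is still on the high branch while the downward sweep is still stuck on the low branch: where the system is depends on where it came from.</p>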
<p>In the time-series above, we see that the system moves erratically around the equilibrium: after perturbations that push the system out of equilibrium, it is quick to return. Importantly, changes in $\gamma$ affect the equilibrium of the system itself. At the saddle-node bifurcation $\gamma = 0.26$, the stable equilibrium point vanishes, and the system moves towards the other stable equilibrium point that entails a much lower population size. This “crashing down” is a <em>transient phase</em> during which the system is out of equilibrium. How long the system takes to reach another stable equilibrium point after the one it tracked vanished depends on the nature of the system. Given the eagerness with which <em>Escherichia coli</em> reproduces, for example, it is a matter of mere hours until its population has recovered after predation has been sufficiently reduced. Transitions in the earth’s climate, however, may take hundreds of years.</p>
<p>Here, we assume that we know the equation that describes population growth under predation. For almost all real-world systems, however, we do not have an adequate model and thus may not know whether a particular system is in a transient phase, or whether the changes we are seeing are due to changes in underlying parameters that influence the equilibrium. If the system is in a transient phase, it can change without any change in parameters or perturbations, which, from a conservation perspective, is slightly unsettling (Hastings et al., <a href="https://science.sciencemag.org/content/361/6406/eaat6412.abstract">2018</a>). Yet transients can also hold opportunities. For example, if a system that is pushed over a tipping point has a slow transient, we may still be able to intervene and nurture the system back before it crashes into the unfavourable stable state (Hughes et al., <a href="https://www.sciencedirect.com/science/article/abs/pii/S0169534712002170">2013</a>).</p>
<p>While the simple models we look at in this blog post quickly settle into equilibrium, many real-world systems are periodically forced (e.g., Rodríguez-Sánchez, <a href="https://research.wur.nl/en/publications/cycles-and-interactions-a-mathematician-among-biologists">2020</a>; Strogatz, <a href="http://www.stevenstrogatz.com/books/sync-the-emerging-science-of-spontaneous-order">2003</a>) and may never do so. This can lead to interesting dynamics and has implications for critical transitions (e.g., Bathiany et al., <a href="https://www.nature.com/articles/s41598-018-23377-4">2018</a>), but this is for another blog post.</p>
<p>Critical transitions — such as the one illustrated in the figure above — are hard to foresee. Looking at how the mean of the population size changes, one would not expect a dramatic crash as predation increases. In the next section, we will see how the phenomenon of <em>critical slowing down</em> may help us anticipate such critical transitions.</p>
<h2 id="critical-slowing-down">Critical slowing down</h2>
<p>The logistic equation with predation exhibits a phenomenon called <em>critical slowing down</em>: as the population approaches the saddle-node bifurcation, it returns more slowly to the stable equilibrium after small perturbations. We can study this analytically in our simple model. In particular, we are interested in the dynamics of the system after a (small) perturbation $\eta(\tau)$ that pushes the system away from the fixed point. We write:</p>
\[\begin{aligned}
x(\tau) &= x^{\star} + \eta(\tau) \enspace .
\end{aligned}\]
<p>This is essentially what we had when we simulated from the system and added a little bit of noise at each time step. The dynamics of the system close to the fixed point turn out to be the same as the dynamics of the noise. To see this, we derive a differential equation for $\eta = x - x^{\star}$:</p>
\[\frac{\mathrm{d}\eta}{\mathrm{d}\tau} = \frac{\mathrm{d}}{\mathrm{d}\tau} (x - x^{\star}) = \frac{\mathrm{d}x}{\mathrm{d}\tau} - \frac{\mathrm{d}x^{\star}}{\mathrm{d}\tau} = \frac{\mathrm{d}x}{\mathrm{d}\tau} = f(x) = f(x^{\star} + \eta) \enspace ,\]
<p>since the rate of change at the fixed point, $\frac{\mathrm{d}x^{\star}}{\mathrm{d}\tau}$, is zero and where $f$ is the logistic equation with predation. This tells us that the dynamics of the perturbation $\eta$ are simply given by $f$ evaluated at $x^{\star} + \eta$. For simplicity, we <a href="https://www.youtube.com/watch?v=3d6DsjIBzJ4">linearize this equation</a>, writing:</p>
\[\begin{aligned}
\frac{\mathrm{d}\eta}{\mathrm{d}\tau} = f(x^{\star} + \eta) &= f(x^{\star}) + \eta f'(x^{\star}) + \mathcal{O}(\eta^2) \\
&\approx \eta f'(x^{\star}) \enspace ,
\end{aligned}\]
<p>since $f(x^{\star}) = 0$ and where we ignore higher-order terms $\mathcal{O}(\eta^2)$. While the symbols are different, the structure of the equation might look familiar. In fact, it is a linear equation in $\eta$, and so its solution is given by the exponential function:</p>
\[\eta(\tau) = \eta_0 e^{\tau f'(x^{\star})} \enspace ,\]
<p>where $\eta_0$ is the initial condition. Therefore, the dynamics of the system close to the fixed point $x^{\star}$ is given by:</p>
\[x(\tau) = x^{\star} + \eta_0 e^{\tau f'(x^{\star})} \enspace .\]
<p>In sum, we have derived an (approximate) equation that describes the dynamics of the system close to equilibrium after a small perturbation.
As an aside, this approximation can be used to analyze the stability of fixed points: at a stable fixed point, $f'(x^{\star}) < 0$ and the system hence returns to the fixed point; at unstable fixed points, $f'(x^{\star}) > 0$ and the system moves away from the fixed point. Such an analysis is known as <em>linear stability analysis</em>, because we have linearized the system dynamics close to the fixed point.</p>
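<p>To get a feeling for how good the linear approximation is, we can compare it with a direct simulation. The sketch below makes a few assumptions that are not part of the derivation: it uses $\gamma = 0.10$ and $\alpha = 0.10$, a perturbation of size $\eta_0 = 0.01$, a search interval for the fixed point chosen by eye, and a finite-difference estimate of $f'(x^{\star})$ in place of an analytic derivative. The variable names are hypothetical:</p>

```r
alpha <- 0.10
gamma_val <- 0.10
f <- function(x) x * (1 - x) - gamma_val * x^2 / (alpha^2 + x^2)

# Stable fixed point near 0.89 (search interval chosen by eye)
xstar <- uniroot(f, interval = c(0.5, 1), tol = 1e-12)$root

# Slope of f at the fixed point via a central finite difference
h <- 1e-6
slope <- (f(xstar + h) - f(xstar - h)) / (2 * h)

# Euler simulation of the recovery after a small perturbation eta0
eta0 <- 0.01
delta_t <- 0.001
tau <- seq(0, 5, delta_t)
x_sim <- numeric(length(tau))
x_sim[1] <- xstar + eta0
for (n in seq_along(tau)[-1]) {
  x_sim[n] <- x_sim[n - 1] + delta_t * f(x_sim[n - 1])
}

# Prediction from the linearized dynamics
x_lin <- xstar + eta0 * exp(tau * slope)

# Largest gap between the simulation and the linear approximation
max_gap <- max(abs(x_sim - x_lin))
```

<p>The slope is negative, so the perturbation decays, and for a perturbation this small the simulated and the linearized trajectories should be nearly indistinguishable. Larger values of <code>eta0</code> make the neglected higher-order terms more visible.</p>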
<p>We are now in a position to illustrate the phenomenon of critical slowing down. In particular, note that $\eta(\tau)$ depends on the <em>derivative</em> of the differential equation $f$ with respect to $x$ (denoted by $f'$), evaluated at the fixed point $x^{\star}$. For the logistic equation with predation, we have:</p>
\[\begin{aligned}
f' = \frac{\mathrm{d}f}{\mathrm{d}x} &= \frac{\mathrm{d}}{\mathrm{d}{x}}\left(x(1 - x) - \gamma \frac{x^2}{0.01 + x^2}\right) \\[0.50em]
&= 1 - 2x - \gamma \frac{0.02x}{(0.01 + x^2)^2} \enspace .
\end{aligned}\]
<p>In the following, we will evaluate this function at various equilibrium points $x^{\star}$, which depend on $\gamma$, as we have seen before in the bifurcation diagram. To make this apparent, we define a new function:</p>
\[\begin{equation}
\lambda(x^{\star}, \gamma) = 1 - 2x^{\star} - \gamma \frac{0.02x^{\star}}{(0.01 + (x^{\star})^2)^2} \enspace ,
\end{equation}\]
<p>where the value of $x^{\star}$ is constrained by $\gamma$.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote">7</a></sup> This function gives the <em>recovery rate</em> of the system from small perturbations close to the equilibrium. The code for this function is:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lambda</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">xstar</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">)</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">xstar</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0.02</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">xstar</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">0.01</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">xstar</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="m">2</span></code></pre></figure>
<p>In order to get the equilibrium points $x^{\star}$ for a particular value of $\gamma$, we need to find the values of $x$ for which the logistic equation with predation is zero. We can do this using the following code:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rootSolve'</span><span class="p">)</span><span class="w">
</span><span class="n">get_fixedpoints</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">gamma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">uniroot.all</span><span class="p">(</span><span class="n">f</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="n">may</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">),</span><span class="w"> </span><span class="n">interval</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Let’s apply this to an example. The recovery rate from perturbations away from a particular fixed point $x^{\star}$ is given by $\lambda(x^{\star}, \gamma)$, and so a smaller absolute value of $\lambda$ results in a slower recovery. Take $\gamma = 0.18$ and $\gamma = 0.24$ as examples. For both values there are two stable fixed points; suppose that the system is at the larger one. These larger fixed points are given by $x^{\star} = 0.77$ and $x^{\star} = 0.63$, respectively, as the following computation shows:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">rbind</span><span class="p">(</span><span class="w">
</span><span class="c1"># unstable, stable, unstable, stable</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">get_fixedpoints</span><span class="p">(</span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.18</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">get_fixedpoints</span><span class="p">(</span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.24</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1] [,2] [,3] [,4]
## [1,] 0 0.10 0.13 0.77
## [2,] 0 0.05 0.32 0.63</code></pre></figure>
<p>We can plug these fixed points into the equation for $\lambda$, which gives us the respective rates with which these systems return to equilibrium. These are:</p>
\[\begin{aligned}
\lambda(x^{\star} = 0.77, \gamma = 0.18) &= -0.55 \\
\lambda(x^{\star} = 0.63, \gamma = 0.24) &= -0.28 \enspace ,
\end{aligned}\]
<p>which can easily be verified:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">rbind</span><span class="p">(</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">lambda</span><span class="p">(</span><span class="n">xstar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.77</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.18</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">lambda</span><span class="p">(</span><span class="n">xstar</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.63</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.24</span><span class="p">),</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] -0.55
## [2,] -0.28</code></pre></figure>
<p>Indeed, the system for which $\gamma = 0.24$ has $\lambda$ smaller in absolute value than the system for which $\gamma = 0.18$, and thus returns more slowly to equilibrium after an external perturbation.</p>
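<p>As a minimal sketch of how the two steps fit together, the snippet below finds the fixed points for a given $\gamma$ and classifies each one by the sign of $\lambda$. It is self-contained and uses only base R, so it re-defines the right-hand side of the model under the hypothetical name <code>may_rhs</code> (its form is read off the derivative above) and swaps <code>uniroot.all</code> for a simple grid search followed by <code>uniroot</code>. The names <code>recovery_rate</code>, <code>find_fixedpoints</code>, and <code>classify_fixedpoints</code> are illustrative and not part of the original code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Right-hand side of the logistic equation with predation (assumed form)
may_rhs <- function(x, gamma) x * (1 - x) - gamma * x^2 / (0.01 + x^2)

# Derivative of may_rhs with respect to x, evaluated at a fixed point
recovery_rate <- function(xstar, gamma) {
  1 - 2 * xstar - gamma * 0.02 * xstar / (0.01 + xstar^2)^2
}

# Locate sign changes on a fine grid, then refine each one with uniroot
find_fixedpoints <- function(gamma, n_grid = 2000) {
  # x = 0 is always a fixed point, so the grid starts just above it
  grid <- seq(1e-6, 1, length.out = n_grid)
  vals <- may_rhs(grid, gamma)
  idx <- which(vals[-n_grid] * vals[-1] < 0)
  roots <- vapply(idx, function(i) {
    uniroot(may_rhs, lower = grid[i], upper = grid[i + 1],
            gamma = gamma, tol = 1e-10)$root
  }, numeric(1))
  c(0, roots)
}

# A fixed point is stable if lambda < 0 and unstable if lambda > 0
classify_fixedpoints <- function(gamma) {
  xstar <- find_fixedpoints(gamma)
  lambda <- recovery_rate(xstar, gamma)
  data.frame(xstar = xstar, lambda = lambda, stable = lambda < 0)
}

classify_fixedpoints(gamma = 0.18)</code></pre></figure>
<p>For $\gamma = 0.18$ this should reproduce the pattern noted in the comment above (unstable, stable, unstable, stable), with $\lambda > 0$ at the unstable points.</p>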
<p>The left panel in the figure below shows how $\lambda$ changes as a continuous function of $\gamma \in [0, 0.40]$. We see that $\lambda$ increases towards $\lambda = 0$ at the saddle-node bifurcation $\gamma = 0.26$ (when coming from the left) or $\gamma = 0.18$ (when coming from the right). The dashed gray lines indicate $\lambda > 0$, which is the case for unstable equilibrium points; in other words, perturbations do not decay but grow close to the unstable equilibrium point, and hence the system does not return to it.</p>
<p>The panel on the right illustrates the slower recovery rate. In particular, I simulate from these two systems and, at $\tau = 10$, halve their population size. The system with $\gamma = 0.18$ recovers more swiftly to its stable equilibrium than the system with $\gamma = 0.24$.</p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-19-1.png" title="plot of chunk unnamed-chunk-19" alt="plot of chunk unnamed-chunk-19" style="display: block; margin: auto;" /></p>
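<p>The slower recovery can also be checked numerically. The sketch below is not the code behind the figure. It integrates the deterministic model with Euler’s method, starts each system slightly below its larger stable fixed point (a displacement of $0.05$, an arbitrary choice that keeps both systems inside the basin of attraction), and records how long it takes until the displacement has shrunk by a factor of $e$. The names <code>may_rhs</code>, <code>upper_fixedpoint</code>, and <code>recovery_time</code> are illustrative assumptions.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Right-hand side of the logistic equation with predation (assumed form)
may_rhs <- function(x, gamma) x * (1 - x) - gamma * x^2 / (0.01 + x^2)

# Larger stable fixed point; for gamma = 0.18 and 0.24 it is the only root in (0.5, 1)
upper_fixedpoint <- function(gamma) {
  uniroot(may_rhs, lower = 0.5, upper = 1, gamma = gamma, tol = 1e-10)$root
}

# Time until a displacement below the fixed point has decayed to 1/e of its initial size
recovery_time <- function(gamma, displacement = 0.05, delta_t = 0.001, t_max = 50) {
  xstar <- upper_fixedpoint(gamma)
  x <- xstar - displacement
  n_steps <- round(t_max / delta_t)
  for (step in seq_len(n_steps)) {
    x <- x + delta_t * may_rhs(x, gamma)  # one Euler step
    if (abs(x - xstar) <= displacement / exp(1)) return(step * delta_t)
  }
  NA_real_  # did not recover within t_max
}

sapply(c("gamma = 0.18" = 0.18, "gamma = 0.24" = 0.24), recovery_time)</code></pre></figure>
<p>For a linear system the corresponding time would be $1 / |\lambda|$, that is, roughly $1.8$ and $3.6$ time units for the two systems. The simulated times need not match this exactly, because the displacement is not infinitesimally small, but the system with $\gamma = 0.24$ should take longer.</p>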
<p>The phenomenon of critical slowing down is the basis of widely used <em>early warning signals</em> such as autocorrelation and variance. Indeed, one can show that the autocorrelation and variance are given by $e^{\lambda}$ and $\frac{\sigma_{\varepsilon}^2}{1 - e^{2\lambda}}$, respectively, where $\sigma_{\varepsilon}^2$ is the noise variance (see e.g. the appendix in Dablander et al., <a href="https://psyarxiv.com/5wc28">2020</a>). Hence, these quantities will increase as the system approaches the bifurcation point, as the figure below illustrates.</p>
<p><img src="/assets/img/2020-12-17-Dynamical-Systems.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto;" /></p>
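<p>To get a feel for the two expressions for autocorrelation and variance, the sketch below evaluates them at the recovery rates computed earlier ($\lambda = -0.55$ and $\lambda = -0.28$) and compares them with estimates from a simulated first-order autoregressive process. It assumes observations that are one time unit apart and a noise variance of $1$. The function names <code>theoretical_ews</code> and <code>empirical_ews</code> are illustrative, and this is not the code behind the figure.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)

# Theoretical lag-1 autocorrelation and variance for a given recovery rate
theoretical_ews <- function(lambda, sigma = 1) {
  c(autocorrelation = exp(lambda), variance = sigma^2 / (1 - exp(2 * lambda)))
}

# Simulate x[t] = exp(lambda) * x[t - 1] + noise and estimate both quantities
empirical_ews <- function(lambda, sigma = 1, n = 1e5) {
  noise <- rnorm(n, mean = 0, sd = sigma)
  x <- as.numeric(stats::filter(noise, filter = exp(lambda), method = "recursive"))
  c(autocorrelation = cor(x[-1], x[-n]), variance = var(x))
}

round(rbind(
  theory_018    = theoretical_ews(-0.55),
  simulated_018 = empirical_ews(-0.55),
  theory_024    = theoretical_ews(-0.28),
  simulated_024 = empirical_ews(-0.28)
), 2)</code></pre></figure>
<p>Both quantities are larger for $\lambda = -0.28$ than for $\lambda = -0.55$, in line with the idea that they rise as $\lambda$ approaches zero.</p>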
<p>Early warning signals based on critical slowing down have seen a surge in attention in the last two decades, with prominent review articles in ecology and climate science (e.g., Scheffer et al., <a href="https://www.nature.com/articles/nature08227">2009</a>, <a href="https://science.sciencemag.org/content/338/6105/344.abstract">2012</a>; Lenton, <a href="https://www.nature.com/articles/nclimate1143">2011</a>). The idea of critical slowing down goes back much further, and was well-known to proponents of <em>catastrophe theory</em> (e.g., Zeeman, <a href="https://www.jstor.org/stable/24950329">1976</a>); indeed, critical slowing down is one of the so-called <em>catastrophe flags</em> (see van der Maas et al., <a href="https://journals.sagepub.com/doi/abs/10.1177/0049124103253773">2003</a>, for an overview and an application to attitudes). Wissel (<a href="https://pubmed.ncbi.nlm.nih.gov/28312117/">1984</a>) (re)discovered critical slowing down in simple systems and used it to predict the extinction of a population of rotifers. The wonderful experimental demonstrations by Drake & Griffen (<a href="https://www.nature.com/articles/nature09389/">2010</a>) and Dai et al. (<a href="https://science.sciencemag.org/content/336/6085/1175.abstract">2012</a>) are modern variations on that theme.</p>
<p>Critical transitions are notoriously hard to predict, and the potential of generic signals that warn us of such transitions is vast. Early warning signals based on critical slowing down are subject to a number of practical and theoretical limitations, however — for example, they can occur prior to transitions that are not critical, and they can fail to occur prior to critical transitions. For an overview and a discussion, see Dablander, Pichler, Cika, & Bacilieri (<a href="https://psyarxiv.com/5wc28">2020</a>).</p>
<h1 id="conclusion">Conclusion</h1>
<p>Dynamical systems theory is a powerful framework for modelling how systems change over time. In this blog post, we have looked at simple toy models to elucidate some core concepts. Intriguingly, we have seen that even a very simple model can exhibit intricate behaviour, such as multiple stable states and critical transitions. Yet most interesting real-world systems are much more complex, and care must be taken when translating intuitions from low-dimensional toy models into high-dimensional reality.</p>
<hr />
<p><em>I would like to thank <a href="https://twitter.com/abacilieri">Andrea Bacilieri</a>, <a href="https://twitter.com/jillderon">Jill de Ron</a>, <a href="https://twitter.com/jonashaslbeck">Jonas Haslbeck</a>, and <a href="https://twitter.com/Oisin_Ryan_">Oisín Ryan</a> for helpful comments on this blog post.</em></p>
<hr />
<h1 id="references">References</h1>
<ul>
<li>Abbott, K. C., Ji, F., Stieha, C. R., & Moore, C. M. (<a href="https://link.springer.com/article/10.1007/s12080-019-00441-x">2020</a>). Fast and slow advances toward a deeper integration of theory and empiricism. <em>Theoretical Ecology, 13</em>(1), 7-15.</li>
<li>Bacaër, N. (<a href="https://www.springer.com/gp/book/9780857291141">2011</a>). <em>A short history of mathematical population dynamics</em>. Springer Science & Business Media.</li>
<li>Bathiany, S., Scheffer, M., Van Nes, E. H., Williamson, M. S., & Lenton, T. M. (<a href="https://www.nature.com/articles/s41598-018-23377-4">2018</a>). Abrupt climate change in an oscillating world. <em>Scientific reports, 8</em>(1), 1-12.
<!-- - Carpenter, S. R., Folke, C., Scheffer, M., & Westley, F. R. ([2019](https://www.ecologyandsociety.org/vol24/iss1/art23/)). Dancing on the volcano: social exploration in times of discontent. *Ecology and Society, 24*(1). --></li>
<li>Dablander, F., Pichler, A., Cika, A., & Bacilieri, A. (<a href="https://psyarxiv.com/5wc28">2020</a>). Anticipating Critical Transitions in Psychological Systems using Early Warning Signals: Theoretical and Practical Considerations.</li>
<li>Dai, L., Vorselen, D., Korolev, K. S., & Gore, J. (<a href="https://science.sciencemag.org/content/336/6085/1175.abstract">2012</a>). Generic indicators for loss of resilience before a tipping point leading to population collapse. <em>Science, 336</em>(6085), 1175-1177.
<!-- - Dudney, J., & Suding, K. N. ([2020](https://www.nature.com/articles/s41559-020-1273-8)). The elusive search for tipping points. *Nature Ecology & Evolution, 4*(11), 1449-1450. --></li>
<li>Drake, J. M., & Griffen, B. D. (<a href="https://www.nature.com/articles/nature09389/">2010</a>). Early warning signals of extinction in deteriorating environments. <em>Nature, 467</em>(7314), 456-459.
<!-- - Duncan, J. P., Aubele-Futch, T., & McGrath, M. ([2019](https://epubs.siam.org/doi/abs/10.1137/18M121410X)). A fast-slow dynamical system model of addiction: Predicting relapse frequency. *SIAM Journal on Applied Dynamical Systems, 18*(2), 881-903. --></li>
<li>Hastings, A., Abbott, K. C., Cuddington, K., Francis, T., Gellner, G., Lai, Y. C., … & Zeeman, M. L. (<a href="https://science.sciencemag.org/content/361/6406/eaat6412.abstract">2018</a>). Transient phenomena in ecology. <em>Science, 361</em>(6406).
<!-- - Hillebrand, H., Donohue, I., Harpole, W. S., Hodapp, D., Kucera, M., Lewandowska, A. M., ... & Freund, J. A. ([2020](https://www.nature.com/articles/s41559-020-1256-9)). Thresholds for ecological responses to global change do not emerge from empirical data. *Nature Ecology & Evolution, 4*(11), 1502-1509. --></li>
<li>Hughes, T. P., Linares, C., Dakos, V., Van De Leemput, I. A., & Van Nes, E. H. (<a href="https://www.sciencedirect.com/science/article/pii/S0169534712002170">2013</a>). Living dangerously on borrowed time during slow, unrecognized regime shifts. <em>Trends in Ecology & Evolution, 28</em>(3), 149-155.</li>
<li>Kallis, G. (<a href="https://www.sup.org/books/title/?id=29999">2019</a>). <em>Limits: Why Malthus was wrong and why environmentalists should care.</em> Stanford University Press.</li>
<li>Kéfi, S., Holmgren, M., & Scheffer, M. (<a href="https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2435.12601">2016</a>). When can positive interactions cause alternative stable states in ecosystems? <em>Functional Ecology, 30</em>(1), 88-97.</li>
<li>Kuehn, C. (<a href="https://link.springer.com/article/10.1007/s00332-012-9158-x">2013</a>). A mathematical framework for critical transitions: normal forms, variance and applications. <em>Journal of Nonlinear Science, 23</em>(3), 457-510.</li>
<li>Lenton, T. M. (<a href="https://www.nature.com/articles/nclimate1143">2011</a>). Early warning of climate tipping points. <em>Nature Climate Change, 1</em>(4), 201-209.
<!-- - Lenton, T. M., Rockström, J., Gaffney, O., Rahmstorf, S., Richardson, K., Steffen, W., & Schellnhuber, H. J. ([2019](https://www.nature.com/articles/d41586-019-03595-0)). Climate tipping points—too risky to bet against. *Nature, 575*. -->
<!-- - Litzow, M. A., & Hunsicker, M. E. ([2016](https://esajournals.onlinelibrary.wiley.com/doi/full/10.1002/ecs2.1614)). Early warning signals, nonlinearity, and signs of hysteresis in real ecosystems. *Ecosphere, 7*(12), e01614. --></li>
<li>Ludwig, D., Jones, D. D., & Holling, C. S. (<a href="https://www.jstor.org/stable/3939">1978</a>). Qualitative analysis of insect outbreak systems: the spruce budworm and forest. <em>The Journal of Animal Ecology, 47</em>(1), 315-332.</li>
<li>May, R. M. (<a href="https://www.nature.com/articles/269471a0">1977</a>). Thresholds and breakpoints in ecosystems with a multiplicity of stable states. <em>Nature, 269</em>(5628), 471-477.
<!-- - Otto, I. M., Donges, J. F., Cremades, R., Bhowmik, A., Hewitt, R. J., Lucht, W., ... & Lenferna, A. ([2020](https://www.pnas.org/content/117/5/2354)). Social tipping dynamics for stabilizing Earth’s climate by 2050. *Proceedings of the National Academy of Sciences, 117*(5), 2354-2365. -->
<!-- - Petraitis, P. ([2013](https://global.oup.com/academic/product/multiple-stable-states-in-natural-ecosystems-9780199569342?cc=it&lang=en&)). *Multiple stable states in natural ecosystems*. Oxford University Press. --></li>
<li>Rodríguez-Sánchez, P. (<a href="https://research.wur.nl/en/publications/cycles-and-interactions-a-mathematician-among-biologists">2020</a>). <em>Cycles and interactions: A mathematician among biologists</em>. PhD Thesis.</li>
<li>Rodríguez-Sánchez, P., Van Nes, E. H., & Scheffer, M. (<a href="https://journals.plos.org/ploscompbiol/article?rev=2&id=10.1371/journal.pcbi.1007788">2020</a>). Climbing Escher’s stairs: A way to approximate stability landscapes in multidimensional systems. <em>PLoS Computational Biology, 16</em>(4), e1007788.</li>
<li>Scheffer, M., Bascompte, J., Brock, W. A., Brovkin, V., Carpenter, S. R., Dakos, V., … & Sugihara, G. (<a href="https://www.nature.com/articles/nature08227">2009</a>). Early-warning signals for critical transitions. <em>Nature, 461</em>(7260), 53-59.</li>
<li>Scheffer, M., Carpenter, S. R., Lenton, T. M., Bascompte, J., Brock, W., Dakos, V., … & Pascual, M. (<a href="https://science.sciencemag.org/content/338/6105/344.abstract">2012</a>). Anticipating critical transitions. <em>Science, 338</em>(6105), 344-348.</li>
<li>Strogatz, S. H. (<a href="http://www.stevenstrogatz.com/books/sync-the-emerging-science-of-spontaneous-order">2003</a>). <em>Sync: How Order Emerges from Chaos in the Universe, Nature, and Daily Life</em>. Hachette Books.</li>
<li>Wissel, C. (<a href="https://pubmed.ncbi.nlm.nih.gov/28312117/">1984</a>). A universal law of the characteristic return time near thresholds. <em>Oecologia, 65</em>(1), 101-107.</li>
<li>Van der Maas, H. L., Kolstein, R., & Van Der Pligt, J. (<a href="https://journals.sagepub.com/doi/abs/10.1177/0049124103253773">2003</a>). Sudden transitions in attitudes. <em>Sociological Methods & Research, 32</em>(2), 125-152.</li>
<li>Zeeman, E. C. (<a href="https://www.jstor.org/stable/24950329">1976</a>). Catastrophe theory. <em>Scientific American, 234</em>(4), 65-83.</li>
</ul>
<hr />
<h1 id="footnotes">Footnotes</h1>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Unless the system exhibits chaos and we cannot measure the system with perfect precision, but chaos should not concern us here. For a very gentle introduction to chaos and dynamical systems, I recommend <a href="https://www.complexityexplorer.org/courses/105-introduction-to-dynamical-systems-and-chaos">this course</a> from the Santa Fe Institute. If you have some math background, I recommend Strogatz’s <a href="https://www.youtube.com/watch?v=ycJEoqmQvwg&list=PLbN57C5Zdl6j_qJA-pARJnKsmROzPnO9V">recorded lectures</a> and his book. If you are interested in learning about <em>complex systems</em>, see this <a href="https://complexityexplained.github.io/">wonderful introduction</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>For a very insightful book on Malthus, his influence on economics and the environmental movement, and <em>limits</em> more generally, see Kallis (<a href="https://www.sup.org/books/title/?id=29999">2019</a>). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Of course, populations do not grow <em>continuously</em>, but rather through discrete birth and death events. For a population with many individuals, however, the assumption of continuity provides a good approximation because the spacing between births and deaths is so short. In 2016, for example, we had approximately <a href="https://en.wikipedia.org/wiki/Birth_rate">4.3 births</a> <em>per second</em> in the human population. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>The logistic equation can be solved analytically. For a derivation, see <a href="https://fabiandablander.com/r/Nonlinear-Infection.html#analytic-solution">here</a>, which uses $N$ instead of our $x$. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>This simple model does not incorporate the predator species explicitly, instead using parameters $\alpha$ and $\gamma$ to incorporate predation. Another classic article in ecology is Ludwig, Jones, & Holling (<a href="https://www.jstor.org/stable/3939">1978</a>), who use the same model but extend it to study sudden budworm outbreaks in forests. I highly recommend reading this article; there’s a lot in there. Abbott et al. (<a href="https://link.springer.com/article/10.1007/s12080-019-00441-x">2020</a>), who trace the impact of Ludwig et al. (<a href="https://www.jstor.org/stable/3939">1978</a>), is also an insightful read. Apparently, Alan Hastings suggested that the journal <em>Theoretical Ecology</em> print modern commentaries on classic papers. I think this would be a valuable idea for many other fields and journals to adopt! <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>For ease of illustration, I have added <em>additive</em> noise. However, this can lead to population values $x < 0$ or $x > 1$, which are physically impossible. Hence it would make more sense to add <em>multiplicative</em> noise, but it does not really matter for our purposes. Similarly, the proper way to write this down is in the form of <em>stochastic</em> differential equations, but that, too, does not really matter for our purposes. <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>The Greek symbol $\lambda$ is usually used for <em>eigenvalues</em> of matrices. What matrix? In the unidimensional case, $f'(x^\star)$ is in fact the $1 \times 1$ Jacobian matrix of the system evaluated at the fixed point. For this $1 \times 1$ matrix, the eigenvalue is the element of the matrix itself; hence I use the symbol $\lambda$. <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Estimating the risks of partying during a pandemic (2020-07-22T09:30:00+00:00, https://fabiandablander.com/r/Corona-Party)
<p><em>This blog post was originally published on July $22^{\text{nd}}$, but was updated on August $9^{\text{th}}$ to compare the risks of partying in Amsterdam, Barcelona, and London using the most recent coronavirus case numbers.</em></p>
<p>There is no doubt that, every now and then, one ought to celebrate life. This usually involves people coming together, talking, laughing, dancing, singing, shouting; simply put, it means throwing a party. With temperatures rising, summer offers all the more incentive to organize such a joyous event. Blinded by the light, it is easy to forget that we are, unfortunately, still in a pandemic. But should that really deter us?</p>
<p>Walking around central Amsterdam after sunset, it is easy to notice that not everybody holds back. Even if my Dutch were better, it would likely still be difficult to convince groups of twenty-somethings of their potential folly. Surely, they say, it is exceedingly unlikely that this little party of ours results in any virus transmission?</p>
<p>The government retorts by shifting perspective: while the chances of the virus spreading at any one party may indeed be small, this does not license throwing it. Otherwise many parties would mushroom, considerably increasing the chances of virus spread. Indeed, the government stresses, this is why such parties remain <em>illegal</em>.</p>
<p>But while <em>if-everybody-did-what-you-did</em> types of arguments score high with parents, they usually do not score high with their children. So instead, in this post, we ask the question from an individual’s perspective: what are the chances of getting the virus after attending this or that party? And what factors make this more or less likely?</p>
<p>As a disclaimer, I should say that I am not an epidemiologist (epidemiologists, by the way, are a <a href="https://www.nytimes.com/interactive/2020/06/08/upshot/when-epidemiologists-will-do-everyday-things-coronavirus.html">more cautious bunch</a> than I or the majority of my age group), and so my assessment of the evidence may not agree with expert opinion. With that out of the way, and without further ado, let’s dive in.</p>
<h1 id="risky-business">Risky business?</h1>
<p>To get us started, let’s define the <em>risk of a party</em> as the probability that somebody who is infected with the novel coronavirus and can spread it attends the gathering. The two major factors influencing this probability are the size of the party, that is, the number of people attending the gathering; and the prevalence of infectious people in the relevant population. As we will see, the latter quantity is difficult to estimate. The probability of actually getting infected by a person who has the coronavirus depends further on a number of factors; we will discuss those in a later section.</p>
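<p>One simple way to formalize this definition, assuming that the $n$ guests are independent draws from a population in which a proportion $p$ is infectious, is $1 - (1 - p)^n$. The sketch below implements this expression. The function name <code>party_risk</code> and the prevalence values are purely illustrative and are not estimates for any city.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Probability that at least one of n guests is infectious, assuming independent guests
party_risk <- function(n, prevalence) 1 - (1 - prevalence)^n

# Hypothetical prevalences of 0.1% and 1% for parties of 10, 50, and 200 people
sizes <- c(10, 50, 200)
risks <- outer(sizes, c(0.001, 0.01), party_risk)
dimnames(risks) <- list(paste0("n = ", sizes), c("p = 0.1%", "p = 1%"))
round(risks, 3)</code></pre></figure>
<p>The risk grows with both the size of the party and the prevalence, which is why both factors matter in what follows.</p>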
<p>Let’s compare the risk of partying across three wonderful European cities: Amsterdam, Barcelona, and London. From July $22^{\text{nd}}$ to August $4^{\text{th}}$, a total of $563$, $3301$, and $1101$ new infections were reported in these cities, respectively (see <a href="https://www.rivm.nl/en/novel-coronavirus-covid-19/current-information">here</a>, <a href="https://dadescovid.cat/diari?drop_es_residencia=2&tipus=regio&id_html=ambit_2&codi=13">here</a>, <a href="https://coronavirus.data.gov.uk/">here</a>, and the <em>Post Scriptum</em>). This results in a relative case count of $64.50$, $203.72$, and $12.54$ per $100,000$ inhabitants, respectively. While these are the numbers of <em>reported new infected</em> cases, they are not the numbers of <em>currently infectious</em> cases. How do we arrive at those?</p>
<h1 id="estimating-the-true-number-of-infectious-cases">Estimating the true number of infectious cases</h1>
<p>Upon infection, it usually takes a while until one can infect others, with <a href="https://theconversation.com/how-long-are-you-infectious-when-you-have-coronavirus-135295">estimates ranging</a> from $1$ - $3$ days before showing symptoms. The <em>incubation period</em> is the time it takes from getting infected to showing symptoms. It lasts about $5$ days on average, with the vast majority of people showing symptoms within $12$ days (Lauer et al., <a href="https://www.acpjournals.org/doi/10.7326/M20-0504">2020</a>). Yet about a third to a half of people can be infectious without showing any symptoms (Pollán et al. <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31483-5">2020</a>; He et al. <a href="https://www.nature.com/articles/s41591-020-0869-5">2020</a>). <a href="https://theconversation.com/how-long-are-you-infectious-when-you-have-coronavirus-135295">Estimates suggest</a> that one is infectious for about $8$ - $10$ days, but it can be longer.</p>
<p>These are complications, but we need to keep it simple. Currently, visitors from outside Europe must show a negative COVID-19 test or need to self-isolate for $14$ days upon arrival in most European countries (see <a href="https://www.austria.info/en/service-and-facts/coronavirus-information">Austria</a>, for an example). Let’s take these $14$ days for simplicity, and assume conservatively that this is the time one is infectious upon getting infected. Thus, we simply take the reported number of <em>new infected</em> cases in the last two weeks as the reported number of <em>currently infectious</em> cases.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>We have dealt with the first complication, but a second one immediately follows: how do we get from the <em>reported</em> number of infections to the <em>true</em> number of infections? One can estimate the true number of infections using models, or by empirically estimating the seroprevalence in the population, that is, the proportion of people who have developed antibodies.</p>
<p>Using the first approach, Flaxman et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>) estimate the total percentage of the population that has been infected (the <em>attack rate</em>) across $11$ European countries as of May $4^{\text{th}}$. The Netherlands was, unfortunately, not included in these estimates, and so we focus on Spain and the UK. For these countries the estimated true attack rates were $5.50\%$ and $5.10\%$, respectively. Given the population of these countries and the cumulative number of reported infections, we can compute the <em>reported</em> attack rate. Relating this to the estimate of the <em>true</em> attack rate gives us an indication of the extent to which the reported cases undercount the actual infections; the code below calculates this for Spain and the UK.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'COVID19'</span><span class="p">)</span><span class="w">
</span><span class="n">get_undercount</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">country</span><span class="p">,</span><span class="w"> </span><span class="n">attack_rate</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># Flaxman et al. (2020) estimate the attack rate as of 4th of May</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">covid19</span><span class="p">(</span><span class="n">country</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">country</span><span class="p">,</span><span class="w"> </span><span class="n">end</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'2020-05-04'</span><span class="p">,</span><span class="w"> </span><span class="n">verbose</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">id</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">population</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">head</span><span class="p">(</span><span class="n">population</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">total_cases</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">confirmed</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">mutate</span><span class="p">(</span><span class="w">
</span><span class="n">attack_rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">attack_rate</span><span class="p">,</span><span class="w">
</span><span class="n">reported_attack_rate</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">100</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">total_cases</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">population</span><span class="p">,</span><span class="w">
</span><span class="n">undercount_factor</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">attack_rate</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">reported_attack_rate</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">get_undercount</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="s1">'Spain'</span><span class="p">,</span><span class="w"> </span><span class="s1">'United Kingdom'</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">5.5</span><span class="p">,</span><span class="w"> </span><span class="m">5.10</span><span class="p">))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## # A tibble: 2 x 6
## id population total_cases attack_rate reported_attack_rate undercount_factor
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 ESP 46796540 218011 5.5 0.466 11.8
## 2 GBR 66460344 191843 5.1 0.289 17.7</code></pre></figure>
<p>The table above shows that cases were undercounted by a factor of about $12$ in Spain and $18$ in the UK. The Netherlands undercounted cases by a factor of about $10$ in April (Luc Coffeng, personal communication). The attack rate estimate for Spain is confirmed by a recent seroprevalence study, which finds a similarly low overall proportion of people who have developed antibodies (around $5\%$, with substantial geographical variability) in the period between April $27^{\text{th}}$ and May $11^{\text{th}}$ (Pollán et al. <a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31483-5">2020</a>). In another seroprevalence study, Havers et al. (<a href="https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2768834">2020</a>) find that between March $23^{\text{rd}}$ and May $12^{\text{th}}$, reported cases from several areas in the United States undercounted true infections by a factor between $6$ and $24$.</p>
<p>Currently, the pandemic is not as severe in Europe as it was back when the above studies were conducted. Most importantly, the testing capacity has been ramped up in most countries. For example, while the proportions of positive tests in the Netherlands and the UK were $9.40\%$ and $6.60\%$ on the $4^{\text{th}}$ of May (the end date used in the Flaxman et al. 2020 study), they currently are $1.70\%$ and $0.60\%$, using the most recent data from <a href="https://ourworldindata.org/coronavirus-testing">here</a> at the time of writing. Spain’s coronavirus cases peaked roughly two weeks earlier than those of the Netherlands and the UK, and so by May $4^{\text{th}}$ they had a positivity rate of $2.60\%$. By July $30^{\text{th}}$ — the date the most recent data is available at the time of writing — their positivity rate has nearly doubled, to $5\%$.</p>
<p>Thus, while the Netherlands and the UK seem to be tracking the epidemic much more closely, which in turn implies that the factor by which they undercount the true cases is likely lower than it was previously, Spain seems to be actually doing <em>worse</em>.</p>
<p>Cases <a href="https://ourworldindata.org/coronavirus/country/united-kingdom?country=NLD~ESP~GBR">are rising again</a>. Let’s assume therefore that the true number of <em>infectious</em> cases is $5$ times higher than the number of reported <em>infected</em> cases. For simplicity, we assume the same factor for all countries, although Spain is likely undercounting the number of true cases by a larger factor than both the Netherlands and the UK. We assume the estimated relative <em>true</em> number of <em>currently infectious</em> cases to therefore be $5 \times 64.50 = 322.50$, $5 \times 203.72 = 1018.60$, and $5 \times 12.54 = 62.70$ per $100,000$ residents in Amsterdam, Barcelona, and London, respectively. This includes asymptomatic carriers or those that are pre-symptomatic but still can spread the virus. We will assess how robust our results are against this particular correction factor later; in the next section, we estimate the risk of a party.</p>
<h1 id="estimating-the-risk-of-a-party">Estimating the risk of a party</h1>
<p>What are the chances that a person who attends your party has the coronavirus and is infectious? To calculate this, we assume that party guests form an independent random sample from the population. We will discuss the implications of this crude assumption later; but for now, it allows us to estimate the desired probability in a straightforward manner.</p>
<p>Take Amsterdam as an example. There were $64.50$ reported new cases per $100,000$ inhabitants between July $22^{\text{nd}}$ and August $4^{\text{th}}$. As discussed above, we take $5 \times 64.50 = 322.50$ to be the number of <em>true infectious cases</em> per $100,000$ inhabitants. Assuming that the probability of infection is the same for all citizens (more on this later), this results in $322.50 / 100,000 = 0.003225$, which gives a $0.3225\%$ or $1$ in $310$ chance that a <em>single</em> party guest has the virus and can spread it.</p>
<p>A party with just one guest would be — <em>intimate</em>. So let’s invite a few others. What are the chances that <em>at least one</em> of them can spread the virus? We compute this by first computing the complement, that is, the probability that <em>no</em> party guest is infectious.</p>
<p>The chance that any one person from Amsterdam is not infectious is $1 - 0.003225 = 0.996775$, or $99.68\%$. With our assumption of guests forming an independent random sample from the population, the probability that none of the $n$ guests can spread the virus is $0.996775^n$. The party risk, that is, the probability that at least one guest is infectious, is the complement of this: $1 - 0.996775^n$.</p>
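<p>As a minimal sketch of this calculation: if $p$ is the probability that a single guest is infectious, the probability that at least one of $n$ guests is infectious is $1 - (1 - p)^n$. The function name <code>party_risk</code> and its arguments are illustrative choices, not part of the original analysis code.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Probability (in percent) that at least one of n guests is infectious,
# assuming guests are an independent random sample from the population.
#   n               party size
#   cases_per_100k  reported new cases per 100,000 inhabitants (last two weeks)
#   factor          assumed correction factor for undercounting
party_risk <- function(n, cases_per_100k, factor = 5) {
  p <- factor * cases_per_100k / 100000  # chance that a single guest is infectious
  100 * (1 - (1 - p)^n)
}

# Amsterdam: 64.50 reported cases per 100,000 inhabitants
round(party_risk(c(10, 25, 50), cases_per_100k = 64.50), 2)
# 3.18 7.76 14.91</code></pre></figure>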
<p>In our simple calculations, the chances of at least one infectious guest showing up depend only on the size of the party and the number of true infectious cases. The figure below visualizes how these two factors interact to give the risk of a party (see Lachmann & Fox, <a href="https://www.santafe.edu/research/projects/transmission-sfi-insights-covid-19">2020</a>, for a similar analysis regarding school reopenings).</p>
<p><img src="/assets/img/2020-07-22-Corona-Party.Rmd/risk plot sensitivity-1.png" title="plot of chunk risk plot sensitivity" alt="plot of chunk risk plot sensitivity" style="display: block; margin: auto;" /></p>
<p>Let’s take a moment to unpack this figure. Each coloured line represents a combination of estimated true number of infectious cases and party size that yields the same party risk. For example, attending a party of size $20$ when the true number of infectious cases per $100,000$ inhabitants is $50$ yields a party risk of about $1\%$, but so would, roughly, attending a party of size $10$ when the true relative number of infectious cases is $100$. Thus, there is a trade-off between the size of the party and the true number of infectious cases.</p>
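<p>The trade-off can be made explicit by solving $1 - (1 - p)^n = r$ for the party size, which gives $n = \log(1 - r) / \log(1 - p)$. The sketch below uses a hypothetical helper, <code>party_size_for_risk</code>, to recover the two combinations mentioned above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Party size at which the party risk reaches `risk` (a proportion), given the
# true number of infectious cases per 100,000 inhabitants.
party_size_for_risk <- function(risk, infectious_per_100k) {
  p <- infectious_per_100k / 100000
  log(1 - risk) / log(1 - p)
}

# Party sizes that yield a 1% risk at 50 and at 100 infectious cases per 100,000
party_size_for_risk(0.01, c(50, 100))  # roughly 20 and 10</code></pre></figure>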
<p>You can get a quick overview of the risks of parties of different sizes for a fixed number of true infectious cases by checking when the gray solid lines <em>vertically</em> cross the coloured lines. Similarly, you can get a rough understanding for the risks of a party of fixed size for different numbers of true infectious cases by checking when the gray and coloured lines cross <em>horizontally</em>. The dotted vertical lines in the figure give our previous estimate of the true number of infectious cases for London, Barcelona, and Amsterdam.</p>
<p>What’s the risk of partying in those three cities? For gatherings of size $10$, $25$, and $50$, the probability that at least one guest arrives infectious is $0.63\%$, $1.56\%$, and $3.09\%$ for London. For Amsterdam, the risks are substantially higher, with $3.18\%$, $7.76\%$, and $14.91\%$. Barcelona performs worst, with staggering risks of $9.73\%$, $22.58\%$, and $40.07\%$. These numbers are sobering, and I want you to take a moment to let them sink in. We will discuss the assumptions we had to make in order to arrive at them in the next section.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup></p>
<p>Do you happen to live neither in London, Amsterdam, nor Barcelona? Regardless of your area of residence, the figure allows you to estimate the party risk; just look up the local number of new cases in the last two weeks, multiply by a correction factor (we used $5$), and — making the assumptions we have made so far — the plot above gives you the probability that at least one party guest will turn up infectious with the coronavirus. The assumptions we have made are very simplistic, and indeed, if you have a more elaborate way of estimating the number of currently infectious cases, then you can use that number combined with the figure to estimate the party risk.</p>
<p>Take Rome and Berlin, for example. From July $22^{\text{nd}}$ to August $4^{\text{th}}$, they had $4.82$ and $15.63$ cases per $100,000$ inhabitants, respectively (see the <em>Post Scriptum</em>). Making the same assumptions as with the other cities, and using a correction factor of $5$, the probability of having at least one infectious guest attending a party of size $25$ is $0.60\%$ for Rome and $1.93\%$ for Berlin. The table below gives the risk for parties of size $10$, $25$, $50$, and $100$ in the five European cities.</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Rome London Berlin Amsterdam Barcelona
## 10 0.24 0.63 0.78 3.18 9.73
## 25 0.60 1.57 1.93 7.76 22.58
## 50 1.20 3.12 3.83 14.91 40.07
## 100 2.38 6.14 7.52 27.60 64.08</code></pre></figure>
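<p>A table of this kind can be approximately recomputed from the per-$100,000$ figures quoted in the text. The sketch below is not the original code; because it uses the rounded rates from the text, individual entries may differ from the table above in the second decimal, and the London column in particular depends on when the data were retrieved.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># True infectious cases per 100,000 = correction factor (5) x reported cases
rates <- 5 * c(Rome = 4.82, London = 12.54, Berlin = 15.63,
               Amsterdam = 64.50, Barcelona = 203.72)
sizes <- c(10, 25, 50, 100)

# Rows are party sizes, columns are cities; entries are party risks in percent
risk_table <- 100 * (1 - outer(sizes, rates, function(n, r) (1 - r / 100000)^n))
dimnames(risk_table) <- list(as.character(sizes), names(rates))
round(risk_table, 2)</code></pre></figure>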
<p>While we have computed the party risk for a single party, this risk naturally increases when you attend multiple ones. Suppose you have been invited to parties of size $20$, $35$, and $50$ which will take place in the next month. Let’s for simplicity assume that all guests are different each time. Let’s further assume that the number of infectious cases stays constant over the next month. Together, these assumptions allow us to calculate the <em>total</em> party risk as the party risk of attending a single party of size $20 + 35 + 50 = 105$, which gives a considerable risk of $6.37\%$ for London, a whopping $28.76\%$ for Amsterdam, and a crippling $65.87\%$ for Barcelona. It seems that, in this case, fortune does not favour the bold.</p>
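<p>To see why the sizes can simply be added: under the stated assumptions, the probability that no guest is infectious at any of the parties is the product $(1 - p)^{20} (1 - p)^{35} (1 - p)^{50} = (1 - p)^{105}$. A short check for Amsterdam, with illustrative variable names:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">p <- 5 * 64.50 / 100000   # Amsterdam, correction factor of 5
sizes <- c(20, 35, 50)    # three parties with entirely different guests

# Probability that no guest is infectious at any of the three parties
no_infectious_guest <- prod((1 - p)^sizes)

total_risk   <- 100 * (1 - no_infectious_guest)
single_party <- 100 * (1 - (1 - p)^sum(sizes))  # one party of size 105

round(c(total_risk, single_party), 2)
# 28.76 28.76</code></pre></figure>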
<h2 id="assumptions">Assumptions</h2>
<p>The analysis above is a very rough <em>back-of-the-envelope</em> calculation. We have made a number of crucial assumptions to arrive at some numbers. That’s useful as a first approximation; now we have at least some intuition for the problem, and we can critically discuss the assumptions we made. Most importantly, do these assumptions lead to <em>overestimates</em> or <em>underestimates</em> of the party risk?</p>
<h3 id="independence">Independence</h3>
<p>First, and most critically, we have assumed that party guests are <em>randomly and independently</em> drawn from the population. It is this assumption that allowed us to compute the joint probability that none of the party guests have the virus by multiplying the probabilities of any individual being virus-free. If you have ever been to a party, you know that this is not true: instead, a considerable number of party guests usually know each other, and it is safe to say that they are similar on a range of socio-demographic variables such as age and occupation.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup></p>
<p>This means we are sampling not from the whole population, as our simple calculation assumes, but from some particular subpopulation that is well connected. Since the party guests likely share social circles or even households, the <em>effective</em> party size — in terms of being relevant for virus transmission — is smaller than the <em>actual</em> party size; this is because these individuals share the same risks. A party with $20$ married couples seems safer than a party with $40$ singles. This would suggest that we overestimate the risk of a party.</p>
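<p>A crude way to illustrate the effect of a smaller <em>effective</em> party size is to treat each couple as a single independent draw. This rests on the strong, purely illustrative assumption that partners are either both infectious or both not; reality lies somewhere between the two numbers computed below.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">p <- 5 * 64.50 / 100000   # Amsterdam, correction factor of 5

# 40 singles: 40 independent draws from the population
risk_singles <- 100 * (1 - (1 - p)^40)

# 20 couples, assuming partners always share their infection status:
# only 20 independent draws
risk_couples <- 100 * (1 - (1 - p)^20)

round(c(singles = risk_singles, couples = risk_couples), 2)  # roughly 12% versus 6%</code></pre></figure>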
<h3 id="uniform-infection-probability">Uniform infection probability</h3>
<p>At the same time, however, our calculations assume that the risk of getting the coronavirus is evenly spread across the population. We used this fact when estimating the probability that any one person has the coronavirus as the total number of cases divided by the population size.</p>
<p>The probability of infection is not, however, evenly distributed. For example, Pollán et al. (<a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31483-5">2020</a>) report a seroprevalence of people aged $65$ or more of about $6\%$, while people aged between $20$ and $34$ showed a seroprevalence of $4.4\%$ between April $27^{\text{th}}$ and May $11^{\text{th}}$. These days, however, there is a substantial rise in young people who become infected, in the <a href="https://www.nytimes.com/2020/06/25/us/coronavirus-cases-young-people.html">United States</a> and likely also in Europe. Because young people are <a href="https://www.axios.com/coronavirus-young-people-spread-5a0cd9e0-1b25-4c42-9ef9-da9d9ebce367.html">less likely to develop symptoms</a>, the virus can spread largely undetected.</p>
<p>Moreover, it seems to me that people who would join a party are in general more <em>adventurous</em>. This would increase the chances of an infectious person attending a party; thus, our calculation above may in fact underestimate the party risk.</p>
<p>At the same time, one would hope that people who show symptoms avoid parties. If all guests do so, then only pre-symptomatic or asymptomatic spread can occur, which would reduce the party risk by between a half and two thirds. On the flip side, people who show symptoms might get tested for COVID-19 and, upon receiving a negative test, consider it safe to attend a party. This might be foolish, however; recent estimates suggest that tests miss about $20\%$ of infections for people who show symptoms (Kucirka et al., <a href="https://www.acpjournals.org/doi/10.7326/M20-1495">2020</a>; see also <a href="https://www.advisory.com/daily-briefing/2020/07/06/negative-covid">here</a>). For people without symptoms, the test performs <a href="https://www.theatlantic.com/science/archive/2020/06/how-negative-covid-19-test-can-mislead/613246/">even worse</a>.</p>
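<p>The size of this reduction can be sketched as follows, assuming that between a third and a half of infectious people show no symptoms at the time of the party (the range used elsewhere in this post) and that everybody with symptoms stays home. The variable names are illustrative.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">p <- 5 * 64.50 / 100000   # Amsterdam, correction factor of 5
n <- 25                   # party size

# Assumed share of infectious people who show no symptoms
share_without_symptoms <- c(1/3, 1/2)

risk_all         <- 1 - (1 - p)^n
risk_no_symptoms <- 1 - (1 - share_without_symptoms * p)^n

# Relative reduction in party risk when all symptomatic guests stay home
round(1 - risk_no_symptoms / risk_all, 2)  # roughly two thirds and a half</code></pre></figure>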
<p>For parties taking place in summer, it is not unlikely that many guests engaged in holiday travel in the days or weeks before the date of the party. Since travel increases the chances of infection, this would further increase the chances that at least one party guest has contracted the coronavirus.</p>
<h3 id="estimating-true-infections">Estimating true infections</h3>
<p>We have assumed that the <em>reported</em> number of <em>new infected</em> cases in the last two weeks equals the number of <em>currently infectious</em> cases. This is certainly an approximation. Ideally, we would have a geographically explicit model which, at any point in time and space, provides an estimate of the number of infectious cases. To my knowledge, we are currently lacking such a model.</p>
<p>Note that, if the people who tested positive all self-isolate or, worse, end up in hospital, this clearly curbs virus spread compared to a scenario in which they continue to roam around. The former seems more likely. Moreover, these reported cases are likely not independent either, with outbreaks usually being localized. Similar to the fact that party guests know each other, the fact that reported cases cluster would lead us to overestimate the extent of virus spread at a party.</p>
<p>At the same time, in the Netherlands, for example, only those that show symptoms can get tested. Since about a third to a half are asymptomatic or pre-symptomatic in the sense that they spread the virus considerably before symptom onset, the reported number of cases likely gives an undercount of infectious people.</p>
<p>All these complications can be summarized, roughly, in the correction factor, which gives the extent to which we believe that the reported number of infected cases undercounts the true number of currently infectious cases. We have focused on a factor of $5$, but the figure above allows you to assess the sensitivity of the results to this particular choice.</p>
<p>For a very optimistic factor of $1$ — this means that we do not undercount the true cases — a party of size $30$ results in a risk of $0.37\%$ for London, $1.92\%$ for Amsterdam, and $5.93\%$ for Barcelona. A factor of $5$ results in risks of $1.82\%$, $9.24\%$, and $26.45\%$, respectively. A conservative estimate, using a factor of $10$, results in risks of $3.61\%$, $17.65\%$, and $46.07\%$. You can play around with these numbers yourself. Observe how they make you feel. Personally, given what we said above about the infection probability for young and adventurous people, I am inclined to err on the side of caution.</p>
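<p>To play around with the correction factor yourself, a sketch along the following lines suffices. It uses the reported rates for Amsterdam and Barcelona quoted in the text and a party of size $30$; since these inputs are rounded, the output may deviate from the numbers above in the last digit.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">reported <- c(Amsterdam = 64.50, Barcelona = 203.72)  # per 100,000, from the text
factors  <- c(1, 5, 10)                               # correction factors to compare
n <- 30                                               # party size

# Rows are correction factors, columns are cities; entries are risks in percent
sensitivity <- sapply(reported, function(r) 100 * (1 - (1 - factors * r / 100000)^n))
rownames(sensitivity) <- paste("factor", factors)
round(sensitivity, 2)</code></pre></figure>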
<h2 id="estimating-the-probability-of-infection">Estimating the probability of infection</h2>
<p>We have defined the party risk as the probability that at least one party guest has the coronavirus and is infectious. If this person does not spread the virus to other guests, no harm is done.</p>
<p>This is exceedingly unlikely, however. The probability of getting infected is a function of the time one is exposed to the virus, and the amount of virus one is exposed to. <a href="https://www.erinbromage.com/post/the-risks-know-them-avoid-them">Estimates suggest</a> that about $1,000$ SARS-CoV-2 infectious virus particles suffice for an infection. With breathing, about $20$ viral particles diffuse into the environment per minute; this increases to $200$ for speaking; coughing or sneezing can release $200,000,000$ (!) virus particles. These do not all fall to the ground, but instead can remain suspended in the air and fill the whole room; thus, physical distancing alone might <a href="https://www.nytimes.com/2020/07/06/health/coronavirus-airborne-aerosols.html">not be enough indoors</a> (the extent of airborne transmission remains debated, however; see for example Klompas, Baker, & Rhee, <a href="https://jamanetwork.com/journals/jama/fullarticle/2768396">2020</a>). It seems reasonable to assume that, when party guests are crowded in a room for a number of hours, many of them stand a good chance of getting infected if at least one guest is infectious. <a href="https://www.erinbromage.com/post/what-s-the-deal-with-masks">Masks would help</a>, of course; but how would I sip my Negroni, wearing one?</p>
<p>It is different outdoors. A Japanese study found that virus transmission inside was about $19$ times more likely than outside (Nishiura et al. <a href="https://www.medrxiv.org/content/10.1101/2020.02.28.20029272v2">2020</a>). Analyzing $318$ outbreaks in China between January $4^{\text{th}}$ and February $11^{\text{th}}$, Qian et al. (<a href="https://www.medrxiv.org/content/10.1101/2020.04.04.20053058v1">2020</a>) found that only a single one occurred outdoors. This suggests that parties outdoors should be much safer than parties indoors. Yet outdoor parties feature elements unlike other outdoor events; for example, there are areas — such as bars or <a href="https://www.nytimes.com/2020/06/24/style/coronavirus-public-bathrooms.html">public toilets</a> — which could become spots for virus transmission. They usually attract more people, too. Our simple calculations suggest, with a correction factor of $5$, that the probability that at least one person out of $150$ has the coronavirus is a staggering $38.40\%$ in Amsterdam. While, in contrast to an indoor setting, the infected person is unlikely to infect the majority of the other guests, it seems likely that at least some guests will get the virus.</p>
<h1 id="to-party-or-not-to-party">To party or not to party?</h1>
<p>If I do not care whether I get wet or not, I will never carry an umbrella, regardless of the chances of rain. Similarly, my decision to throw (or attend) a party requires not only an estimate of how likely it is that the virus spreads at the gathering; it also requires an assessment of how much I actually care.</p>
<p>As argued above, it is almost certain that the virus spreads to other guests if one guest arrives infectious. Noting that all guests are young, one might be tempted to argue that the cost of virus spread is low. In fact, people who party might even be helping — <em>heroically</em> — to build <a href="https://fabiandablander.com/r/Covid-Exit.html">herd immunity</a>!</p>
<p>This reasoning is foolish on two grounds. First, while the proportion of infected people who die is very small for young people — Salje et al. (<a href="https://science.sciencemag.org/content/369/6500/208">2020</a>) estimate it to be $0.0045\%$ for people in their twenties and $0.015\%$ for people in their thirties — the picture about the non-lethal, long-term effects of the novel coronavirus is only slowly becoming clear. For some people, recovery can be <a href="https://www.theatlantic.com/health/archive/2020/06/covid-19-coronavirus-longterm-symptoms-months/612679/">lengthy</a> — much longer than the two weeks we previously believed it would take. Known as “mild” cases, they <a href="https://www.theguardian.com/commentisfree/2020/jul/06/coronavirus-covid-19-mild-symptoms-who">might not be so mild after all</a>. Moreover, the potential <a href="https://www.bbc.com/future/article/20200622-the-long-term-effects-of-covid-19-infection">strange neurological effects</a> of a coronavirus infection are becoming increasingly apparent. All told, party animals, even those guarded by their youth, might not shake it off so easily.</p>
<p>Suppose that, even after carefully considering the potential health dangers, one is still willing to take the chances. After all, it would be a <em>really</em> good party, and we young people usually eat our veggies — especially in Amsterdam. The trouble with infectious diseases, though, is that they travel: while you might be happy to take a chance, you and the majority of party guests will probably not self-isolate after the event, right? If infections occur at the party, the virus is thus likely to subsequently spread to other, more vulnerable parts of the population.</p>
<p>So while you might remain unharmed after attending a party, others might not. Take the story of <a href="https://www.erinbromage.com/post/the-risks-know-them-avoid-them">Bob from Chicago</a>, summarizing an actual infection chain reported by Ghinai et al. (<a href="https://www.cdc.gov/mmwr/volumes/69/wr/mm6915e1.htm">2020</a>):</p>
<blockquote><p>"Bob was infected but didn't know. Bob shared a takeout meal, served from common serving dishes, with $2$ family members. The dinner lasted $3$ hours. The next day, Bob attended a funeral, hugging family members and others in attendance to express condolences. Within $4$ days, both family members who shared the meal are sick. A third family member, who hugged Bob at the funeral became sick. But Bob wasn't done. Bob attended a birthday party with $9$ other people. They hugged and shared food at the $3$ hour party. Seven of those people became ill.</p>
<p>But Bob's transmission chain wasn’t done. Three of the people Bob infected at the birthday went to church, where they sang, passed the tithing dish etc. Members of that church became sick. In all, Bob was directly responsible for infecting $16$ people between the ages of $5$ and $86$. Three of those $16$ died."</p></blockquote>
<p>These events took place before many of the current corona measures were put in place, but the punchline remains: parties are a matter of public, not only individual, health. Don’t be like Bob.</p>
<hr />
<p>I want to thank <a href="https://dennyborsboom.com/">Denny Borsboom</a> and <a href="https://twitter.com/luc_coffeng">Luc Coffeng</a> for helpful discussions. I also want to thank <a href="https://www.inet.ox.ac.uk/people/andrea-bacilieri/">Andrea Bacilieri</a>, <a href="https://twitter.com/BorsboomDenny">Denny Borsboom</a>, <a href="https://www.facebook.com/nextgendoctors">Tom Dablander</a>, and <a href="https://twitter.com/CharlotteCTanis">Charlotte Tanis</a> for helpful comments on a previous version of this blog post.</p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="data">Data</h3>
<p>The code below gives the data used in the main text.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'httr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'COVID19'</span><span class="p">)</span><span class="w">
</span><span class="c1"># See https://coronavirus.data.gov.uk/developers-guide</span><span class="w">
</span><span class="n">endpoint</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="w">
</span><span class="s1">'https://api.coronavirus.data.gov.uk/v1/data?'</span><span class="p">,</span><span class="w">
</span><span class="s1">'filters=areaType=region;areaName=London&'</span><span class="p">,</span><span class="w">
</span><span class="s1">'structure={"date":"date","newCases":"newCasesBySpecimenDate"}'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">response</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">GET</span><span class="p">(</span><span class="n">url</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">endpoint</span><span class="p">,</span><span class="w"> </span><span class="n">timeout</span><span class="p">(</span><span class="m">10</span><span class="p">))</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">response</span><span class="o">$</span><span class="n">status_code</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="m">400</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">err_msg</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">http_status</span><span class="p">(</span><span class="n">response</span><span class="p">)</span><span class="w">
</span><span class="n">stop</span><span class="p">(</span><span class="n">err_msg</span><span class="o">$</span><span class="n">message</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Convert response from binary to JSON:</span><span class="w">
</span><span class="n">json_text</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">content</span><span class="p">(</span><span class="n">response</span><span class="p">,</span><span class="w"> </span><span class="s1">'text'</span><span class="p">)</span><span class="w">
</span><span class="n">data</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">jsonlite</span><span class="o">::</span><span class="n">fromJSON</span><span class="p">(</span><span class="n">json_text</span><span class="p">)</span><span class="o">$</span><span class="n">data</span><span class="w">
</span><span class="c1"># From 22nd July to 4th August</span><span class="w">
</span><span class="n">london_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="w">
</span><span class="n">date</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="s1">'2020-07-22'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="s1">'2020-08-04'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">london_total_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">london_dat</span><span class="o">$</span><span class="n">newCases</span><span class="p">)</span><span class="w">
</span><span class="n">london_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">london_total_cases</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">89.82000</span><span class="w"> </span><span class="c1"># per 100,000 inhabitants</span><span class="w">
</span><span class="c1"># From https://www.rivm.nl/en/novel-coronavirus-covid-19/current-information</span><span class="w">
</span><span class="n">amsterdam_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">64.5</span><span class="w">
</span><span class="n">amsterdam_total_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">563</span><span class="w">
</span><span class="c1"># https://dadescovid.cat/diari?drop_es_residencia=2&tipus=regio&id_html=ambit_2&codi=13</span><span class="w">
</span><span class="n">barcelona_total_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1502</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1745</span><span class="w">
</span><span class="n">barcelona_total_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="m">235</span><span class="p">,</span><span class="w"> </span><span class="m">263</span><span class="p">,</span><span class="w"> </span><span class="m">63</span><span class="p">,</span><span class="w"> </span><span class="m">82</span><span class="p">,</span><span class="w"> </span><span class="m">302</span><span class="p">,</span><span class="w"> </span><span class="m">327</span><span class="p">,</span><span class="w"> </span><span class="m">279</span><span class="p">,</span><span class="w"> </span><span class="m">279</span><span class="p">,</span><span class="w"> </span><span class="m">353</span><span class="p">,</span><span class="w"> </span><span class="m">71</span><span class="p">,</span><span class="w"> </span><span class="m">99</span><span class="p">,</span><span class="w"> </span><span class="m">314</span><span class="p">,</span><span class="w"> </span><span class="m">307</span><span class="p">,</span><span class="w"> </span><span class="m">327</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">barcelona_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">barcelona_total_cases</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">16.20343</span><span class="w">
</span><span class="c1"># COVID19 has data on Italian provinces</span><span class="w">
</span><span class="c1"># https://www.nytimes.com/interactive/2020/world/europe/italy-coronavirus-cases.html</span><span class="w">
</span><span class="n">italy_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">covid19</span><span class="p">(</span><span class="s1">'Italy'</span><span class="p">,</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="w">
</span><span class="c1"># Start on 21st July so that diff() yields daily cases from 22nd July to 4th August</span><span class="w">
</span><span class="n">date</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="s1">'2020-07-21'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="s1">'2020-08-04'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">rome_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">italy_dat</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">administrative_area_level_3</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'Roma'</span><span class="p">)</span><span class="w">
</span><span class="n">rome_total_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">rome_dat</span><span class="o">$</span><span class="n">confirmed</span><span class="p">))</span><span class="w">
</span><span class="n">rome_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="n">rome_total_cases</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">rome_dat</span><span class="o">$</span><span class="n">population</span><span class="p">[</span><span class="m">1</span><span class="p">])</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100000</span><span class="w">
</span><span class="c1"># COVID19 has data on German states</span><span class="w">
</span><span class="c1"># https://www.nytimes.com/interactive/2020/world/europe/germany-coronavirus-cases.html</span><span class="w">
</span><span class="n">germany_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">covid19</span><span class="p">(</span><span class="s1">'Germany'</span><span class="p">,</span><span class="w"> </span><span class="n">level</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">filter</span><span class="p">(</span><span class="w">
</span><span class="c1"># Start on 21st July so that diff() yields daily cases from 22nd July to 4th August</span><span class="w">
</span><span class="n">date</span><span class="w"> </span><span class="o">>=</span><span class="w"> </span><span class="s1">'2020-07-21'</span><span class="p">,</span><span class="w"> </span><span class="n">date</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="s1">'2020-08-04'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">berlin_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">germany_dat</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">filter</span><span class="p">(</span><span class="n">administrative_area_level_2</span><span class="w"> </span><span class="o">==</span><span class="w"> </span><span class="s1">'Berlin'</span><span class="p">)</span><span class="w">
</span><span class="n">berlin_total_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">diff</span><span class="p">(</span><span class="n">berlin_dat</span><span class="o">$</span><span class="n">confirmed</span><span class="p">))</span><span class="w">
</span><span class="n">berlin_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">berlin_total_cases</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">37.69495</span><span class="w">
</span><span class="nf">c</span><span class="p">(</span><span class="n">rome_cases</span><span class="p">,</span><span class="w"> </span><span class="n">london_cases</span><span class="p">,</span><span class="w"> </span><span class="n">berlin_cases</span><span class="p">,</span><span class="w"> </span><span class="n">amsterdam_cases</span><span class="p">,</span><span class="w"> </span><span class="n">barcelona_cases</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 4.821241 12.669784 15.625435 64.500000 203.722298</code></pre></figure>
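<p>As a rough sketch of the arithmetic behind these numbers: each rate is a 14-day case total divided by the population in units of 100,000. The helper <code>cases_per_100k</code> below is hypothetical and not part of the code above, and the population of 1,620,343 is simply the Barcelona divisor 16.20343 multiplied by 100,000, which is an assumption about what that constant represents.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper: new cases per 100,000 inhabitants
cases_per_100k <- function(total_cases, population) {
  total_cases / population * 100000
}

# The 14 daily counts for Barcelona listed above
barcelona_daily <- c(235, 263, 63, 82, 302, 327, 279, 279, 353, 71, 99, 314, 307, 327)

# Assumed population: the divisor 16.20343 used above, times 100,000
barcelona_rate <- cases_per_100k(sum(barcelona_daily), 1620343)
print(barcelona_rate)</code></pre></figure>
<p>The result should agree with the last entry of the output above, roughly 203.72 cases per 100,000 inhabitants.</p>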
<h3 id="figure">Figure</h3>
<p>The code below reproduces the figure in the main text.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'RColorBrewer'</span><span class="p">)</span><span class="w">
</span><span class="c1"># Probability that no guest is infectious</span><span class="w">
</span><span class="n">prob_virus_free</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">64.50</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">prob_virus</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">100000</span><span class="w">
</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">prob_virus</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Probability that at least one guest is infectious</span><span class="w">
</span><span class="n">party_risk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">64.50</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">prob_virus_free</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Calculates the party size that results in 'prob_virus_free'</span><span class="w">
</span><span class="c1"># for a given 'true_relative_cases'</span><span class="w">
</span><span class="n">get_party_size</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">prob_virus_free</span><span class="p">,</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">prob_virus_free</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">100000</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot_total_risk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">party_sizes</span><span class="p">,</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">true_relative_cases</span><span class="p">,</span><span class="w"> </span><span class="n">party_sizes</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">xaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">yaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Estimated True Number of Infectious Cases per 100,000 Inhabitants'</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Party Size'</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">ticks_x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1200</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">minor_ticks_x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1200</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">)</span><span class="w">
</span><span class="n">ticks_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">300</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">)</span><span class="w">
</span><span class="n">minor_ticks_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">300</span><span class="p">,</span><span class="w"> </span><span class="m">25</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ticks_x</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.1</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">at</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ticks_y</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.1</span><span class="p">)</span><span class="w">
</span><span class="n">rug</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">minor_ticks_x</span><span class="p">,</span><span class="w"> </span><span class="n">ticksize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.01</span><span class="p">,</span><span class="w"> </span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rug</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">minor_ticks_y</span><span class="p">,</span><span class="w"> </span><span class="n">ticksize</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">-0.01</span><span class="p">,</span><span class="w"> </span><span class="n">side</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">h</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">minor_ticks_y</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray86'</span><span class="p">)</span><span class="w">
</span><span class="n">abline</span><span class="p">(</span><span class="n">v</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">minor_ticks_x</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray86'</span><span class="p">)</span><span class="w">
</span><span class="n">probs_virus</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="m">0.99</span><span class="p">,</span><span class="w"> </span><span class="m">0.01</span><span class="p">)</span><span class="w">
</span><span class="n">party_sizes</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sapply</span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">probs_virus</span><span class="p">,</span><span class="w"> </span><span class="n">get_party_size</span><span class="p">,</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="p">)</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rev</span><span class="p">(</span><span class="n">heat.colors</span><span class="p">(</span><span class="m">50</span><span class="p">))</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">colorRampPalette</span><span class="p">(</span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">bias</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)(</span><span class="m">99</span><span class="p">)[</span><span class="m">-1</span><span class="p">]</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="m">95</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">))</span><span class="w">
</span><span class="n">show_text</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ix</span><span class="w">
</span><span class="n">diagonal</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="w">
</span><span class="s1">'x'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1200</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">),</span><span class="w">
</span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1200</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">300</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">3</span><span class="o">/</span><span class="m">12</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">ix</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">party_sizes</span><span class="p">[,</span><span class="w"> </span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">line</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="s1">'x'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="p">[</span><span class="m">-1</span><span class="p">],</span><span class="w"> </span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">[</span><span class="m">-1</span><span class="p">])</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">line</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">[</span><span class="n">i</span><span class="p">],</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">reconPlots</span><span class="o">::</span><span class="n">curve_intersect</span><span class="p">(</span><span class="n">line</span><span class="p">,</span><span class="w"> </span><span class="n">diagonal</span><span class="p">)</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="n">show_text</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="w">
</span><span class="n">j</span><span class="o">$</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">j</span><span class="o">$</span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">paste0</span><span class="p">(</span><span class="n">probs_virus</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="s1">'%'</span><span class="p">),</span><span class="w">
</span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">300</span><span class="p">)</span><span class="w">
</span><span class="n">true_relative_cases</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1200</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">ns</span><span class="p">))</span><span class="w">
</span><span class="n">plot_total_risk</span><span class="p">(</span><span class="w">
</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="n">true_relative_cases</span><span class="p">,</span><span class="w">
</span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Probability That at Least One Guest is Infectious'</span><span class="p">,</span><span class="w">
</span><span class="n">font.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.75</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.50</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">lwd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">london_cases</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">london_cases</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">300</span><span class="p">),</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="m">122</span><span class="p">,</span><span class="w"> </span><span class="m">160</span><span class="p">,</span><span class="w"> </span><span class="m">68</span><span class="p">,</span><span class="w"> </span><span class="m">130</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.10</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwd</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">135</span><span class="p">,</span><span class="w"> </span><span class="m">167</span><span class="p">,</span><span class="w"> </span><span class="s1">'London'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.50</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">amsterdam_cases</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">amsterdam_cases</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">300</span><span class="p">),</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="m">390</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">330</span><span class="p">,</span><span class="w"> </span><span class="m">170</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.10</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwd</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">412</span><span class="p">,</span><span class="w"> </span><span class="m">207</span><span class="p">,</span><span class="w"> </span><span class="s1">'Amsterdam'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.50</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">barcelona_cases</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">barcelona_cases</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">5</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">300</span><span class="p">),</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="m">950</span><span class="p">,</span><span class="w"> </span><span class="m">167</span><span class="p">,</span><span class="w"> </span><span class="m">1012</span><span class="p">,</span><span class="w"> </span><span class="m">130</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.10</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">lwd</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">946</span><span class="p">,</span><span class="w"> </span><span class="m">173</span><span class="p">,</span><span class="w"> </span><span class="s1">'Barcelona'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.50</span><span class="p">)</span></code></pre></figure>
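<p>The contour lines in the figure rest on one formula: if a fraction <em>p</em> of the population is infectious and guests are treated as independent draws, the probability that at least one of <em>n</em> guests is infectious is 1 - (1 - <em>p</em>)<sup><em>n</em></sup>, and solving for <em>n</em> gives the party size at which a given risk is reached. The sketch below restates this in compact, self-contained form. The function names ending in <code>_sketch</code> are hypothetical, and the rate of 322.5 per 100,000 is just the default <code>64.50 * 5</code> used above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Probability that at least one of n guests is infectious,
# assuming independent guests and a rate per 100,000 inhabitants
party_risk_sketch <- function(n, cases_per_100k) {
  1 - (1 - cases_per_100k / 100000)^n
}

# Inverse: party size at which the risk reaches 'risk'
party_size_sketch <- function(risk, cases_per_100k) {
  log(1 - risk) / log(1 - cases_per_100k / 100000)
}

rate <- 64.50 * 5                       # default rate used above
risk_50 <- party_risk_sketch(50, rate)  # risk for a party of 50
size_back <- party_size_sketch(risk_50, rate)
print(c(risk = risk_50, size = size_back))</code></pre></figure>
<p>Feeding the computed risk back into the inverse should return a party size of 50, which is a quick check that the two functions undo each other.</p>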
<h3 id="table">Table</h3>
<p>The code below reproduces the table in the main text.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">25</span><span class="p">,</span><span class="w"> </span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">)</span><span class="w">
</span><span class="n">rome_risk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">party_risk</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">rome_cases</span><span class="p">)</span><span class="w">
</span><span class="n">london_risk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">party_risk</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">london_cases</span><span class="p">)</span><span class="w">
</span><span class="n">berlin_risk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">party_risk</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">berlin_cases</span><span class="p">)</span><span class="w">
</span><span class="n">amsterdam_risk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">party_risk</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">amsterdam_cases</span><span class="p">)</span><span class="w">
</span><span class="n">barcelona_risk</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">party_risk</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">barcelona_cases</span><span class="p">)</span><span class="w">
</span><span class="n">tab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="w">
</span><span class="n">rome_risk</span><span class="p">,</span><span class="w"> </span><span class="n">london_risk</span><span class="p">,</span><span class="w"> </span><span class="n">berlin_risk</span><span class="p">,</span><span class="w"> </span><span class="n">amsterdam_risk</span><span class="p">,</span><span class="w"> </span><span class="n">barcelona_risk</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">tab</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'Rome'</span><span class="p">,</span><span class="w"> </span><span class="s1">'London'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Berlin'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Amsterdam'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Barcelona'</span><span class="p">)</span><span class="w">
</span><span class="n">rownames</span><span class="p">(</span><span class="n">tab</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">n</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">tab</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">##     Rome London Berlin Amsterdam Barcelona
## 10  0.24   0.63   0.78      3.18      9.73
## 25  0.60   1.57   1.93      7.76     22.58
## 50  1.20   3.12   3.83     14.91     40.07
## 100 2.38   6.14   7.52     27.60     64.08</code></pre></figure>
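<p>As a rough cross-check of the table, the sketch below recomputes the Amsterdam and Barcelona columns. The function <code>party_risk_sketch</code> is a hypothetical stand-in, not the <code>party_risk</code> function used above: it assumes that each guest is independently infectious with probability equal to the corrected cases per 100,000 divided by 100,000, and it uses the 14-day case counts the post reports for the two cities (64.50 and 203.72 per 100,000).</p>

```r
# Hypothetical re-implementation of the risk calculation, for illustration only.
# Assumption: each guest is independently infectious with probability
# cases_per_100k / 1e5, so the chance that at least one of n guests is
# infectious is 1 - (1 - p)^n.
party_risk_sketch <- function(n, cases_per_100k) {
  p <- cases_per_100k / 1e5
  1 - (1 - p)^n
}

n <- c(10, 25, 50, 100)

# Reported 14-day cases per 100,000 inhabitants, as given in the post
amsterdam_cases <- 64.50
barcelona_cases <- 203.72

# The factor 5 mirrors the multiplier used in the table code above
risk <- cbind(
  Amsterdam = party_risk_sketch(n, 5 * amsterdam_cases),
  Barcelona = party_risk_sketch(n, 5 * barcelona_cases)
)
rownames(risk) <- n

# Risk in percent, rounded as in the table
print(round(risk * 100, 2))
```

<p>Under these assumptions the two columns agree with the table to two decimals, which suggests (but does not prove) that the original function computes the same quantity.</p>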
<hr />
<h2 id="references">References</h2>
<ul>
<li>Flaxman, Mishra, Gandy et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. <em>Nature</em>, 3164.</li>
<li>Ghinai, I., Woods, S., Ritger, K. A., McPherson, T. D., Black, S. R., Sparrow, L., … & Arwady, M. A. (<a href="https://www.cdc.gov/mmwr/volumes/69/wr/mm6915e1.htm">2020</a>). Community Transmission of SARS-CoV-2 at Two Family Gatherings-Chicago, Illinois, February-March 2020. <em>MMWR. Morbidity and mortality weekly report, 69</em>(15), 446.</li>
<li>Havers, F. P., Reed, C., Lim, T. W., Montgomery, J. M., Klena, J. D., Hall, A. J., … & Krapiunaya, I. (<a href="https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2768834">2020</a>). Seroprevalence of Antibodies to SARS-CoV-2 in Six Sites in the United States, March 23-May 3, 2020. <em>JAMA Internal Medicine</em>.</li>
<li>He, X., Lau, E. H., Wu, P., Deng, X., Wang, J., Hao, X., … & Mo, X. (<a href="https://www.nature.com/articles/s41591-020-0869-5">2020</a>). Temporal dynamics in viral shedding and transmissibility of COVID-19. <em>Nature Medicine, 26</em>(5), 672-675.</li>
<li>Klompas, M., Baker, M. A., & Rhee, C. (<a href="https://jamanetwork.com/journals/jama/fullarticle/2768396">2020</a>). Airborne Transmission of SARS-CoV-2: Theoretical Considerations and Available Evidence. <em>JAMA</em>.</li>
<li>Kucirka, L. M., Lauer, S. A., Laeyendecker, O., Boon, D., & Lessler, J. (<a href="https://www.acpjournals.org/doi/full/10.7326/M20-1495">2020</a>). Variation in false-negative rate of reverse transcriptase polymerase chain reaction–based SARS-CoV-2 tests by time since exposure. <em>Annals of Internal Medicine</em>.</li>
<li>Lachmann, M., & Fox, S. (<a href="https://sfi-edu.s3.amazonaws.com/sfi-edu/production/uploads/ckeditor/2020/07/07/t-034-lachmann.pdf">2020</a>). When thinking about reopening schools, an important factor to consider is the rate of community transmission. <em>Santa Fe Institute Transmission</em>.</li>
<li>Lauer, S. A., Grantz, K. H., Bi, Q., Jones, F. K., Zheng, Q., Meredith, H. R., … & Lessler, J. (<a href="https://www.acpjournals.org/doi/10.7326/M20-0504">2020</a>). The incubation period of coronavirus disease 2019 (COVID-19) from publicly reported confirmed cases: estimation and application. <em>Annals of Internal Medicine, 172</em>(9), 577-582.</li>
<li>Morawska, L., & Cao, J. (<a href="https://www.sciencedirect.com/science/article/pii/S016041202031254X">2020</a>). Airborne transmission of SARS-CoV-2: The world should face the reality. <em>Environment International</em>, 105730.</li>
<li>Nishiura, H., Oshitani, H., Kobayashi, T., Saito, T., Sunagawa, T., Matsui, T., … & Suzuki, M. (<a href="https://www.medrxiv.org/content/10.1101/2020.02.28.20029272v2">2020</a>). Closed environments facilitate secondary transmission of coronavirus disease 2019 (COVID-19). <em>medRxiv</em>.</li>
<li>Pollán, M., Pérez-Gómez, B., Pastor-Barriuso, R., Oteo, J., Hernán, M. A., Pérez-Olmeda, M., … & Molina, M. (<a href="https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(20)31483-5/fulltext">2020</a>). Prevalence of SARS-CoV-2 in Spain (ENE-COVID): a nationwide, population-based seroepidemiological study. <em>The Lancet</em>.</li>
<li>Salje, H., Kiem, C. T., Lefrancq, N., Courtejoie, N., Bosetti, P., Paireau, J., … & Le Strat, Y. (<a href="https://science.sciencemag.org/content/369/6500/208">2020</a>). Estimating the burden of SARS-CoV-2 in France. <em>Science</em>.</li>
<li>Qian, H., Miao, T., Li, L. I. U., Zheng, X., Luo, D., & Li, Y. (<a href="https://www.medrxiv.org/content/10.1101/2020.04.04.20053058v1">2020</a>). Indoor transmission of SARS-CoV-2. <em>medRxiv</em>.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>Reported deaths are more reliable than reported cases because deaths must always be reported. This is why, for example, Flaxman et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>) use deaths to estimate the actual proportion of infections. There are issues with reported deaths, too, however, and I discuss some of them <a href="https://scienceversuscorona.com/visualising-the-covid-19-pandemic/">here</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Let me note that RIVM (the Dutch National Institute for Public Health and the Environment) has their own estimate of the number of currently infectious cases. On August $4^{\text{th}}$, their <a href="https://coronadashboard.rijksoverheid.nl/">dashboard</a> showed an estimate of $94.30$ infectious cases per $100,000$ inhabitants. This number is larger than $60.40$, the reported number of cases per $100,000$ in Amsterdam between July $22^{\text{nd}}$ and August $4^{\text{th}}$. In terms of our calculations, their model applies a correction factor of $94.30 / 60.40 = 1.56$. RIVM is therefore slightly more optimistic than I am; for parties of size $10$, $25$, and $50$, their estimates of the probability that at least one guest is infectious (assuming guests form a random sample from the population) are $0.94\%$, $2.33\%$, and $4.61\%$, respectively. How does RIVM arrive at their estimate of the number of infectious cases? We currently do not know. <a href="https://www.rivm.nl/documenten/wekelijkse-update-epidemiologische-situatie-covid-19-in-nederland">Their weekly report</a> (Section 9.1) devotes only two small paragraphs to it, saying that the method is “still under development”. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Once the pandemic is over, inviting a random sample from the population should definitely become a thing. Bursting bubbles, one party at a time! <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
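<p>The arithmetic in the second footnote can be retraced with a few lines of R. This is a minimal sketch rather than RIVM's method: it takes the two per-100,000 figures quoted in the footnote and assumes, as the footnote does, that guests form a random sample from the population, so that each guest is independently infectious with the same probability.</p>

```r
# Figures quoted in the footnote (cases per 100,000 inhabitants)
rivm_infectious <- 94.30  # RIVM estimate of currently infectious cases
reported_cases  <- 60.40  # reported cases in Amsterdam over two weeks

# Implied correction factor from reported to infectious cases
correction <- rivm_infectious / reported_cases
print(round(correction, 2))

# Assumption: guests are an independent random sample from the population,
# so P(at least one infectious guest) = 1 - (1 - p)^n
n <- c(10, 25, 50)
p <- rivm_infectious / 1e5
rivm_risk <- 1 - (1 - p)^n
names(rivm_risk) <- n

# Risk in percent for parties of size 10, 25, and 50
print(round(rivm_risk * 100, 2))
```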
</div>
Visualising the COVID-19 Pandemic2020-06-19T08:30:00+00:002020-06-19T08:30:00+00:00https://fabiandablander.com/r/Covid-Overview
<p><em>This blog post first appeared on the <a href="https://scienceversuscorona.com/visualising-the-covid-19-pandemic/">Science versus Corona blog</a>.
It introduces <a href="https://scienceversuscorona.shinyapps.io/covid-overview/">this Shiny app</a>.</em></p>
<p>The novel coronavirus has a firm grip on nearly all countries across the world, and there is large heterogeneity in how countries have responded to the threat.</p>
<p>Some countries, such as <a href="https://www.theguardian.com/world/2020/jun/05/brazil-coronavirus-covid-19-virus-doctor">Brazil</a> and the <a href="https://www.theguardian.com/us-news/2020/mar/28/trump-coronavirus-politics-us-health-disaster">United States</a>, have fared exceptionally poorly. Other countries, such as <a href="https://www.theatlantic.com/ideas/archive/2020/05/whats-south-koreas-secret/611215/">South Korea</a> and <a href="https://www.weforum.org/agenda/2020/05/how-germany-contained-the-coronavirus/">Germany</a>, have done exceptionally well. Many countries have faithfully executed lockdown measures, which have had an extraordinary preventive effect in saving lives (e.g., Flaxman et al., <a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>). While lockdowns have saved lives, they have had an extremely detrimental effect on rich countries such as the United Kingdom, whose <a href="https://www.bbc.com/news/business-53019360">GDP dropped by 20.4% in April</a> (see also Pichler et al., <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>), and the United States, where <a href="https://www.theguardian.com/business/2020/may/28/jobless-america-unemployment-coronavirus-in-figures">over 40 million people filed for unemployment</a>. Lockdowns have been even <a href="https://www.economist.com/international/2020/05/23/covid-19-is-undoing-years-of-progress-in-curbing-global-poverty">more devastating for developing countries</a>.</p>
<p>It is insightful to study how the virus swept across the world, and how countries have tried to fight it. But with about 8,100,000 confirmed cases, over 430,000 deaths, and many countries slowly reopening amid an accelerating pandemic, it is even more important to pay close attention now in order to learn from each other. Leading newspapers have produced many excellent overviews comparing confirmed cases, deaths, and the measures countries have taken to curb the spread of the virus.</p>
<h2 id="visualising-the-pandemic">Visualising the Pandemic</h2>
<p>The Financial Times has been an <a href="https://www.ft.com/content/a26fbf7e-48f8-11ea-aeb3-955839e06441">excellent resource</a> for information and visualisation from the start of the pandemic. Their visualisations show, for example, that while the epicenter of the pandemic was initially Europe, it has shifted toward Latin America, which now accounts for most deaths. They have also <a href="https://ig.ft.com/coronavirus-lockdowns/">produced a visualisation</a> of how countries are lifting lockdown measures using the Oxford Stringency Index, produced by the Oxford COVID-19 Government Response Tracker.</p>
<p>The <a href="https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker">Oxford COVID-19 Government Response Tracker</a> collects information on different policy responses governments across the world have taken. Currently, they are tracking 17 measures taken in over 160 countries. The Oxford Stringency Index is a composite score ranging from 0 to 100 which summarizes a number of measures a country has taken (or not taken). In particular, these measures concern (1) school closures, (2) workplace closures, (3) the cancelling of public events, (4) restrictions on gatherings, (5) the closing of public transport, (6) stay at home requirements, (7) restrictions on internal movement, (8) international travel controls, and (9) public information campaigns. These measures differ in their strength, and whether they are applied generally or are targeted; for details, see Hale et al. (<a href="https://www.bsg.ox.ac.uk/research/publications/variation-government-responses-covid-19">2020a</a>). The Oxford Response Tracker is updated frequently, and now also has a Government Response Index and a Containment and Health Index (Hale et al., <a href="https://www.bsg.ox.ac.uk/research/publications/variation-government-responses-covid-19">2020a</a>).</p>
<p>The New York Times has also started producing <a href="https://www.nytimes.com/interactive/2020/world/coronavirus-maps.html">beautiful visualisations</a> that summarize how the virus is ravaging different parts of the world. I especially like their world map, which shows not only the daily confirmed cases but also the 14-day smoothed trend. Possibly inspired by <a href="https://www.endcoronavirus.org/countries">endcoronavirus.org</a>, the site also gives an overview of where cases are increasing, roughly staying the same, or decreasing. They also provide a more detailed picture of specific countries, showing for example each state and even county of the <a href="https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html">United States</a>, or the <a href="https://www.nytimes.com/interactive/2020/world/asia/india-coronavirus-cases.html">states of India</a>.</p>
<p>Finally, ourworldindata.org has what I believe are the <a href="https://ourworldindata.org/coronavirus">most comprehensive COVID-19 visualisations</a>.</p>
<h2 id="another-visualisation">Another Visualisation</h2>
<p>Inspired by <a href="https://www.politico.eu/article/europes-coronavirus-lockdown-measures-compared/">this Politico piece</a>, <a href="https://nl.linkedin.com/in/ialmi">Alexandra Rusu</a>, <a href="https://www.sowi.uni-mannheim.de/en/meiser/team/research-staff/marcel-schreiner/">Marcel Schreiner</a>, <a href="https://www.atomasevic.com/">Aleksandar Tomašević</a>, and I — joining forces through <a href="https://scienceversuscorona.com/">Science versus Corona</a> — set out to work on our own visualisation before much of the excellent work by major newspapers was available. You can find it <a href="https://scienceversuscorona.shinyapps.io/covid-overview/">here</a>. We use the wonderful <a href="https://covid19datahub.io/">covid19 R package</a> as a data source.</p>
<p>Being written in R and Shiny, our app does not approach the beauty that comes with handcrafting JavaScript; yet it shows a few useful things that some of the above visualisations lack. First, it allows you to explore the evolution of individual measures — such as closing schools and international travel controls — countries have taken instead of reporting only a composite stringency index.</p>
<p>Second, our app visualises confirmed cases and confirmed deaths jointly with the stringency index in a single figure. This allows you to explore how they evolve together, and see whether deaths in countries that lift measures quickly rise soon thereafter or not. (You might find that imposing measures causes death, <a href="https://www.tylervigen.com/spurious-correlations">ha</a>!)</p>
<p>Third, our app includes a table that lists the individual measures countries are taking, and, if they have done so, when they have lifted them. Individual rows are coloured according to how close each country is to the WHO recommendations for rolling back lockdowns (see Hale et al., <a href="https://www.bsg.ox.ac.uk/research/publications/lockdown-rollback-checklist">2020b</a>). These WHO recommendations concern whether (1) virus transmission is controlled, (2) testing, tracing, and isolation are performed adequately, (3) outbreak risk in high-risk settings is minimized, (4) preventive measures are established in workplaces, (5) risk of exporting and importing cases from high-risk areas is managed, and (6) the public is engaged, understands that this is the ‘new normal’, and understands that it has a key role in preventing an increase in cases (see WHO, <a href="https://apps.who.int/iris/handle/10665/331773">2020</a>). Data concerning (4) and (5) are not in the Oxford database; we instead use the approach outlined in Hale et al. (<a href="https://www.bsg.ox.ac.uk/research/publications/lockdown-rollback-checklist">2020b</a>).</p>
<h2 id="caveats">Caveats</h2>
<p>Importantly, there are a number of caveats associated with interpreting the data we show in the app. First, the number of confirmed cases depends strongly on the number of tests a particular country conducts. Without knowing that, it is foolish to put much trust in comparisons of cases across countries. Hasell et al. (2020) provide a data set and a visualisation of <a href="https://ourworldindata.org/coronavirus-testing">coronavirus testing</a> per country, which is measured in number of tests per confirmed case or by one over that number (the so-called positivity rate). When the number of tests carried out per confirmed case is low, a country does too little testing to adequately monitor the outbreak — the true number of infections is likely much larger.</p>
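<p>To make the two testing measures concrete, here is a minimal sketch with made-up numbers. The data frame, its column names, and the two countries are hypothetical and serve only to show how tests per confirmed case and the positivity rate relate to each other.</p>

```r
# Hypothetical figures for two made-up countries, for illustration only
testing <- data.frame(
  country         = c("A", "B"),
  tests           = c(500000, 40000),
  confirmed_cases = c(10000, 8000)
)

# Tests carried out per confirmed case, and its reciprocal, the positivity rate
testing$tests_per_case  <- testing$tests / testing$confirmed_cases
testing$positivity_rate <- testing$confirmed_cases / testing$tests

print(testing)
```

<p>In this toy example, country B confirms more than half as many cases as country A with less than a tenth of the tests. Its low number of tests per case (equivalently, its high positivity rate) is the warning sign that many infections are probably going undetected.</p>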
<p>Another caveat concerns deaths. Confirmed deaths provide a clearer lens into how the pandemic unfolds, as every death in a country has to be reported. This is also why e.g. Flaxman et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>) model confirmed deaths rather than confirmed cases to assess the effect of interventions. However, using confirmed deaths to compare how successful countries are in dealing with the virus has limitations as well. Since deaths take at least a week or two to materialize, they are a window into the past, not the present; deaths are thus not a real-time indicator to decide whether to impose or lift measures.</p>
<p>There is also large variation in how deaths are reported, both across countries and over time. Some countries only count hospital deaths, for example, thus leading to an underestimate of deaths caused by COVID-19 at home. Or they include only deaths of patients that have tested positive for the virus. Authoritarian regimes might also downplay cases to look better. Moreover, due to delays in reporting, new deaths per day do not necessarily reflect the actual number of deaths that day.</p>
<p>Demographics also play an important role; some countries are much more densely populated, providing easier transmission routes for the virus. Others, such as countries in Africa, have a much younger population, making a severe disease progression less likely (e.g., Clark et al., <a href="https://bit.ly/3hS3vWy">2020</a>); with a healthcare system that is much less advanced compared to rich nations, however, Africa may well become the next epicenter of the pandemic (Loembé et al., <a href="https://www.nature.com/articles/s41591-020-0961-x">2020</a>). All these factors make <a href="https://www.bbc.com/news/52311014">international comparisons difficult</a>.</p>
<p>A different angle on COVID-19’s toll on human life is to calculate excess deaths by subtracting, say, the average number of deaths in the previous five years in a particular time period from the number of deaths during that time period now. Unlike for confirmed deaths, <a href="https://ourworldindata.org/excess-mortality-covid">numbers on excess deaths</a> are available only for a selected number of (mostly rich) countries, and there is no central data source. <a href="https://www.economist.com/graphic-detail/2020/04/16/tracking-covid-19-excess-deaths-across-countries">The Economist</a> was one of the first outlets to visualize excess deaths; the <a href="https://www.ft.com/content/a26fbf7e-48f8-11ea-aeb3-955839e06441">Financial Times</a> and the <a href="https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html">New York Times</a> provide visualisations of excess deaths, too.</p>
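<p>The excess death calculation described above amounts to a subtraction against a historical baseline. The sketch below uses invented weekly death counts; the variable names and numbers are hypothetical, and real analyses typically also adjust for trends and reporting delays.</p>

```r
# Hypothetical death counts for illustration only:
# the same calendar week in each of the five previous years
previous_years <- c(1010, 980, 1000, 1020, 990)

# Deaths observed in that week this year (also made up)
deaths_now <- 1400

# Baseline: average deaths in that week over the previous five years
baseline <- mean(previous_years)

# Excess deaths: observed deaths minus the baseline
excess <- deaths_now - baseline

# Excess deaths relative to the baseline, in percent
excess_pct <- 100 * excess / baseline

print(c(baseline = baseline, excess = excess, excess_pct = excess_pct))
```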
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, I have outlined a number of excellent visualisations of the COVID-19 pandemic, as well as introduced <a href="https://scienceversuscorona.shinyapps.io/covid-overview/">our own</a>. <a href="https://nl.linkedin.com/in/ialmi">Alexandra Rusu</a>, <a href="https://www.sowi.uni-mannheim.de/en/meiser/team/research-staff/marcel-schreiner/">Marcel Schreiner</a>, and <a href="https://www.atomasevic.com/">Aleksandar Tomašević</a> (with whom it was an absolute pleasure to work on this) and I are planning to develop the visualisation further, including things such as number of tests, excess deaths, and new Oxford indices, and we encourage anybody who is interested to contribute! All the code is available on <a href="https://github.com/fdabl/Covid-Overview">GitHub</a>.</p>
<hr />
<p>I want to thank Alexandra Rusu, Marcel Schreiner, and Aleksandar Tomašević for a very enjoyable collaboration.</p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Clark, Jit, Warren-Gash et al. (<a href="https://bit.ly/3hS3vWy">2020</a>). Global, regional, and national estimates of the population at increased risk of severe COVID-19 due to underlying health conditions in 2020: A modelling study. <em>The Lancet</em>.</li>
<li>Flaxman, Mishra, Gandy et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. <em>Nature</em>, 3164.</li>
<li>Loembé, M. M., Tshangela, A., Salyer, S. J., Varma, J. K., Ouma, A. E. O., & Nkengasong, J. N. (<a href="https://www.nature.com/articles/s41591-020-0961-x">2020</a>). COVID-19 in Africa: the spread and response. <em>Nature Medicine</em>, 1-4.</li>
<li>Pichler, A., Pangallo, M., del Rio-Chanona, R. M., Lafond, F., & Farmer, J. D. (<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>). Production networks and epidemic spreading: How to restart the UK economy?</li>
</ul>Fabian DablanderThis blog post first appeared on the Science versus Corona blog. It introduces this Shiny app. The novel coronavirus has a firm grip on nearly all countries across the world, and there is large heterogeneity in how countries have responded to the threat. Some countries, such as Brazil and the United States, have fared exceptionally poorly. Other countries, such as South Korea and Germany, have done exceptionally well. Many countries have faithfully executed lockdown measures, which have had an extraordinary preventive effect in saving lives (e.g., Flaxman et al., 2020). While lockdowns have saved lives, they have had an extremely detrimental effect on rich countries such as the United Kingdom, whose GDP dropped by 20.4% in April (see also Pichler et al., 2020), and the United States, where over 40 million people filed for unemployment. Lockdowns have been even more devastating for developing countries. It is insightful to study the past course of how the virus swept across the world, and how countries have tried to fight it. But with about 8,100,000 confirmed cases, over 430,000 deaths, and many countries slowly reopening amid an accelerating pandemic, it is even more important to pay close attention now in order to learn from each other. Many excellent overviews comparing confirmed cases, deaths, and measures to curb the spread of the virus taken across countries have been produced by leading newspapers. Visualising the Pandemic The Financial Times has been an excellent resource of information and visualisation from the start of the pandemic. Their visualisations show, for example, that while at the start the epicenter of the pandemic has been Europe, it has shifted toward Latin America, which now accounts for most deaths. They have also produced a visualisation of how countries are lifting lockdown measures using the Oxford Stringency Index, produced by the Oxford COVID-19 Government Response Tracker. 
The Oxford COVID-19 Government Response Tracker collects information on different policy responses governments across the world have taken. Currently, they are tracking 17 measures taken in over 160 countries. The Oxford Stringency Index is a composite score ranging from 0 to 100 which summarizes a number of measures a country has taken (or not taken). In particular, these measures concern (1) school closures, (2) workplace closures, (3) the cancelling of public events, (4) restrictions on gatherings, (5) the closing of public transport, (6) stay at home requirements, (7) restrictions on internal movement, (8) international travel controls, and (9) public information campaigns. These measures differ in their strength, and whether they are applied generally or are targeted; for details, see Hale et al. (2020a). The Oxford Response Tracker is updated frequently, and now also has a Government Response Index and a Containment and Health Index (Hale et al., 2020a). The New York Times also has started producing beautiful visualisations that summarize how the virus is ravaging different parts of the world. I especially like their world map, which not only shows the daily confirmed cases but also the 14-day smoothed trend. Possibly inspired by endcoronavirus.org, the site also gives an overview of where cases are increasing, roughly staying the same, or decreasing. They also provide a more detailed picture of specific countries, showing for example each state and even county of the United States, or the states of India. Finally, ourworldindata.org has what I believe are the most comprehensive COVID-19 visualisations. Another Visualisation Inspired by this Politico piece, Alexandra Rusu, Marcel Schreiner, Aleksandar Tomašević, and I — joining forces through Science versus Corona — set out to work on our own visualisation before much of the excellent work by major newspapers was available. You can find it here. We use the wonderful covid19 R package as a data source. 
Being written in R and Shiny, our app does not approach the beauty that comes with handcrafting JavaScript; yet it shows a few useful things that some of the above visualisations lack. First, it allows you to explore the evolution of individual measures — such as closing schools and international travel controls — countries have taken instead of reporting only a composite stringency index. Second, our app visualises confirmed cases and confirmed deaths jointly with the stringency index in a single figure. This allows you to explore how they evolve together, and see whether deaths in countries that lift measures quickly rise soon thereafter or not. (You might find that imposing measures causes death, ha!) Third, our app includes a table that lists the individual measures countries are taking, and, if they have done so, when they have lifted them. Individual rows are coloured according to how close each country is to the WHO recommendations for rolling back lockdowns (see Hale et al., 2020b). These WHO recommendations concern whether (1) virus transmission is controlled, (2) testing, tracing, and isolation is performed adequately, (3) outbreak risk in high-risk settings is minimized, (4) preventive measures are established in workplaces, (5) risk of exporting and importing cases from high-risk areas is managed, and (6) the public is engaged, understands that this is the ‘new normal’, and understand that they have a key role in preventing an increase in cases (see WHO, 2020). Data concerning (4) and (5) are not in the Oxford database; we instead use the approach outlined in Hale et al. (2020b). Caveats Importantly, there are a number of caveats associated with interpreting the data we show in the app. First, the number of confirmed cases depends strongly on the number of tests a particular country conducts. Without knowing that, it is foolish to put much trust in comparisons of cases across countries. Hasell et al. 
(2020) provide a data set and a visualisation of coronavirus testing per country, which is measured in number of tests per confirmed case or by one over that number (the so-called positivity rate). When the number of tests carried out per confirmed case is low, a country does too little testing to adequately monitor the outbreak; the true number of infections is likely much larger. Another caveat concerns deaths. Confirmed deaths provide a clearer lens into how the pandemic unfolds, as every death in a country has to be reported. This is also why e.g. Flaxman et al. (2020) model confirmed deaths rather than confirmed cases to assess the effect of interventions. However, using confirmed deaths to compare how successful countries are in dealing with the virus has limitations as well. Since deaths take at least a week or two to materialize, they are a window into the past, not the present; deaths are thus not a real-time indicator to decide whether to impose or lift measures. There is also large variation in how deaths are reported, both across countries and over time. Some countries only count hospital deaths, for example, thus leading to an underestimate of deaths caused by COVID-19 at home. Or they include only deaths of patients who have tested positive for the virus. Authoritarian regimes might also downplay cases to look better. Moreover, due to delays in reporting, new deaths per day do not necessarily reflect the actual number of deaths that day. Demographics also play an important role; some countries are much more densely populated, providing easier transmission routes for the virus. Others, such as countries in Africa, have a much younger population, making a severe disease progression less likely (e.g., Clark et al., 2020); with a healthcare system that is much less advanced compared to rich nations, however, Africa may well become the next epicenter of the pandemic (Loembé et al., 2020). All these factors make international comparisons difficult. 
A different angle on COVID-19’s toll on human life is to calculate excess deaths by subtracting, say, the average number of deaths in the previous five years in a particular time period from the number of deaths during that time period now. Unlike for confirmed deaths, numbers on excess deaths are available only for a selected number of (mostly rich) countries, and there is no central data source. The Economist was one of the first outlets to visualize excess deaths; the Financial Times and the New York Times provide visualisations of excess deaths, too. Conclusion In this blog post, I have outlined a number of excellent visualisations of the COVID-19 pandemic, as well as introduced our own. Alexandra Rusu, Marcel Schreiner, and Aleksandar Tomašević, with whom it was an absolute pleasure to work on this, and I are planning to develop the visualisation further, including things such as number of tests, excess deaths, new Oxford indices, etc., and we encourage anybody who is interested to contribute! All the code is available on Github. I want to thank Alexandra Rusu, Marcel Schreiner, and Aleksandar Tomašević for a very enjoyable collaboration. References Clark, Jit, Warren-Gash et al. (2020). Global, regional, and national estimates of the population at increased risk of severe COVID-19 due to underlying health conditions in 2020: A modelling study. The Lancet. Flaxman, Mishra, Gandy et al. (2020). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. Nature, 3164. Loembé, M. M., Tshangela, A., Salyer, S. J., Varma, J. K., Ouma, A. E. O., & Nkengasong, J. N. (2020). COVID-19 in Africa: the spread and response. Nature Medicine, 1-4. Pichler, A., Pangallo, M., del Rio-Chanona, R. M., Lafond, F., & Farmer, J. D. (2020). 
Production networks and epidemic spreading: How to restart the UK economy?Interactive exploration of COVID-19 exit strategies2020-06-11T10:30:00+00:002020-06-11T10:30:00+00:00https://fabiandablander.com/r/Covid-Exit<p>The COVID-19 pandemic will end only when a sufficient number of people have become immune, thus preventing future outbreaks. Principally, so-called <em>exit strategies</em> differ on whether immunity is achieved through natural infections, or whether it is achieved through a vaccine. Countries across the world are scrambling to find an adequate exit strategy, with <a href="https://www.endcoronavirus.org/countries">differential success</a>.</p>
<p>To model different exit strategies from an epidemiological standpoint, de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>) developed a stochastic individual-based SEIR model which allows for inter-individual differences in how effectively individuals spread the virus and how well individuals adhere to measures designed to curb virus transmission. The model also allows for preferential mixing of individuals with similar contact rates. A key innovation of the model is that it stratifies the population into communities and regions within which transmission mainly occurs. Their paper is excellent and insightful, and I encourage you to read it.</p>
<p>To make the underlying model more easily accessible, <a href="https://twitter.com/luc_coffeng">Luc Coffeng</a> and I have developed a <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> that allows you to explore these exit strategies interactively. In this blog post, I provide a brief overview of the Shiny app and ideas about possible model extensions. Note that I am not an epidemiologist, and my aim here is not to endorse different exit strategies nor to make policy recommendations.</p>
<p>This work was carried out under the umbrella of <a href="https://scienceversuscorona.com/">Science versus Corona</a>, an initiative I founded together with <a href="https://dennyborsboom.com/">Denny Borsboom</a>, <a href="https://twitter.com/tfblanken">Tessa Blanken</a>, and <a href="https://twitter.com/CharlotteCTanis">Charlotte Tanis</a>.</p>
<h1 id="modeling-exit-strategies">Modeling exit strategies</h1>
<p>The two figures below illustrate particular parameterizations of five different exit strategies: Radical Opening, Phased Lift of Control, Intermittent Lockdown, Flattening the Curve, and Contact Tracing. The first four strategies aim for a (controlled) build-up of herd immunity through natural infection, while Contact Tracing aims to minimize cases until a vaccine is available.</p>
<p>Since the model is stochastic, that is, events in the simulations occur randomly according to pre-defined probabilities, the black solid lines in the figures below show a number of possible trajectories. Note that the dashed vertical lines below indicate interventions, with lines before day 0 indicating interventions specific to the Netherlands during the initial lockdown, and lines from day 0 onwards interventions that are specific to the exit strategies.</p>
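<p>To get a feel for why repeated runs of a stochastic model trace out different trajectories, here is a minimal sketch of a population-level, chain-binomial SEIR simulation. This is not the individual-based model of de Vlas & Coffeng (2020): it has no clusters, no heterogeneity in contact rates, and no interventions. The function names and all parameter values (<code>beta</code>, <code>sigma</code>, <code>gamma</code>, the population size) are assumptions chosen purely for illustration.</p>

```python
import math
import random


def binomial(rng, n, p):
    """Draw from Binomial(n, p) by summing n Bernoulli trials."""
    return sum(rng.random() < p for _ in range(n))


def simulate_seir(n=1000, e0=5, beta=0.4, sigma=0.2, gamma=0.1, days=200, seed=None):
    """Simulate one trajectory of a chain-binomial SEIR model in daily steps.

    All parameter values are hypothetical and only serve as an illustration.
    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    rng = random.Random(seed)
    s, e, i, r = n - e0, e0, 0, 0
    path = [(s, e, i, r)]
    for _ in range(days):
        # Convert daily rates into per-individual daily transition probabilities.
        p_infection = 1 - math.exp(-beta * i / n)
        p_onset = 1 - math.exp(-sigma)
        p_recovery = 1 - math.exp(-gamma)
        # Each transition is a random draw, so every run differs.
        new_exposed = binomial(rng, s, p_infection)
        new_infectious = binomial(rng, e, p_onset)
        new_recovered = binomial(rng, i, p_recovery)
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        path.append((s, e, i, r))
    return path


# Five runs with different seeds give five different epidemic peaks.
runs = [simulate_seir(seed=k) for k in range(5)]
print([max(state[2] for state in run) for run in runs])
```

<p>Plotting the infectious compartment of each run on top of each other gives a bundle of lines in the spirit of the black trajectories in the figures, albeit for a much simpler model.</p>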
<!-- Each figure comprises four panels, showing --- per million --- (a) the number of infectious cases, (b) the number of new cases in intensive care per day, (c) the number of cases present in intensive care (IC), and (d) the proportion of people having recovered. The horizontal dashed line in (a) and (c) indicate the intensive care capacity for the Netherlands; the horizontal line in (d) indicates the herd immunity threshold. The red dots show intensive care data for the Netherlands. Since the model is stochastic, repeated simulations follow slightly different trajectories, as indicated by the black solid lines. -->
<div style="text-align:center;">
<img src="../assets/img/exit-strategies-I.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="850" height="950" />
</div>
<p>Radical Opening lifts all measures at once on day 0, resulting in a huge increase in the number of infections per million, as the top panel shows. The dashed horizontal line indicates the number of infections at which the intensive care capacity is reached in the Netherlands, which is 6000 infections per million inhabitants. The second panel shows the simulated number of new cases in intensive care per day, with the red dots showing the actual number of cases in intensive care in the Netherlands. The third panel shows the number of cases that are present in intensive care per million; the dashed horizontal line indicates the number of beds per million (115) that are available for COVID-19 cases in the Netherlands. Radical Opening massively overshoots this capacity, which would result in a large number of excess deaths. The bottom panel shows that herd immunity is reached quickly, yet <a href="https://www.nytimes.com/2020/05/01/opinion/sunday/coronavirus-herd-immunity.html">overshoots</a>.</p>
<p>Phased Lift of Control, as proposed by de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>), splits a country into geographical units and lifts the measures in one unit at a time; the time points at which measures are lifted are indicated by the vertical dotted lines. Phased Lift of Control as presented here does not overburden the healthcare system and thus leads to no excess deaths, in contrast to Radical Opening (note the difference in the $y$-axis). However, the strategy still aims at achieving herd immunity naturally, and so depending on who exactly gets infected, there will be deaths proportional to the case fatality ratio of that subpopulation. Phased Lift of Control allows a natural epidemic within the region where measures are being lifted, and so it overshoots herd immunity regionally and therefore nationally as well, as seen in the bottom panel. As a side note, overshoot does not occur when 25% of the participants “remain in hiding” when control measures are lifted (Luc Coffeng, personal communication), which strikes me as a realistic scenario; overall, Phased Lift of Control is robust to this non-participation (see <a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2.supplementary-material">Supplementary 3</a> in de Vlas & Coffeng, 2020).</p>
<p>The intention of Intermittent Lockdown is to reinstate lockdown measures just before intensive care units are at full capacity. Compared to Phased Lift of Control, the Intermittent Lockdown exit strategy does not use the intensive care capacity efficiently, as some intensive care beds remain unused during periods of lockdown (see days 200 - 600). Moreover, the strategy comes with a high risk of overshooting intensive care capacity (see days 0 - 200 and days 600 - 750).</p>
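<p>The switching logic behind such a strategy can be sketched with a simple threshold rule on top of a deterministic SEIR model in daily steps. This is only an illustration and not how the Shiny app implements Intermittent Lockdown. The capacity of 6000 infections per million is taken from the text above; the trigger and release fractions, the reduction in transmission during lockdown, and the epidemiological parameters are all hypothetical.</p>

```python
def intermittent_lockdown(days=730, n=1_000_000, i0=100, beta=0.3, sigma=0.2,
                          gamma=0.1, capacity=6000, trigger=0.8, release=0.2,
                          lockdown_factor=0.25):
    """Deterministic SEIR model with a threshold-based lockdown switch.

    Lockdown starts once the number of infectious people exceeds
    trigger * capacity and ends once it drops below release * capacity.
    All defaults except `capacity` are hypothetical illustration values.
    Returns the daily number of infectious people and the lockdown status.
    """
    s, e, i, r = float(n - i0), 0.0, float(i0), 0.0
    locked = False
    infectious, lockdown = [], []
    for _ in range(days):
        # Decide whether to switch based on the current number of infectious people.
        if not locked and i > trigger * capacity:
            locked = True
        elif locked and i < release * capacity:
            locked = False
        # Transmission is reduced while the lockdown is in place.
        effective_beta = beta * lockdown_factor if locked else beta
        new_exposed = effective_beta * s * i / n
        new_infectious = sigma * e
        new_recovered = gamma * i
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        infectious.append(i)
        lockdown.append(locked)
    return infectious, lockdown


infectious, lockdown = intermittent_lockdown()
print("days in lockdown:", sum(lockdown))
print("peak infectious per million:", round(max(infectious)))
```

<p>Because people who are already exposed still become infectious after the switch, the number of infections keeps rising for a while after lockdown is reinstated. This delay is one reason why triggering "just before" capacity is reached is risky.</p>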
<div style="text-align:center;">
<img src="../assets/img/exit-strategies-II.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="750" height="550" />
</div>
<p>Flattening the Curve aims to balance the number of infections so that the healthcare system does not become overburdened by relaxing interventions after an initial lockdown. If not enough interventions are lifted (as in this example), herd immunity hardly develops (e.g., see day 400). Conversely, if too many interventions are lifted (or people adhere poorly to interventions), case numbers may increase beyond health care capacity (e.g., see day 500). As the bottom panel shows, this version of Flattening the Curve does not reach herd immunity even after 1200 days.</p>
<p>In contrast to all strategies so far, the Contact Tracing exit strategy does not aim for natural herd immunity. Instead, it aims to keep the number of infections low until a vaccine is developed, with vaccine development being a <a href="https://www.nytimes.com/interactive/2020/06/09/magazine/covid-vaccine.html">highly complex undertaking</a> that may take years. Until that point, due to the low proportion of people who have acquired immunity, large outbreaks are possible at all times, and this is indeed what the figure above shows. There is some debate on how well the testing, tracing, and isolating of infectious and exposed cases will <a href="https://www.sciencemag.org/news/2020/05/countries-around-world-are-rolling-out-contact-tracing-apps-contain-coronavirus-how">work in practice</a>, and you can play around with these parameters in the Shiny app. Heterogeneity might work in our favour, however. Recent estimates suggest that the spread of the novel coronavirus is <a href="https://www.sciencemag.org/news/2020/05/why-do-some-covid-19-patients-infect-many-others-whereas-most-don-t-spread-virus-all">largely driven by superspreading events</a> (see also Althouse et al. <a href="https://arxiv.org/abs/2005.13689">2020</a>), which has <a href="https://www.nytimes.com/2020/06/02/opinion/coronavirus-superspreaders.html">ramifications for control</a>. Heterogeneity in networks that connect individuals can also increase the efficiency of contact tracing (Kojaku et al., <a href="https://arxiv.org/abs/2005.02362">2020</a>).</p>
<p>The <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> describes these exit strategies and their different parameterizations in more detail, and allows you to interactively compare variations of them. Except for Radical Opening, all exit strategies presented above that aim at herd immunity take an extraordinary amount of time to reach it. Indeed, <a href="https://mrc-ide.github.io/covid19estimates/#/total-infected">modeling suggests</a>, and recent seroprevalence studies confirm, that <a href="https://www.nytimes.com/interactive/2020/05/28/upshot/coronavirus-herd-immunity.html">we are far from herd immunity</a>. I am not espousing these types of exit strategies here, and they make me feel a little uneasy (compare <a href="https://medium.com/@tomaspueyo/coronavirus-should-we-aim-for-herd-immunity-like-sweden-b1de3348e88b">the case of Sweden</a>). An assessment of these and other exit strategies that do not aim at herd immunity through natural infection requires input from multiple disciplines, and goes far beyond this blog post and the Shiny app. The goal of the <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> is instead to allow you to see how robust various exit strategies are to changes in their parameters, and how they compare to each other from a purely epidemiological standpoint.</p>
<h1 id="model-extensions">Model extensions</h1>
<p>The modeling work by de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>) is impressive, and I again encourage you to read up on it; see especially their <a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2.supplementary-material">Supplementary 1</a>. Here, I want to briefly mention a number of interesting dimensions along which the model could be extended, with some being more realistic than others.</p>
<p>First, the model currently assumes life-long immunity (or at least immunity for the duration of the simulation), which is unrealistic. Depending on the exact duration of immunity, the dynamics of the exit strategy simulations presented above will change. For an investigation of how seasonality and immunity might influence the course of the pandemic, see Kissler et al. (<a href="https://science.sciencemag.org/content/368/6493/860">2020</a>).</p>
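<p>To illustrate how waning immunity could change the dynamics, the sketch below adds a flow from the recovered back to the susceptible compartment of a deterministic SEIR model, turning it into what is commonly called an SEIRS model. This is a textbook-style simplification rather than an extension of the de Vlas & Coffeng model, and the waning rate <code>omega</code> as well as all other parameter values are hypothetical.</p>

```python
def seirs(days=730, n=1_000_000, i0=100, beta=0.3, sigma=0.2, gamma=0.1, omega=0.0):
    """Deterministic SEIRS model in daily steps.

    omega is the daily rate at which recovered people lose immunity;
    omega = 0 corresponds to life-long immunity. All values are hypothetical.
    Returns a list of (S, E, I, R) tuples, one per day, starting at day 0.
    """
    s, e, i, r = float(n - i0), 0.0, float(i0), 0.0
    path = [(s, e, i, r)]
    for _ in range(days):
        new_exposed = beta * s * i / n
        new_infectious = sigma * e
        new_recovered = gamma * i
        waned = omega * r  # recovered people who become susceptible again
        s += waned - new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered - waned
        path.append((s, e, i, r))
    return path


lifelong = seirs(omega=0.0)      # immunity never wanes
waning = seirs(omega=1 / 180)    # immunity lasts about half a year on average
print("infectious after two years, life-long immunity:", lifelong[-1][2])
print("infectious after two years, waning immunity:", waning[-1][2])
```

<p>With life-long immunity the epidemic burns out after a single wave, whereas with waning immunity the susceptible pool is replenished and infections persist.</p>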
<p>Second, the model currently does not stratify the population according to age, the most important risk factor for mortality. Extending the model in this way would allow one to model interventions targeted at a particular age group, as well as to assess mortality in a more detailed manner. The model currently also does not simulate mortality, so deaths have to be computed from the number of infections and an estimate of the case fatality ratio. Needless to say, if the number of people who require intensive care exceeds the intensive care capacity, mortality will be much higher.</p>
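<p>The back-of-the-envelope computation of deaths from simulated infections amounts to a single multiplication, sketched below. The function name and the numbers in the usage example are made up for illustration; they are not estimates from the model or from the literature.</p>

```python
def expected_deaths(infections, fatality_ratio):
    """Expected number of deaths given a number of infections and a fatality ratio.

    Assumes the fatality ratio applies uniformly to all infections, which ignores
    age structure and any increase in mortality when intensive care is overwhelmed.
    """
    if not 0 <= fatality_ratio <= 1:
        raise ValueError("fatality_ratio must lie between 0 and 1")
    return infections * fatality_ratio


# Hypothetical example: 100,000 infections and a fatality ratio of 1%.
print(expected_deaths(100_000, 0.01))
```

<p>An age-stratified model would replace the single ratio with one ratio per age group and sum the resulting deaths across groups.</p>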
<p>Third, the model assumes that individuals live in clusters (e.g., villages), which are part of super clusters (e.g., provinces), which together make up a country. It allows for heterogeneity among contact rates and preferential mixing of individuals with similar contact behaviour, but currently does not incorporate an explicit network structure. Instead, it assumes that, barring very strong preferential mixing, every individual is connected to every other individual. Adding a network structure would result in a more realistic assessment of interventions such as contact tracing, with potentially large ramifications (e.g., Kojaku et al., <a href="https://arxiv.org/abs/2005.02362">2020</a>).</p>
<p>Fourth, the exit strategies presented above are somewhat monolithic. Except for Radical Opening and Contact Tracing, they work by reducing transmission over the period of time during which measures are in place. Contact Tracing is slightly more involved, and you can read more details in the <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a>. This coarse-grained approach ignores the finer-grained choices governments have to make: should schools be re-opened? What about hairdressers and church services? International travel? A more detailed exploration of the effect of exit strategies would associate each such intervention with a reduction in transmission, and simulate what would happen when it is lifted or enforced. Needless to say, this requires a good understanding of how such interventions reduce virus spread (see, e.g., Chu et al., <a href="https://www.sciencedirect.com/science/article/pii/S0140673620311429">2020</a>), an understanding we are currently lacking. <a href="https://theconversation.com/lockdown-we-need-to-experiment-with-reopenings-now-to-prevent-a-second-wave-138741">Systematic experimentation</a> might help.</p>
<h1 id="multidisciplinary-assessment">Multidisciplinary assessment</h1>
<p>Lastly, the pandemic affects not only the physical health of citizens, but has also inflicted severe economic and psychological damage. While models that focus on a single aspect of the pandemic can yield valuable insights, they should ideally combine different disciplinary perspectives to provide a holistic assessment of exit strategies. Recently, various works have combined economic and epidemiological modeling. For example, using the UK as a case study, Pichler et al. (<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>) compare strategies that differ in which sectors they would reopen; even radical opening would reduce the GDP by 16 percentage points compared to pre-lockdown levels, all the while keeping the effective reproductive number $R_t$ above 1.</p>
<p>But there are other disciplines that could chip in besides epidemiology and economics, such as psychology, law, and history. Some would provide a quantitative assessment, for example by formalizing the effect of different interventions such as opening schools or closing churches. What are the epidemiological effects of opening schools? How do school closures adversely affect the educational development of children? In what ways do they increase existing economic inequalities? Others would provide a more qualitative assessment. For example, what are the legal ramifications of “protecting the elderly”, which sounds sensible but has a discriminatory undertone? From a historical perspective, what lessons can we learn from citizens’ behaviour, such as <a href="https://www.theguardian.com/world/2020/apr/29/coronavirus-pandemic-1918-protests-california">anti-mask protests</a>, in past pandemics? All these interventions and effects interact in complex ways, severely complicating analysis; but who said it would be easy?</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, I have described a <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> which allows you to interactively explore different exit strategies using the epidemiological model described in de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>). I have discussed potential model extensions and the need for a multidisciplinary assessment of exit strategies. Overall, the modeling suggests that exit strategies aimed at the controlled build-up of immunity will take a long time; but so might <a href="https://www.nytimes.com/interactive/2020/science/coronavirus-vaccine-tracker.html">waiting for a vaccine</a>. Best to brace for the long haul.</p>
<hr />
<p>I want to thank Luc Coffeng for an insightful collaboration and valuable comments on this blog post. Thanks also to Denny Borsboom, Tessa Blanken, and Charlotte Tanis for helpful comments on this blog post and for being a great team.</p>
<hr />
<p><em>This blog post has also been posted to the <a href="https://scienceversuscorona.com/interactive-exploration-of-covid-19-exit-strategies">Science versus Corona blog</a>.</em></p>
<h2 id="references">References</h2>
<ul>
<li>Althouse, B. M., Wenger, E. A., Miller, J. C., Scarpino, S. V., Allard, A., Hébert-Dufresne, L., & Hu, H. (<a href="https://arxiv.org/abs/2005.13689">2020</a>). Stochasticity and heterogeneity in the transmission dynamics of SARS-CoV-2. <em>arXiv preprint arXiv:2005.13689</em>.</li>
<li>Chu, D. K., Akl, E. A., Duda, S., Solo, K., Yaacoub, S., Schünemann, H. J., … & Hajizadeh, A. (2020). Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: A systematic review and meta-analysis. <em>The Lancet</em>.</li>
<li>de Vlas, S. J., & Coffeng, L. E. (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>). A phased lift of control: a practical strategy to achieve herd immunity against Covid-19 at the country level. <em>medRxiv</em>.</li>
<li>Flaxman, Mishra, Gandy et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. <em>Nature</em>, 3164.</li>
<li>Kojaku, S., Hébert-Dufresne, L., & Ahn, Y. Y. (<a href="https://arxiv.org/abs/2005.02362">2020</a>). The effectiveness of contact tracing in heterogeneous networks. <em>arXiv preprint arXiv:2005.02362</em>.</li>
<li>Pichler, A., Pangallo, M., del Rio-Chanona, R. M., Lafond, F., & Farmer, J. D. (<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>). Production networks and epidemic spreading: How to restart the UK economy?</li>
</ul>Fabian DablanderThe COVID-19 pandemic will end only when a sufficient number of people have become immune, thus preventing future outbreaks. Principally, so-called exit strategies differ on whether immunity is achieved through natural infections, or whether it is achieved through a vaccine. Countries across the world are scrambling to find an adequate exit strategy, with differential success. To model different exit strategies from an epidemiological standpoint, de Vlas & Coffeng (2020) developed a stochastic individual-based SEIR model which allows for inter-individual differences in how effectively individuals spread the virus and how well individuals adhere to measures designed to curb virus transmission. The model also allows for preferential mixing of individuals with similar contact rates. A key innovation of the model is that it stratifies the population into communities and regions within which transmission mainly occurs. Their paper is excellent and insightful, and I encourage you to read it. To make the underlying model more easily accessible, Luc Coffeng and I have developed a Shiny app that allows you to explore these exit strategies interactively. In this blog post, I provide a brief overview of the Shiny app and ideas about possible model extensions. Note that I am not an epidemiologist, and my aim here is not to endorse different exit strategies nor to make policy recommendations. This work was carried out under the umbrella of Science versus Corona, an initiative I founded together with Denny Borsboom, Tessa Blanken, and Charlotte Tanis. Modeling exit strategies The two figures below illustrate particular parameterizations of five different exit strategies: Radical Opening, Phased Lift of Control, Intermittent Lockdown, Flattening the Curve, and Contact Tracing. The first four strategies aim for a (controlled) build-up of herd immunity through natural infection, while Contact Tracing aims to minimize cases until a vaccine is available. 
Since the model is stochastic, that is, events in the simulations occur randomly according to pre-defined probabilities, the black solid lines in the figures below show a number of possible trajectories. Note that the dashed vertical lines below indicate interventions, with lines before day 0 indicating interventions specific to the Netherlands during the initial lockdown, and lines from day 0 onwards interventions that are specific to the exit strategies. Radical Opening lifts all measures at once on day 0, resulting in a huge increase in the number of infections per million, as the top panel shows. The dashed horizontal line indicates the number of infections at which the intensive care capacity is reached in the Netherlands, which is 6000 infections per million inhabitants. The second panel shows the simulated number of new cases in intensive care per day, with the red dots showing the actual number of cases in intensive care in the Netherlands. The third panel shows the number of cases that are present in intensive care per million; the dashed horizontal line indicates the number of beds per million (115) that are available for COVID-19 cases in the Netherlands. Radical Opening massively overshoots this capacity, which would result in a large number of excess deaths. The bottom panel shows that herd immunity is reached quickly, yet overshoots. Phased Lift of Control, as proposed by de Vlas & Coffeng (2020), splits a country into geographical units and lifts the measures in one unit at a time; the time points at which measures are lifted are indicated by the vertical dotted lines. Phased Lift of Control as presented here does not overburden the healthcare system and thus leads to no excess deaths, in contrast to Radical Opening (note the difference in the $y$-axis). 
However, the strategy still aims at achieving herd immunity naturally, and so depending on who exactly gets infected, there will be deaths proportional to the case fatality ratio of that subpopulation. Phased Lift of Control allows a natural epidemic within the region where measures are being lifted, and so it overshoots herd immunity regionally and therefore nationally as well, as seen in the bottom panel. As a side note, overshoot does not occur when 25% of the participants “remain in hiding” when control measures are lifted (Luc Coffeng, personal communication), which strikes me as a realistic scenario; overall, Phased Lift of Control is robust to this non-participation (see Supplementary 3 in de Vlas & Coffeng, 2020). The intention of Intermittent Lockdown is to reinstate lockdown measures just before intensive care units are at full capacity. Compared to Phased Lift of Control, the Intermittent Lockdown exit strategy does not use the intensive care capacity efficiently, as some intensive care beds remain unused during periods of lockdown (see days 200 - 600). Moreover, the strategy comes with a high risk of overshooting intensive care capacity (see days 0 - 200 and days 600 - 750). Flattening the Curve aims to balance the number of infections so that the healthcare system does not become overburdened by relaxing interventions after an initial lockdown. If not enough interventions are lifted (as in this example), herd immunity hardly develops (e.g., see day 400). Conversely, if too many interventions are lifted (or people adhere poorly to interventions), case numbers may increase beyond health care capacity (e.g., see day 500). As the bottom panel shows, this version of Flattening the Curve does not reach herd immunity even after 1200 days. In contrast to all strategies so far, the Contact Tracing exit strategy does not aim for natural herd immunity. 
Instead, it aims to keep the number of infections low until a vaccine is developed, with vaccine development being a highly complex undertaking that may take years. Until that point, due to the low proportion of people who have acquired immunity, large outbreaks are possible at all times, and this is indeed what the figure above shows. There is some debate on how well the testing, tracing, and isolating of infectious and exposed cases will work in practice, and you can play around with these parameters in the Shiny app. Heterogeneity might work in our favour, however. Recent estimates suggest that the spread of the novel coronavirus is largely driven by superspreading events (see also Althouse et al. 2020), which has ramifications for control. Heterogeneity in networks that connect individuals can also increase the efficiency of contact tracing (Kojaku et al., 2020). The Shiny app describes these exit strategies and their different parameterizations in more detail, and allows you to interactively compare variations of them. Except Radical Opening, all exit strategies that aim at herd immunity presented above take an extraordinary amount of time to reach it. Indeed, modeling suggests, and recent seroprevalence studies confirm, that we are far from herd immunity. I am not espousing these types of exit strategies here, and they make me feel a little uneasy (compare the case of Sweden). An assessment of these and other exit strategies that do not aim at herd immunity through natural infection requires input from multiple disciplines, and goes far beyond this blog post and the Shiny app. The goal of the Shiny app is instead to allow you to see how robust various exit strategies are to changes in their parameters, and how they compare to each other from a purely epidemiological standpoint. Model extensions The modeling work by de Vlas & Coffeng (2020) is impressive, and I again encourage you to read up on it; see especially their Supplementary 1. 
Here, I want to briefly mention a number of interesting dimensions along which the model could be extended, with some being more realistic than others. First, the model currently assumes life-long immunity (or at least immunity for the duration of the simulation), which is unrealistic. Depending on the exact duration of immunity, the dynamics of the exit strategy simulations presented above will change. For an investigation of how seasonality and immunity might influence the course of the pandemic, see Kissler et al. (2020). Second, the model currently does not stratify the population according to age, the most important risk factor for mortality. Extending the model in this way would allow one to model interventions targeted at a particular age group, as well as to assess mortality in a more detailed manner. The model currently also does not simulate mortality, so deaths have to be computed from the number of infections and an estimate of the case fatality ratio. Needless to say, if the number of people who require intensive care exceeds the intensive care capacity, mortality will be much higher. Third, the model assumes that individuals live in clusters (e.g., villages), which are part of super clusters (e.g., provinces), which together make up a country. It allows for heterogeneity among contact rates and preferential mixing of individuals with similar contact behaviour, but currently does not incorporate an explicit network structure. Instead, it assumes that, barring very strong preferential mixing, every individual is connected to every other individual. Adding a network structure would result in a more realistic assessment of interventions such as contact tracing, with potentially large ramifications (e.g., Kojaku et al., 2020). Fourth, the exit strategies presented above are somewhat monolithic. Except for Radical Opening and Contact Tracing, they work by reducing transmission over the period of time during which measures are in place. 
Contact Tracing is slightly more involved, and you can read more details in the Shiny app. This coarse-grained approach ignores the finer-grained choices governments have to make: should schools be re-opened? What about hairdressers and church services? International travel? A more detailed exploration of the effect of exit strategies would associate each such intervention with a reduction in transmission, and simulate what would happen when it is lifted or enforced. Needless to say, this requires a good understanding of how such interventions reduce virus spread (see e.g., Chu et al., 2020), an understanding we are currently lacking. Systematic experimentation might help.

Multidisciplinary assessment

Lastly, the pandemic affects not only the physical health of citizens, but has also inflicted severe economic and psychological damage. While models that focus on a single aspect of the pandemic can yield valuable insights, they should ideally combine different disciplinary perspectives to provide a holistic assessment of exit strategies. Recently, various works have combined economic and epidemiological modeling. For example, using the UK as a case study, Pichler et al. (2020) compare strategies that differ in which sectors they would reopen; even radical opening would reduce the GDP by 16 percentage points compared to pre-lockdown levels, all the while keeping the effective reproductive number $R_t$ above 1. But there are other disciplines that could chip in besides epidemiology and economics, such as psychology, law, and history. Some would provide a quantitative assessment, for example by formalizing the effect of different interventions such as opening schools or closing churches. What are the epidemiological effects of opening schools? How do school closures adversely affect the educational development of children? In what ways do they increase existing economic inequalities? Others would provide a more qualitative assessment.
For example, what are the legal ramifications of “protecting the elderly”, which sounds sensible but has a discriminatory undertone? From a historical perspective, what lessons can we learn from citizens’ behaviour, such as anti-mask protests, in past pandemics? All these interventions and effects interact in complex ways, severely complicating analysis; but who said it would be easy?

Conclusion

In this blog post, I have described a Shiny app which allows you to interactively explore different exit strategies using the epidemiological model described in de Vlas & Coffeng (2020). I have discussed potential model extensions and the need for a multidisciplinary assessment of exit strategies. Overall, the modeling suggests that exit strategies aimed at the controlled build-up of immunity will take a long time; but so might waiting for a vaccine. Best to brace for the long haul. I want to thank Luc Coffeng for an insightful collaboration and valuable comments on this blog post. Thanks also to Denny Borsboom, Tessa Blanken, and Charlotte Tanis for helpful comments on this blog post and for being a great team. This blog post has also been posted to the Science versus Corona blog.

References

Althouse, B. M., Wenger, E. A., Miller, J. C., Scarpino, S. V., Allard, A., Hébert-Dufresne, L., & Hu, H. (2020). Stochasticity and heterogeneity in the transmission dynamics of SARS-CoV-2. arXiv preprint arXiv:2005.13689.
Chu, D. K., Akl, E. A., Duda, S., Solo, K., Yaacoub, S., Schünemann, H. J., … & Hajizadeh, A. (2020). Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: A systematic review and meta-analysis. The Lancet.
de Vlas, S. J., & Coffeng, L. E. (2020). A phased lift of control: a practical strategy to achieve herd immunity against Covid-19 at the country level. medRxiv.
Flaxman, Mishra, Gandy et al. (2020). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. Nature, 3164.
Kojaku, S., Hébert-Dufresne, L., & Ahn, Y. Y. (2020). The effectiveness of contact tracing in heterogeneous networks. arXiv preprint arXiv:2005.02362.
Pichler, A., Pangallo, M., del Rio-Chanona, R. M., Lafond, F., & Farmer, J. D. (2020). Production networks and epidemic spreading: How to restart the UK economy?

Infectious diseases and nonlinear differential equations
2020-03-22T12:30:00+00:00 2020-03-22T12:30:00+00:00 https://fabiandablander.com/r/Nonlinear-Infection
<p>Last summer, I wrote about <a href="https://fabiandablander.com/r/Linear-Love.html">love affairs and linear differential equations</a>. While the topic is cheerful, linear differential equations are severely limited in the types of behaviour they can model. In this blog post, which I wrote in self-quarantine to prevent further spread of SARS-CoV-2 (take that, cheerfulness), I introduce nonlinear differential equations as a means to model infectious diseases. In particular, we will discuss the simple SIR and SIRS models, the building blocks of many of the more complicated models used in epidemiology.</p>
<p>Before doing so, however, I discuss some of the basic tools of nonlinear dynamics applied to the logistic equation as a model for population growth. If you are already familiar with this, you can skip ahead. If you have had no prior experience with differential equations, I suggest you first check out my <a href="https://fabiandablander.com/r/Linear-Love.html">earlier post</a> on the topic.</p>
<p>I should preface this by saying that I am not an epidemiologist, and that no analysis I present here is specifically related to the current SARS-CoV-2 pandemic, nor should anything I say be interpreted as giving advice or making predictions. I am merely interested in differential equations, and as with love affairs, infectious diseases make for a good illustrative case. So without further ado, let’s dive in!</p>
<h1 id="modeling-population-growth">Modeling Population Growth</h1>
<p>Before we start modeling infectious diseases, it pays to introduce the concepts needed to analyze nonlinear differential equations using a simple example: modeling population growth. Let $N > 0$ denote the size of a population and assume that its growth rate is proportional to its size:</p>
\[\frac{dN}{dt} = \dot{N} = r N \enspace .\]
<p>As shown in a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>, this leads to exponential growth for $r > 0$:</p>
\[N(t) = N_0 e^{r t} \enspace ,\]
<p>where $N_0 = N(0)$ is the initial population size at time $t = 0$. The figure below visualizes the differential equation (left panel) and its solution (right panel) for $r = 1$ and an initial population of $N_0 = 2$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>This is clearly not a realistic model since the growth of a population depends on resources, which are finite. To model finite resources, we write:</p>
\[\dot{N} = rN \left(1 - \frac{N}{K}\right) \enspace ,\]
<p>where $r > 0$ and $K$ is the so-called <em>carrying capacity</em>, that is, the maximum population size that can be sustained by the available resources. Observe that as $N$ grows towards $K$, the factor $(1 - N / K)$ gets smaller, reducing the per-capita growth rate $\dot{N} / N$. If on the other hand $N > K$, then the population needs more resources than are available, and the growth rate becomes negative, resulting in population decrease.</p>
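<p>A minimal sketch of this right-hand side makes the sign change around $K$ concrete. The helper name <code>logistic_rate</code> and the values $r = 1$ and $K = 100$ are illustrative assumptions, not part of the original analysis:</p>

```r
# Hypothetical helper: right-hand side of the logistic equation.
logistic_rate <- function(N, r = 1, K = 1) {
  r * N * (1 - N / K)
}

# Evaluate the growth rate below, at, and above an assumed carrying capacity K = 100.
rates <- logistic_rate(c(50, 100, 150), r = 1, K = 100)
rates  # positive below K, zero at K, negative above K
```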
<p>For simplicity, let $K = 1$ and interpret $N \in [0, 1]$ as the proportion with respect to the carrying capacity; that is, $N = 1$ implies that we are at carrying capacity. The figure below visualizes the differential equation and its solution for $r = 1$ and an initial condition $N_0 = 0.10$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In contrast to exponential growth, the logistic equation leads to sigmoidal growth which approaches the carrying capacity. This is much more interesting behaviour than the linear differential equation above allows. In particular, the logistic equation has two <em>fixed points</em> — points at which the population neither increases nor decreases but stays fixed, that is, where $\dot{N} = 0$. These occur at $N = 0$ and at $N = 1$, as can be inferred from the left panel in the figure above.</p>
<h2 id="analyzing-the-stability-of-fixed-points">Analyzing the Stability of Fixed Points</h2>
<p>What is the stability of these fixed points? Intuitively, $N = 0$ should be unstable; if there are individuals, then they procreate and the population increases. Similarly, $N = 1$ should be stable: if $N < 1$, then $\dot{N} > 0$ and the population grows towards $N = 1$, and if $N > 1$, then $\dot{N} < 0$ and individuals die until $N = 1$.</p>
<p>To make this argument more rigorous, and to get a more quantitative assessment of how quickly perturbations move away from or towards a fixed point, we derive a differential equation for these small perturbations close to the fixed point (see also Strogatz, 2015, p. 24). Let $N^{\star}$ denote a fixed point and define $\eta(t) = N(t) - N^{\star}$ to be a small perturbation close to the fixed point. We derive a differential equation for $\eta$ by writing:</p>
\[\frac{d\eta}{dt} = \frac{d}{dt}\left(N(t) - N^{\star}\right) = \frac{dN}{dt} \enspace ,\]
<p>since $N^{\star}$ is a constant. This implies that the dynamics of the perturbation equal the dynamics of the population. Let $f(N)$ denote the right-hand side of the differential equation for $N$, and observe that $N = N^{\star} + \eta$ such that $\dot{N} = \dot{\eta} = f(N) = f(N^{\star} + \eta)$. Recall that $f$ is a nonlinear function, and nonlinear functions are messy to deal with. Thus, we simply pretend that the function is linear close to the fixed point. More precisely, we approximate $f$ around the fixed point using a Taylor series (see <a href="https://www.youtube.com/watch?v=3d6DsjIBzJ4">this excellent video</a> for details) by writing:</p>
\[f(N^{\star} + \eta) = f(N^{\star}) + \eta f'(N^{\star}) + \mathcal{O}(\eta^2) \enspace ,\]
<p>where $\mathcal{O}(\eta^2)$ collects the higher-order terms, which we ignore from here on. Note that, by definition, there is no change at the fixed point, that is, $f(N^{\star}) = 0$. Assuming that $f'(N^{\star}) \neq 0$ (otherwise the linear term vanishes and the higher-order terms determine the dynamics), we have that close to a fixed point</p>
\[\dot{\eta} \approx \eta f'(N^{\star}) \enspace ,\]
<p>which is a linear differential equation with solution:</p>
\[\eta(t) = \eta_0 e^{f'(N^{\star})t} \enspace .\]
<p>Using this trick, we can assess the stability of $N^{\star}$ as follows. If $f'(N^{\star}) < 0$, the small perturbation $\eta(t)$ around the fixed point decays towards zero, and so the system returns to the fixed point; the fixed point is stable. On the other hand, if $f'(N^{\star}) > 0$, then the small perturbation $\eta(t)$ close to the fixed point grows, and so the system does not return to the fixed point; the fixed point is unstable. Applying this to our logistic equation, we see that:</p>
\[\begin{aligned}
f'(N) &= \frac{d}{dN} \left(rN(1 - N)\right) \\[0.50em]
&= \frac{d}{dN} \left(rN - rN^2\right) \\[0.50em]
& = r - 2rN \\[0.50em]
&= r(1 - 2N) \enspace .
\end{aligned}\]
<p>Plugging in our two fixed points $N^{\star} = 0$ and $N^{\star} = 1$, we find that $f'(0) = r$ and $f'(1) = -r$. Since $r > 0$, this confirms our suspicion that $N^{\star} = 0$ is unstable and $N^{\star} = 1$ is stable. In addition, this analysis tells us how quickly the perturbations grow or decay; for the logistic equation, the rate is $r$ at both fixed points.</p>
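<p>The same classification can be checked numerically. The sketch below approximates $f'(N^{\star})$ with a central finite difference; the helper names <code>f_prime</code> and <code>classify</code> and the step size <code>h</code> are assumptions made for illustration:</p>

```r
# Right-hand side of the logistic equation with K = 1.
f <- function(N, r = 1) r * N * (1 - N)

# Central finite-difference approximation of f'(N); h is an assumed step size.
f_prime <- function(N, r = 1, h = 1e-6) {
  (f(N + h, r) - f(N - h, r)) / (2 * h)
}

# Classify a fixed point by the sign of the approximated derivative.
classify <- function(N_star, r = 1) {
  slope <- f_prime(N_star, r)
  if (slope < 0) "stable" else if (slope > 0) "unstable" else "undetermined"
}

c(classify(0), classify(1))
```

<p>For $r = 1$, the approximated slopes are close to $r$ at $N^{\star} = 0$ and close to $-r$ at $N^{\star} = 1$, in line with the derivation.</p>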
<p>In sum, we have linearized a nonlinear system close to fixed points in order to assess the stability of these fixed points, and how quickly perturbations close to these fixed points grow or decay. This technique is called <em>linear stability analysis</em>. In the next two sections, we discuss two ways to solve differential equations using the logistic equation as an example.</p>
<h2 id="analytic-solution">Analytic Solution</h2>
<p>In contrast to linear differential equations, which were the topic of a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>, nonlinear differential equations usually cannot be solved analytically; that is, we generally cannot get an expression that, given an initial condition, tells us the state of the system at any time point $t$. The logistic equation can, however, be solved analytically and it might be instructive to see how. We write:</p>
\[\begin{aligned}
\frac{dN}{dt} &= rN (1 - N) \\
\frac{dN}{N(1 - N)} &= r dt \\
\int \frac{1}{N(1 - N)} dN &= r t \enspace .
\end{aligned}\]
<p>Staring at this for a bit, we realize that we can use partial fractions to split the integral. We write:</p>
\[\begin{aligned}
\int \frac{1}{N(1 - N)} dN &= r t \\[0.50em]
\int \frac{1}{N} dN + \int \frac{1}{1 - N}dN &= rt \\[0.50em]
\text{log}N - \text{log}(1 - N) + Z &= rt \\[0.50em]
e^{\text{log}N - \text{log}(1 - N) + Z} &= e^{rt} \enspace .
\end{aligned}\]
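<p>The split in the second line rests on a partial-fraction identity, which can be verified for $0 < N < 1$ by putting the two fractions over a common denominator:</p>
\[\frac{1}{N} + \frac{1}{1 - N} = \frac{(1 - N) + N}{N(1 - N)} = \frac{1}{N(1 - N)} \enspace .\]
<p>The minus sign in front of $\text{log}(1 - N)$ in the third line arises because $\frac{d}{dN} \text{log}(1 - N) = -\frac{1}{1 - N}$.</p>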
<p>The exponents and the logs cancel each other nicely. We write:</p>
\[\begin{aligned}
\frac{e^{\text{log}N}}{e^{\text{log}(1 - N)}}e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} &= e^{rt - Z} \\[0.50em]
N &= e^{rt - Z} - N e^{rt - Z} \\[0.50em]
N\left(1 + e^{rt - Z}\right) &= e^{rt - Z} \\[0.50em]
N &= \frac{e^{rt - Z}}{1 + e^{rt - Z}} \enspace .
\end{aligned}\]
<p>One last trick is to multiply both the numerator and the denominator by $e^{-rt + Z}$, which yields:</p>
\[N = \frac{\left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)}{\left(e^{-rt + Z}\right) + \left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)} = \frac{1}{1 + e^{-rt + Z}} \enspace ,\]
<p>where $Z$ is the constant of integration. To solve for it, we need the initial condition. Suppose that $N(0) = N_0$, which, using the third line in the derivation above and the fact that $t = 0$, leads to:</p>
\[\begin{aligned}
\text{log}N_0 - \text{log}(1 - N_0) + Z &= 0 \\[0.50em]
\text{log}N_0 - \text{log}(1 - N_0) &= -Z \\[0.50em]
\frac{N_0}{1 - N_0} &= e^{-Z} \\[0.50em]
\frac{1 - N_0}{N_0} &= e^{Z} \enspace .
\end{aligned}\]
<p>Plugging this into our solution from above yields:</p>
\[N(t) = \frac{1}{1 + e^{-rt + Z}} = \frac{1}{1 + \frac{1 - N_0}{N_0} e^{-rt}} \enspace .\]
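<p>As a quick sanity check, the closed-form solution can be coded up directly. This is a sketch under stated assumptions: the helper name <code>logistic_analytic</code> and the values $N_0 = 0.99$ and $t = 1$ are illustrative. It confirms that the initial condition is recovered at $t = 0$, and that close to $N^{\star} = 1$ the perturbation decays roughly like $\eta_0 e^{-rt}$, as the linear stability analysis suggests:</p>

```r
# Hypothetical helper implementing the closed-form solution (with K = 1).
logistic_analytic <- function(t, N0, r = 1) {
  1 / (1 + (1 - N0) / N0 * exp(-r * t))
}

# The initial condition is recovered at t = 0.
N_start <- logistic_analytic(0, N0 = 0.1)

# Close to the stable fixed point N* = 1, compare the exact perturbation
# eta(t) = N(t) - 1 with the linearized prediction eta_0 * exp(-r * t).
N0 <- 0.99
t_check <- 1
eta_exact <- logistic_analytic(t_check, N0) - 1
eta_linear <- (N0 - 1) * exp(-1 * t_check)
c(eta_exact = eta_exact, eta_linear = eta_linear)
```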
<p>While this was quite a hassle, other nonlinear differential equations are much, much harder to solve, and most do not admit a closed-form solution — or at least if they do, the resulting expression is generally not very intuitive. Luckily, we can compute the time-evolution of the system using numerical methods, as illustrated in the next section.</p>
<h2 id="numerical-solution">Numerical Solution</h2>
<p>A differential equation implicitly encodes how the system we model changes over time. Specifically, given a particular (potentially high-dimensional) state of the system at time point $t$, $\mathbf{x}_t$, we know in which direction and how quickly the system will change because this is exactly what is encoded in the differential equation $\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = f(\mathbf{x})$. This suggests the following numerical approximation: Assume we know the state of the system at a (discrete) time point $n$, denoted $\mathbf{x}_n$, and that the change in the system is constant over a small interval $\Delta t$. Then, the state of the system at time point $n + 1$ is given by:</p>
\[\mathbf{x}_{n + 1} = \mathbf{x}_n + \Delta t \cdot f(\mathbf{x}_n) \enspace .\]
<p>$\Delta t$ is an important parameter, encoding over what time period we assume the change $f$ to be constant. We can code this up in R for the logistic equation:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_logistic</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N0</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">N0</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w">
</span><span class="n">dN</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># Euler</span><span class="w">
</span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dN</span><span class="p">(</span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">])</span><span class="w">
</span><span class="c1"># Improved Euler</span><span class="w">
</span><span class="c1"># k <- N[i-1] + delta_t * dN(N[i-1])</span><span class="w">
</span><span class="c1"># N[i] <- N[i-1] + 1 /2 * delta_t * (dN(N[i-1]) + dN(k))</span><span class="w">
</span><span class="c1"># Runge-Kutta 4th order</span><span class="w">
</span><span class="c1"># k1 <- dN(N[i-1]) * delta_t</span><span class="w">
</span><span class="c1"># k2 <- dN(N[i-1] + k1/2) * delta_t</span><span class="w">
</span><span class="c1"># k3 <- dN(N[i-1] + k2/2) * delta_t</span><span class="w">
</span><span class="c1"># k4 <- dN(N[i-1] + k3) * delta_t</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># N[i] <- N[i-1] + 1/6 * (k1 + 2*k2 + 2*k3 + k4)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">N</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Clearly, the accuracy of this approximation is a function of $\Delta t$. To see how, the left panel of the figure below shows the approximation for various values of $\Delta t$, while the right panel shows the (log) absolute error as a function of (log) $\Delta t$. The error is defined as:</p>
\[E = |N(10) - \hat{N}(10)| \enspace ,\]
<p>where $\hat{N}$ is the Euler approximation.</p>
<!-- The figure gives some intuition how the accuracy of the approximation changes as we change $\Delta_t$ and the approximation method. In particular, the left panel shows the Euler approximation for various $\Delta t$, while the right panel shows the approximation for the Runga-Kutta method (see commented out code above). -->
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>The right panel shows an approximately straight line with slope one on the log-log scale, that is, for some constant $c$:</p>
\[\begin{aligned}
\text{log } E &\approx \text{log } \Delta t + c \\[0.50em]
E &\propto \Delta t \enspace .
\end{aligned}\]
<p>Therefore, the error goes down linearly with $\Delta t$. Other methods, such as the improved Euler method or <a href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods">Runge-Kutta solvers</a> (see commented out code above), have errors that shrink faster as $\Delta t$ decreases. However, it is ill-advised to choose $\Delta t$ extremely small, because this increases computation time and can lead to floating-point round-off errors that accumulate over the many additional steps.</p>
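<p>The linear scaling of the Euler error can be checked with a short experiment. The following is a sketch rather than the code behind the figure: the helpers <code>logistic_exact</code> and <code>euler_logistic</code>, the step sizes, and the use of <code>lm</code> to estimate the slope are all assumptions made for illustration:</p>

```r
# Hypothetical helpers mirroring the analytic solution and the Euler scheme.
logistic_exact <- function(t, N0, r = 1) 1 / (1 + (1 - N0) / N0 * exp(-r * t))

euler_logistic <- function(N0, r = 1, delta_t = 0.01, t_end = 10) {
  n_steps <- round(t_end / delta_t)
  N <- N0
  for (i in seq_len(n_steps)) {
    N <- N + delta_t * r * N * (1 - N)  # one Euler step
  }
  N  # Euler approximation of N(t_end)
}

# Absolute error at t = 10 for a few assumed step sizes.
delta_ts <- c(0.02, 0.01, 0.005, 0.0025)
errors <- sapply(delta_ts, function(dt) {
  abs(logistic_exact(10, N0 = 0.1) - euler_logistic(N0 = 0.1, delta_t = dt))
})

# Slope of log error against log step size; a value near 1 indicates
# that the error is proportional to delta_t.
slope <- unname(coef(lm(log(errors) ~ log(delta_ts)))[2])
slope
```

<p>Halving $\Delta t$ should roughly halve the error, which corresponds to a slope close to one.</p>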
<!-- We see that the [Runge-Kutta method](https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods) (of $4^{\text{th}}$ order) performs better. While the figure shows that the error is drastically reduced with smaller step sizes $\Delta t$, it is ill-advised to choose $\Delta t$ extremely small: Decreasing $\Delta t$ leads to comparatively more computations, and this increases computation time but also can lead to accuracy errors which get exacerbated over time. -->
<p>In summary, we have seen that nonlinear differential equations can model interesting behaviour such as multiple fixed points; how to classify the stability of these fixed points using linear stability analysis; and how to numerically solve nonlinear differential equations. In the remainder of this post, we study coupled nonlinear differential equations — the SIR and SIRS models — as a way to model the spread of infectious diseases.</p>
<h1 id="modeling-infectious-diseases">Modeling Infectious Diseases</h1>
<p>Many models have been proposed as tools to understand epidemics. In the following sections, I focus on the two simplest ones: the SIR and the SIRS model (see also Hirsch, Smale, Devaney, 2013, ch. 11).</p>
<h2 id="the-sir-model">The SIR Model</h2>
<p>We use the SIR model to understand the spread of infectious diseases. The SIR model is the most basic <em>compartmental</em> model, meaning that it groups the overall population into distinct sub-populations: a susceptible population $S$, an infected population $I$, and a recovered population $R$. We make a number of further simplifying assumptions. First, we assume that the overall population is $1 = S + I + R$ so that $S$, $I$, and $R$ are proportions. We further assume that the overall population does not change, that is,</p>
\[\frac{d}{dt} \left(S + I + R\right) = 0 \enspace .\]
<p>Second, the SIR model assumes that once a person has been infected and has recovered, the person cannot become infected again — we will relax this assumption later on. Third, the model assumes that the rate of transmission of the disease is proportional to the number of encounters between susceptible and infected persons. We model this by setting</p>
\[\frac{dS}{dt} = - \beta IS \enspace ,\]
<p>where $\beta > 0$ is the rate of infection. Fourth, the model assumes that the growth of the recovered population is proportional to the proportion of people that are infected, that is,</p>
\[\frac{dR}{dt} = \gamma I \enspace ,\]
<p>where $\gamma > 0$ is the recovery rate. Since the overall population is constant, these two equations naturally lead to the following equation for the infected:</p>
\[\begin{aligned}
\frac{d}{dt} \left(S + I + R\right) &= 0 \\[0.50em]
\frac{dI}{dt} &= - \frac{dS}{dt} - \frac{dR}{dt} \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace ,
\end{aligned}\]
<p>where $\beta I S$ gives the proportion of newly infected individuals per unit of time and $\gamma I$ gives the proportion of newly recovered individuals per unit of time. Observe that since we assumed that the overall population does not change, we only need to focus on two of these subgroups, since $R(t) = 1 - S(t) - I(t)$. The system is therefore fully characterized by</p>
\[\begin{aligned}
\frac{dS}{dt} &= - \beta IS \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace .
\end{aligned}\]
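<p>A small check illustrates why tracking two compartments suffices: the three rates of change sum to zero, so $S + I + R$ stays constant. The helper name <code>sir_rates</code> and the parameter values are illustrative assumptions:</p>

```r
# Hypothetical helper returning the three SIR derivatives at a given state.
sir_rates <- function(S, I, beta = 1, gamma = 1) {
  dS <- -beta * I * S
  dR <- gamma * I
  dI <- beta * I * S - gamma * I
  c(dS = dS, dI = dI, dR = dR)
}

# Assumed state and parameters, chosen only for illustration.
rates <- sir_rates(S = 0.95, I = 0.05, beta = 4, gamma = 1)
sum(rates)  # zero up to floating-point error, so S + I + R stays constant
```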
<p>Before we analyze this model mathematically, let’s implement Euler’s method and visualize some trajectories.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="w">
</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">dimnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Time'</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dS</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">i</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s1">'Set1'</span><span class="p">)</span><span class="w">
</span><span class="n">time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="w">
</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Subpopulations'</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Days'</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w">
</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.75</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">
</span><span class="n">xaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">yaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="w">
</span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">0.65</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">),</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$ and an increasing rate of infection $\beta$ for the initial condition $S_0 = 0.95$, $I_0 = 0.05$, and $R_0 = 0$. We take a time step $\Delta t = 1$ to denote one day. (Unfortunately, epidemics take much longer in real life.)</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>For $\beta = 1/8$, no outbreak occurs (left panel). Instead, the proportions of susceptible and infected people decrease monotonically while the proportion of recovered people increases monotonically. The middle panel, on the other hand, shows a small outbreak: the proportion of infected people rises, but then falls again. The right panel shows an outbreak as well, but a more severe one, as the proportion of infected people rises more sharply before it eventually decreases again.</p>
<p>How do things change when we change the recovery rate $\gamma$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller recovery rate $\gamma = 1/12$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>We again observe no outbreak in the left panel, and outbreaks of increasing severity in both the middle and the right panel. In contrast to the results for $\gamma = 1/8$, the outbreak is more severe, as we would expect since the recovery rate with $\gamma = 1/12$ is now lower. In fact, whether an outbreak occurs or not and how severe it will be depends not on $\beta$ and $\gamma$ alone, but on their ratio. This ratio is known as $R_0 = \beta / \gamma$, pronounced “R-naught”. (Note the unfortunate choice of well-established terminology in this context, as $R_0$ also denotes the initial proportion of recovered people; it should be clear from the context which one is meant, however.) We can think of $R_0$ as the average number of people an infected person will infect before she gets better (assuming a population that is fully susceptible). If $R_0 > 1$, an outbreak occurs. In the next section, we look for the fixed points of this system and assess their stability.</p>
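<p>As a quick check of this threshold, the sketch below computes $R_0$ for a few values of $\beta$ with $\gamma = 1/8$ and tests whether the infected proportion grows at the initial condition, that is, whether $\dot{I} > 0$ at $(S_0, I_0) = (0.95, 0.05)$. The helper names <code>R0</code> and <code>initial_growth</code> and the values $\beta = 2/8$ and $\beta = 3/8$ are assumptions chosen for illustration; they are not necessarily the values used in the figures above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Basic reproduction number R0 = beta / gamma
R0 <- function(beta, gamma) beta / gamma

# dI/dt = beta * I * S - gamma * I evaluated at the initial condition;
# the infected proportion grows initially only if beta * S0 > gamma
initial_growth <- function(beta, gamma, S0 = 0.95, I0 = 0.05) {
  beta * I0 * S0 - gamma * I0
}

gamma <- 1/8
betas <- c(1/8, 2/8, 3/8)  # illustrative values (assumption)

out <- data.frame(
  beta = betas,
  R0 = R0(betas, gamma),
  dI0 = initial_growth(betas, gamma)
)
out$outbreak <- out$dI0 > 0  # TRUE if I increases at time zero
print(out)</code></pre></figure>
<p>Because $S_0 = 0.95$ rather than $1$, the infected proportion grows initially only if $R_0 S_0 > 1$, which is slightly stricter than $R_0 > 1$; for $\beta = \gamma = 1/8$ the initial change in $I$ is negative.</p>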
<h2 id="analyzing-fixed-points">Analyzing Fixed Points</h2>
<p>A glance at the above figures suggests that the SIR model allows for multiple stable states. The left panels, for example, show that if there is no outbreak, the proportion of susceptible people stays above the proportion of recovered people. If there is an outbreak, however, then it always fades and the proportion of recovered people will be higher than the proportion of susceptible people; how much higher depends on the severity of the outbreak.</p>
<p>While we could play around some more with visualisations, it pays to do a formal analysis. Note that in contrast to the logistic equation, which only modelled a single variable — population size — an analysis of the SIR model requires us to handle two variables, $S$ and $I$; the third one, $R$, follows from the assumption of a constant population size. At the fixed points, nothing changes, that is, we have:</p>
\[\begin{aligned}
0 &= - \beta IS \\[0.50em]
0 &= \beta IS - \gamma I \enspace .
\end{aligned}\]
<p>This can only happen when $I = 0$, irrespective of the value of $S$. In other words, all $(S^{\star}, I^{\star}) = (S, 0)$ are fixed points; if nobody is infected, the disease cannot spread, and so everybody stays either susceptible or recovered. To assess the stability of these fixed points, we again derive a differential equation for the perturbations close to the fixed point. However, note that in contrast to the one-dimensional case studied above, perturbations can now be with respect to $I$ or to $S$. Let $u = S - S^{\star}$ and $v = I - I^{\star}$ be the respective perturbations, and let $\dot{S} = f(S, I)$ and $\dot{I} = g(S, I)$. We first derive a differential equation for $u$, writing:</p>
\[\dot{u} = \frac{d}{dt}\left(S - S^{\star}\right) = \dot{S} \enspace ,\]
<p>since $S^{\star}$ is a constant. This implies that $u$ behaves as $S$. In contrast to the one-dimensional case above, we have two <em>coupled</em> differential equations, and so we have to take into account how $u$ changes as a function of both $S$ and $I$. We Taylor expand at the fixed point $(S^{\star}, I^{\star})$:</p>
\[\begin{aligned}
\dot{u} &= f(u + S^{\star}, v + I^{\star}) \\[0.50em]
&= f(S^{\star}, I^{\star}) + u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} + \mathcal{O}(u^2, v^2, uv) \\[0.50em]
&\approx u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} \enspace ,
\end{aligned}\]
<p>since $f(S^{\star}, I^{\star}) = 0$ and we drop higher-order terms. Note that taking the partial derivative of $f$ with respect to $S$ (or $I$) yields a function, and the subscripts $(S^{\star}, I^{\star})$ mean that we evaluate this function at the fixed point $(S^{\star}, I^{\star})$. We can similarly derive a differential equation for $v$:</p>
\[\dot{v} \approx u \frac{\partial g}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial g}{\partial I}_{(S^{\star}, I^{\star})} \enspace .\]
<p>We can write all of this concisely using matrix algebra:</p>
\[\begin{pmatrix}
\dot{u} \\
\dot{v}
\end{pmatrix} =
\begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})}
\begin{pmatrix}
u \\
v
\end{pmatrix} \enspace ,\]
<p>where</p>
\[J = \begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})}\]
<p>is called the <em>Jacobian matrix</em> at the fixed point $(S^{\star}, I^{\star})$. The Jacobian gives the linearized dynamics close to a fixed point, and therefore tells us how perturbations will evolve close to a fixed point.</p>
<p>In contrast to unidimensional systems, where we simply check whether the slope is positive or negative, that is, whether $f'(x^\star) < 0$ or $f'(x^\star) > 0$, the test for whether a fixed point is stable is slightly more complicated in multidimensional settings. In fact, and not surprisingly, since we have <em>linearized</em> this nonlinear differential equation, the check is the same as in <a href="https://fabiandablander.com/r/Linear-Love.html">linear systems</a>: we compute the eigenvalues $\lambda_1$ and $\lambda_2$ of $J$, observing that negative eigenvalues mean exponential decay and positive eigenvalues mean exponential growth along the directions of the respective eigenvectors. (Note that this does not work for all types of fixed points, see Strogatz (2015, p. 152).)</p>
<p>What does this mean for our SIR model? First, let’s derive the Jacobian:</p>
\[\begin{aligned}
J &= \begin{pmatrix}
-\frac{\partial}{\partial S} \beta I S & -\frac{\partial }{\partial I} \beta I S \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
& =
\begin{pmatrix}
-\beta I & -\beta S \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned}\]
<p>Evaluating this at the fixed point $(S^{\star}, I^{\star}) = (S, 0)$ results in:</p>
\[J_{(S, 0)} = \begin{pmatrix} 0 & -\beta S \\ 0 & \beta S - \gamma \end{pmatrix} \enspace .\]
<p>Since this matrix is upper triangular — all entries below the diagonal are zero — the eigenvalues are given by the diagonal, that is, $\lambda_1 = 0$ and $\lambda_2 = \beta S - \gamma$. $\lambda_1 = 0$ implies a constant solution, while $\lambda_2 > 0$ implies exponential growth and $\lambda_2 < 0$ exponential decay of the perturbations close to the fixed point. Observe that $\lambda_2$ is not only a function of the parameters $\beta$ and $\gamma$, but also of the proportion of susceptible individuals $S$. We find that $\lambda_2 > 0$ for $S > \gamma / \beta$, which results in an unstable fixed point. On the other hand, we have that $\lambda_2 < 0$ for $S < \gamma / \beta$, which results in a stable fixed point. In the next section, we will use vector fields in order to get more intuition for the dynamics of the system.</p>
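<p>The eigenvalues can also be checked numerically. The sketch below builds the Jacobian derived above and evaluates it at two fixed points $(S, 0)$ on either side of $S = \gamma / \beta$. The function name <code>jacobian_SIR</code>, the parameter values $\beta = 3/8$ and $\gamma = 1/8$, and the two values of $S$ are choices made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Analytic Jacobian of (dS, dI) = (-beta * I * S, beta * I * S - gamma * I)
jacobian_SIR <- function(S, I, beta, gamma) {
  matrix(
    c(-beta * I, -beta * S,
       beta * I,  beta * S - gamma),
    nrow = 2, byrow = TRUE
  )
}

beta <- 3/8
gamma <- 1/8

# Eigenvalues at fixed points (S, 0) below and above S = gamma / beta = 1/3
for (S in c(0.2, 0.8)) {
  lambda <- eigen(jacobian_SIR(S, 0, beta, gamma))$values
  cat('S =', S, ' eigenvalues:', lambda, '\n')
}</code></pre></figure>
<p>In both cases one eigenvalue is zero and the other equals $\beta S - \gamma$, which is negative for $S = 0.2$ and positive for $S = 0.8$. In the linearized system this means that a small perturbation in the proportion of infected people evolves as $v(t) = v_0 e^{\lambda_2 t}$, shrinking in the first case and growing in the second.</p>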
<h2 id="vector-field-and-nullclines">Vector Field and Nullclines</h2>
<p>A vector field shows for any position $(S, I)$ in which direction the system moves, which we indicate by the head of an arrow, and how quickly, which we indicate by the length of an arrow. We use the R code below to visualize such a vector field and selected trajectories on it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'fields'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_vectorfield_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">))</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">SI</span><span class="p">[</span><span class="n">apply</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="c1"># S + I <= 1 must hold</span><span class="w">
</span><span class="n">dSI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">dS</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w">
</span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.2</span><span class="p">,</span><span class="w"> </span><span class="m">1.2</span><span class="p">),</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">arrow.plot</span><span class="p">(</span><span class="w">
</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w">
</span><span class="n">arrow.ex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.075</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.05</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray82'</span><span class="p">,</span><span class="w"> </span><span class="n">xpd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">cx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="n">cn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.05</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">-.075</span><span class="p">,</span><span class="w"> </span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cn</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.05</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cn</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.03</span><span class="p">,</span><span class="w"> </span><span class="m">-.04</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.03</span><span class="p">,</span><span class="w"> </span><span class="m">.975</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.995</span><span class="p">,</span><span class="w"> </span><span class="m">-0.04</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>For $\beta = 1/8$ and $\gamma = 1/8$, we know from above that no outbreak occurs. The vector field shown in the left panel below further illustrates that, since $S \leq \gamma / \beta = 1$, all fixed points $(S^{\star}, I^{\star}) = (S, 0)$ are stable. In contrast, we know that $\beta = 3/8$ and $\gamma = 1/8$ result in an outbreak. The vector field shown in the right panel below indicates that fixed points with $S > \gamma / \beta = 1/3$ are unstable, while fixed points with $S < 1/3$ are stable; the dotted line is $S = 1/3$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
<p>Can we find some structure in such vector fields? One way to “organize” them is by drawing so-called <em>nullclines</em>. In our case, the $I$-nullcline gives the set of points for which $\dot{I} = 0$, and the $S$-nullcline gives the set of points for which $\dot{S} = 0$. We find these points in a similar manner to finding fixed points, but instead of setting both $\dot{S}$ and $\dot{I}$ to zero, we tackle them one at a time.</p>
<p>The $S$-nullclines are given by the $S$- and the $I$-axes, because $\dot{S} = 0$ when $S = 0$ or when $I = 0$. Along the $I$-axis we have $\dot{I} = - \gamma I$ since $S = 0$, resulting in exponential decay of the infected population; this is indicated by the grey arrows along the $I$-axis, which become progressively shorter the closer they get to the origin.</p>
<p>The $I$-nullclines are given by $I = 0$ and by $S = \gamma / \beta$. For $I = 0$, we have $\dot{S} = 0$ and so these yield fixed points. For $S = \gamma / \beta$ we have $\dot{S} = - \gamma I$, so the proportion of susceptible people decreases, but since $\dot{I} = 0$, the proportion of infected people does not change; this is indicated in the right vector field above, where we have horizontal arrows at the dotted line given by $S = \gamma / \beta$. However, this only holds for the briefest of moments, since $S$ decreases and for $S < \gamma / \beta$ we again have $\dot{I} < 0$, and so the proportion of infected people goes down to the left of the line. Similarly, to the right of the line we have $S > \gamma / \beta$, which results in $\dot{I} > 0$, and so the proportion of infected people grows.</p>
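<p>The sign change of $\dot{I}$ around the line $S = \gamma / \beta$ can also be checked numerically. The following is a minimal sketch using $\beta = 3/8$ and $\gamma = 1/8$; the helper name <code>dI_SIR</code>, the value $I = 0.1$, and the offsets of $0.1$ are assumptions chosen purely for illustration.</p>

```r
# Rate of change of the infected proportion in the SIR model
# (dI_SIR is a hypothetical helper name, not part of the code above)
dI_SIR <- function(S, I, beta, gamma) beta * I * S - gamma * I

beta <- 3/8
gamma <- 1/8
threshold <- gamma / beta  # vertical I-nullcline at S = gamma / beta

I <- 0.1  # an arbitrary positive proportion of infected people
left  <- dI_SIR(threshold - 0.1, I, beta, gamma)  # S < gamma / beta
on    <- dI_SIR(threshold, I, beta, gamma)        # S = gamma / beta
right <- dI_SIR(threshold + 0.1, I, beta, gamma)  # S > gamma / beta

c(left = left, on = on, right = right)
```

<p>The three values should be negative, zero (up to floating point error), and positive, respectively, matching the direction of the arrows on either side of the line.</p>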
<p>In summary, we have seen how the SIR model allows for outbreaks whenever the rate of infection is higher than the rate of recovery, that is, when $R_0 = \beta / \gamma > 1$. If this occurs, then we have a growing proportion of infected people while $S > \gamma / \beta$. As illustrated by the vector field, the proportion of susceptible people $S$ decreases over time. At some point, therefore, we have that $S < \gamma / \beta$, resulting in a decrease in the proportion of infected people until finally $I = 0$. Observe that, in the SIR model, infections always die out. In the next section, we extend the SIR model to allow for diseases to become established in the population.</p>
<h2 id="the-sirs-model">The SIRS Model</h2>
<p>The SIR model assumes that people, once infected, are immune to the disease forever, and so any disease occurs only once and then never comes back. More interesting dynamics occur when we allow for the reinfection of recovered people; we can then ask, for example, under what circumstances the disease becomes established in the population. The SIRS model extends the SIR model, allowing the recovered population to become susceptible again (hence the extra ‘S’). It assumes that the susceptible population increases in proportion to the recovered population such that:</p>
\[\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \\[0.50em]
\frac{dR}{dt} &= \gamma I - \mu R\enspace ,
\end{aligned}\]
<p>where, since we added $\mu R$ to the change in the proportion of susceptible people, we had to subtract $\mu R$ from the change in the proportion of recovered people. We again make the simplifying assumption that the overall population does not change, and so it suffices to study the following system:</p>
\[\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace ,
\end{aligned}\]
<p>since $R(t) = 1 - S(t) - I(t)$. We adjust our implementation of Euler’s method:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="w">
</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">dimnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Time'</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">R</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">R</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dS</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">delta_t</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s1">'Set1'</span><span class="p">)</span><span class="w">
</span><span class="n">time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="w">
</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Subpopulations'</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Days'</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.75</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w">
</span><span class="n">font.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">xaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">yaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="w">
</span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">0.95</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">),</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$, a fixed reinfection rate of $\mu = 1/8$, and an increasing rate of infection $\beta$. The initial condition is $S_0 = 0.95$ and $I_0 = 0.05$, with no recovered people at the start.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>As for the SIR model, we again find that no outbreak occurs for $R_0 = \beta / \gamma < 1$, which is the case for the left panel. Most interestingly, however, we find that the proportion of infected people <em>does not</em>, in contrast to the SIR model, decrease to zero for the other panels. Instead, the disease becomes established in the population when $R_0 > 1$, and the middle and the right panel show different fixed points.</p>
<p>How do things change when we vary the reinfection rate $\mu$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller reinfection rate $\mu$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>We again find no outbreak in the left panel, and outbreaks of increasing severity in the middle and right panels. Both these outbreaks are less severe compared to the outbreaks in the previous figure, as we would expect given a decrease in the reinfection rate. Similarly, the system seems to stabilize at different fixed points. In the next section, we provide a more formal analysis of the fixed points and their stability.</p>
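<p>The effect of the reinfection rate on the long-run proportion of infected people can be sketched with a compact, self-contained Euler loop. The helper <code>simulate_SIRS</code> is hypothetical and separate from <code>solve_SIRS</code> above, and the values $\beta = 3/8$, $\mu = 1/50$, and the number of steps are assumptions chosen for illustration; they are not necessarily those used in the figures.</p>

```r
# Compact Euler integration of the SIRS model that returns only the final state
# (simulate_SIRS is a hypothetical helper, separate from solve_SIRS above)
simulate_SIRS <- function(S0, I0, beta, gamma, mu, delta_t = 0.01, n_steps = 40000) {
  S <- S0
  I <- I0
  for (step in seq_len(n_steps)) {
    R <- 1 - S - I
    dS <- -beta * I * S + mu * R
    dI <- beta * I * S - gamma * I
    S <- S + delta_t * dS
    I <- I + delta_t * dI
  }
  c(S = S, I = I, R = 1 - S - I)
}

# Same infection and recovery rates, two illustrative reinfection rates
high_mu <- simulate_SIRS(0.95, 0.05, beta = 3/8, gamma = 1/8, mu = 1/8)
low_mu  <- simulate_SIRS(0.95, 0.05, beta = 3/8, gamma = 1/8, mu = 1/50)

round(rbind(high_mu, low_mu), 3)
```

<p>Under these assumptions, the proportion of infected people after 400 days should be positive in both cases, and smaller for the smaller reinfection rate.</p>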
<h2 id="analyzing-fixed-points-1">Analyzing Fixed Points</h2>
<p>To find the fixed points of the SIRS model, we again seek solutions for which:</p>
\[\begin{aligned}
0 &= - \beta IS + \mu (1 - S - I) \\[0.50em]
0 &= \beta IS - \gamma I \enspace ,
\end{aligned}\]
<p>where we have substituted $R = 1 - S - I$ and from which it follows that also $\dot{R} = 0$ since we assume that the overall population does not change. We immediately see that, in contrast to the SIR model, $I = 0$ cannot be a fixed point for <em>any</em> $S$ because of the added term which depends on $\mu$. Instead, it is a fixed point only for $S = 1$. To get the other fixed point, note that for $I \neq 0$ the last equation gives $S = \gamma / \beta$, which plugged into the first equation yields:</p>
\[\begin{aligned}
0 &= -I\gamma + \mu\left(1 - \frac{\gamma}{\beta} - I\right) \\[0.50em]
I\gamma &= \mu\left(1 - \frac{\gamma}{\beta}\right) - \mu I \\[0.50em]
I(\gamma + \mu) &= \mu\left(1 - \frac{\gamma}{\beta}\right) \\[0.50em]
I &= \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} \enspace .
\end{aligned}\]
<p>Therefore, the fixed points are:</p>
\[\begin{aligned}
(S^{\star}, I^{\star}) &= (1, 0) \\[0.50em]
(S^{\star}, I^{\star}) &= \left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \enspace .
\end{aligned}\]
<p>Note that the second fixed point does not exist when $\gamma / \beta > 1$, since the proportion of infected people cannot be negative. Another, more intuitive perspective on this is to write $\gamma / \beta > 1$ as $R_0 = \beta / \gamma < 1$. This allows us to see that the second fixed point, which would have a non-zero proportion of infected people in the population, does not exist when $R_0 < 1$, as then no outbreak occurs. We will come back to this in a moment.</p>
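<p>As a quick check, the fixed points can be computed and plugged back into the right-hand side of the system. This is a minimal sketch; the helpers <code>sirs_fixed_points</code> and <code>sirs_rhs</code> and the element names are hypothetical, and the parameter values $\beta = 3/8$, $\gamma = 1/8$, $\mu = 1/8$ are used for illustration.</p>

```r
# Fixed points of the reduced SIRS system
# (sirs_fixed_points is a hypothetical helper name)
sirs_fixed_points <- function(beta, gamma, mu) {
  no_infection <- c(S = 1, I = 0)
  if (beta <= gamma) {
    # gamma / beta >= 1: no separate fixed point with I > 0 exists
    return(list(no_infection = no_infection, with_infection = NULL))
  }
  S_star <- gamma / beta
  I_star <- mu * (1 - S_star) / (gamma + mu)
  list(no_infection = no_infection, with_infection = c(S = S_star, I = I_star))
}

# Right-hand side of the reduced system, with R = 1 - S - I substituted
sirs_rhs <- function(S, I, beta, gamma, mu) {
  c(dS = -beta * I * S + mu * (1 - S - I), dI = beta * I * S - gamma * I)
}

fp <- sirs_fixed_points(beta = 3/8, gamma = 1/8, mu = 1/8)
fp$with_infection

# Both derivatives should vanish (up to floating point error) at a fixed point
residual <- sirs_rhs(
  fp$with_infection[["S"]], fp$with_infection[["I"]],
  beta = 3/8, gamma = 1/8, mu = 1/8
)
residual
```

<p>For $\beta \leq \gamma$ the function returns only the fixed point without infected people, in line with the condition $R_0 < 1$ discussed above.</p>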
<p>To assess the stability of the fixed points, we derive the Jacobian matrix for the SIRS model:</p>
\[\begin{aligned}
J &= \begin{pmatrix}
\frac{\partial}{\partial S} \left(-\beta I S + \mu(1 - S - I)\right) & \frac{\partial }{\partial I} \left(-\beta I S + \mu(1 - S - I)\right) \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
&=
\begin{pmatrix}
-\beta I - \mu & -\beta S - \mu \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned}\]
<p>For the fixed point $(S^{\star}, I^{\star}) = (1, 0)$ we have:</p>
\[J_{(1, 0)} = \begin{pmatrix}
- \mu & -\beta - \mu \\
0 & \beta - \gamma
\end{pmatrix} \enspace ,\]
<p>which is again upper-triangular and therefore has eigenvalues $\lambda_1 = -\mu$ and $\lambda_2 = \beta - \gamma$. This means it is unstable whenever $\beta > \gamma$ since then $\lambda_2 > 0$, and any infected individual spreads the disease. The Jacobian at the second fixed point is:</p>
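<p>The eigenvalues at $(1, 0)$ can be confirmed numerically with base R's <code>eigen</code>. The sketch below uses a hypothetical helper <code>sirs_jacobian</code> that encodes the general Jacobian derived above; the value $\beta = 1/16$ is an assumption chosen only to have a case with $\beta < \gamma$.</p>

```r
# Jacobian of the reduced SIRS system at a point (S, I)
# (sirs_jacobian is a hypothetical helper name)
sirs_jacobian <- function(S, I, beta, gamma, mu) {
  matrix(
    c(-beta * I - mu, -beta * S - mu,
       beta * I,       beta * S - gamma),
    nrow = 2, byrow = TRUE
  )
}

gamma <- 1/8
mu <- 1/8

# Eigenvalues at (1, 0) for a rate of infection below and above gamma
ev_low  <- eigen(sirs_jacobian(S = 1, I = 0, beta = 1/16, gamma = gamma, mu = mu))$values
ev_high <- eigen(sirs_jacobian(S = 1, I = 0, beta = 3/8, gamma = gamma, mu = mu))$values

ev_low   # both eigenvalues negative: stable
ev_high  # one eigenvalue positive: unstable
```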
\[J_{\left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right)} = \begin{pmatrix}
-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu & -\gamma - \mu \\
\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} & 0
\end{pmatrix} \enspace ,\]
<p>which looks more daunting, although the lower-right entry vanishes because $\beta S^{\star} - \gamma = \gamma - \gamma = 0$. We know from the previous blog post that to classify the stability of the fixed point, it suffices to look at the trace $\tau$ and determinant $\Delta$ of the Jacobian, which are given by</p>
\[\begin{aligned}
\tau &= -\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu \\[0.50em]
\Delta &= \left(-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu\right) \cdot 0 - \left(- \gamma - \mu\right)\left(\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \\[0.50em]
&= \beta\mu\left(1 - \frac{\gamma}{\beta}\right) = \mu\left(\beta - \gamma\right) \enspace .
\end{aligned}\]
<p>The trace can be written as $\tau = \lambda_1 + \lambda_2$ and the determinant can be written as $\Delta = \lambda_1 \lambda_2$, as shown in a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>. Recall that the second fixed point only exists when $\beta > \gamma$. In that case, we have that $\tau < 0$ because both terms above are negative, and $\Delta > 0$ because $\mu(\beta - \gamma)$ is positive. This constrains $\lambda_1$ and $\lambda_2$ to have negative real parts, and thus the fixed point is stable.</p>
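<p>The trace-determinant argument can be checked numerically for a concrete case. The following sketch evaluates the general Jacobian of the SIRS model at the second fixed point for $\beta = 3/8$, $\gamma = 1/8$, and $\mu = 1/8$; the variable names are illustrative. For these particular values the eigenvalues are complex, so the code inspects their real parts.</p>

```r
beta <- 3/8
gamma <- 1/8
mu <- 1/8

# Second fixed point of the SIRS model
S_star <- gamma / beta
I_star <- mu * (1 - gamma / beta) / (gamma + mu)

# General Jacobian of the reduced system, evaluated at the second fixed point
J <- matrix(
  c(-beta * I_star - mu, -beta * S_star - mu,
     beta * I_star,       beta * S_star - gamma),
  nrow = 2, byrow = TRUE
)

trace_J <- sum(diag(J))
det_J <- det(J)
lambda <- eigen(J)$values

trace_J     # equals the sum of the eigenvalues
det_J       # equals the product of the eigenvalues
Re(lambda)  # real parts; all negative for a stable fixed point
```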
<h2 id="vector-fields-and-nullclines">Vector Fields and Nullclines</h2>
<p>As previously done for the SIR model, we can again visualize the directions in which the system changes at any point using a vector field.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_vectorfield_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">))</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">SI</span><span class="p">[</span><span class="n">apply</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="c1"># S + I <= 1 must hold</span><span class="w">
</span><span class="n">dSI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">dS</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below visualizes the vector field for the SIRS model, several trajectories, and the nullclines for $\gamma = 1/8$ and $\mu = 1/8$ for $\beta = 1/8$ (left panel) and $\beta = 3/8$ (right panel). The left panel shows that there exists only one stable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$ to which all trajectories converge.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>The right panel, on the other hand, shows <em>two</em> fixed points: one unstable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$, which we only reach when $I_0 = 0$, and a stable one at</p>
\[(S^{\star}, I^{\star}) = \left(\frac{1/8}{3/8}, \frac{1/8\left(1 - \frac{1/8}{3/8}\right)}{1/8 + 1/8}\right) = (1/3, 1/3) \enspace .\]
<p>In contrast to the SIR model, therefore, there exists a stable fixed point constituting a population which includes infected people, and so the disease is not eradicated but stays in the population.</p>
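<p>As a sanity check on the stable fixed point, here is a minimal sketch that evaluates the formula and then runs Euler's method until the trajectory settles. The function names (<code>sirs_dS</code>, <code>sirs_dI</code>, <code>sirs_fixed_point</code>, <code>simulate_sirs</code>), the step size, the number of steps, and the initial condition are my own illustrative choices, not the code used for the figures.</p>

```r
# SIRS model in proportions, with R = 1 - S - I
sirs_dS <- function(S, I, beta, mu) -beta * I * S + mu * (1 - S - I)
sirs_dI <- function(S, I, beta, gamma) beta * I * S - gamma * I

# Second (endemic) fixed point; only meaningful when gamma / beta <= 1
sirs_fixed_point <- function(beta, gamma, mu) {
  S_star <- gamma / beta
  I_star <- mu * (1 - S_star) / (gamma + mu)
  c(S = S_star, I = I_star)
}

# Euler's method; delta_t and n_steps are illustrative assumptions
simulate_sirs <- function(S0, I0, beta, gamma, mu, delta_t = 0.1, n_steps = 20000) {
  S <- S0
  I <- I0
  for (n in seq_len(n_steps)) {
    S_new <- S + delta_t * sirs_dS(S, I, beta, mu)
    I_new <- I + delta_t * sirs_dI(S, I, beta, gamma)
    S <- S_new
    I <- I_new
  }
  c(S = S, I = I)
}

# Right-panel parameter values from the text
fp <- sirs_fixed_point(beta = 3/8, gamma = 1/8, mu = 1/8)
end_state <- simulate_sirs(S0 = 0.95, I0 = 0.05, beta = 3/8, gamma = 1/8, mu = 1/8)

print(fp)
print(end_state)
```

<p>Both vectors should agree closely, illustrating that a trajectory started with a few infected people ends up at the fixed point with a non-zero proportion of infected people.</p>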
<p>The dashed lines give the nullclines. The $I$-nullcline gives the set of points where $\dot{I} = 0$, which are — as in the SIR model above — given by $I = 0$ and $S = \gamma / \beta$. The $S$-nullcline is given by:</p>
\[\begin{aligned}
0 &= - \beta I S + \mu(1 - S - I) \\[0.50em]
\beta I S &= \mu(1 - S) - \mu I \\[0.50em]
I &= \frac{\mu(1 - S)}{\beta S + \mu} \enspace ,
\end{aligned}\]
<p>which is a nonlinear function in $S$. The nullclines help us again in “organizing” the vector field. This can be seen best in the right panel above. In particular, and similar to the SIR model, we will again have a decrease in the proportion of infected people to the left of the line given by $S = \gamma / \beta$, that is, when $S < \gamma / \beta$, and an increase to the right of the line, that is, when $S > \gamma / \beta$. Similarly, the proportion of susceptible people increases when the system is “below” the $S$-nullcline, while it decreases when the system is “above” the $S$-nullcline.</p>
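<p>The signs on either side of the nullclines can be checked numerically. The sketch below uses the right-panel parameter values; the helper names (<code>s_nullcline</code>, <code>S_dot</code>, <code>I_dot</code>) and the test points are hypothetical and chosen only for illustration.</p>

```r
beta <- 3/8
gamma <- 1/8
mu <- 1/8

# S-nullcline: the value of I at which dS/dt = 0 for a given S
s_nullcline <- function(S) mu * (1 - S) / (beta * S + mu)

# Rates of change in the SIRS model
S_dot <- function(S, I) -beta * I * S + mu * (1 - S - I)
I_dot <- function(S, I) beta * I * S - gamma * I

S <- 0.5
I_on <- s_nullcline(S)

S_dot(S, I_on)          # zero (up to rounding) on the S-nullcline
S_dot(S, I_on - 0.05)   # positive: S increases below the S-nullcline
S_dot(S, I_on + 0.05)   # negative: S decreases above the S-nullcline

I_dot(0.2, 0.1)         # negative: S < gamma / beta, so I decreases
I_dot(0.5, 0.1)         # positive: S > gamma / beta, so I increases
```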
<h2 id="bifurcations">Bifurcations</h2>
<p>In the vector fields above we have seen that the system can go from having only one fixed point to having two fixed points. Whenever a fixed point is destroyed or created or changes its stability as an internal parameter is varied (here the ratio $\gamma / \beta$), we speak of a <em>bifurcation</em>.</p>
<p>As pointed out above, the second equilibrium point only exists for $\gamma / \beta \leq 1$. As long as $\gamma / \beta < 1$, we have two distinct fixed points. At $\gamma / \beta = 1$, the second fixed point becomes:</p>
\[\begin{aligned}
(S^{\star}, I^{\star}) &= \left(1, \frac{\mu\left(1 - 1\right)}{\gamma + \mu}\right) = (1, 0) \enspace ,
\end{aligned}\]
<p>which equals the first fixed point. Thus, at $\gamma / \beta = 1$, the two fixed points merge into one; this is the bifurcation point. This makes sense: if $\gamma / \beta < 1$, we have that $\beta / \gamma > 1$, and so an outbreak occurs, which establishes the disease in the population since we allow for reinfections.</p>
<p>We can visualize this change in fixed points in a so-called <em>bifurcation diagram</em>. A bifurcation diagram shows how the fixed points and their stability change as we vary an internal parameter. Since we deal with two-dimensional fixed points, we split the bifurcation diagram into two: the left panel shows how the $I^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$, and the right panel shows how the $S^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-15-1.png" title="plot of chunk unnamed-chunk-15" alt="plot of chunk unnamed-chunk-15" style="display: block; margin: auto;" /></p>
<p>The left panel shows that as long as $\gamma / \beta < 1$, which implies that $\beta / \gamma > 1$, we have two fixed points, where the stable fixed point is the one with a non-zero proportion of infected people; that is, the disease becomes established. These fixed points are on the diagonal line, indicated as black dots. Interestingly, this shows that the proportion of infected people can never be stable at a value larger than $1/2$. There also exist unstable fixed points for which $I^{\star} = 0$. These fixed points are unstable because if there exists even a single infected person, she will spread the disease, resulting in more infected people. At the point where $\beta = \gamma$, the two fixed points merge: the disease can no longer be established in the population, and the proportion of infected people always goes to zero.</p>
<p>Similarly, the right panel shows how the fixed points $S^{\star}$ change as a function of $\gamma / \beta$. Since the infection spreads for $\beta > \gamma$, the fixed point $S^{\star} = 1$ is unstable, as the proportion of susceptible people must decrease since they become infected. For outbreaks that become increasingly mild as $\gamma / \beta \rightarrow 1$, the stable proportion of susceptible people increases, reaching $S^{\star} = 1$ when at last $\gamma = \beta$.</p>
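<p>The stability pattern in the bifurcation diagram can also be checked with linear stability analysis in two dimensions: a fixed point is stable if all eigenvalues of the Jacobian evaluated there have negative real part. The following is a rough sketch rather than the code behind the figure; <code>sirs_jacobian</code> and <code>is_stable</code> are hypothetical helpers, and the values of $\beta$ are picked to land on either side of the bifurcation point.</p>

```r
# Jacobian of (dS/dt, dI/dt) for the SIRS model with R = 1 - S - I
sirs_jacobian <- function(S, I, beta, gamma, mu) {
  matrix(c(-beta * I - mu, -beta * S - mu,
            beta * I,       beta * S - gamma),
         nrow = 2, byrow = TRUE)
}

# Stable if every eigenvalue has a negative real part
is_stable <- function(S, I, beta, gamma, mu) {
  all(Re(eigen(sirs_jacobian(S, I, beta, gamma, mu))$values) < 0)
}

gamma <- 1/8
mu <- 1/8

for (beta in c(3/8, 2/8, 1/16)) {
  S_star <- gamma / beta
  I_star <- mu * (1 - S_star) / (gamma + mu)
  cat(sprintf("gamma/beta = %.2f | (1, 0) stable: %s",
              S_star, is_stable(1, 0, beta, gamma, mu)))
  if (S_star < 1) {
    cat(sprintf(" | I* = %.3f, stable: %s",
                I_star, is_stable(S_star, I_star, beta, gamma, mu)))
  }
  cat("\n")
}
```

<p>With $\mu = \gamma$, as in the vector fields above, the second fixed point simplifies to $I^{\star} = (1 - \gamma / \beta) / 2$, which is consistent with the stable proportion of infected people staying below $1/2$.</p>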
<p>In summary, we have seen how the SIRS model extends the SIR model by allowing reinfections. This resulted in the possibility of more interesting fixed points, which included a non-zero proportion of infected people. In the SIRS model, then, a disease can become established in the population. In contrast to the SIR model, we have also seen that the SIRS model allows for bifurcations, going from two fixed points in times of outbreaks ($\beta > \gamma$) to one fixed point in times of no outbreaks ($\beta < \gamma$).</p>
<!-- model allows for outbreaks whenever the rate of infection is higher than the rate of recovery, $R_0 > \beta / \gamma$. If this occurs, then we have a growing proportion of infected people when $S > \gamma / \beta$. As illustratd by the vector field, the proportion of susceptible people $S$ decreases over time. At some point, therefore, we have that $S < \gamma / \beta$, resulting in a decrease in the proportion of infected people until finally $I = 0$. Observe that, in the SIR model, infections always die out. In the next section, we extend the SIR model to allow for diseases to become established in the population. -->
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen that nonlinear differential equations are a powerful tool to model real-world phenomena. They allow us to model vastly more complicated behaviour than is possible with linear differential equations, yet they rarely admit closed-form solutions. Luckily, the time-evolution of a system can be straightforwardly computed with basic numerical techniques such as Euler’s method. Using the simple logistic equation, we have seen how to analyze the stability of fixed points: simply pretend the system is linear close to a fixed point.</p>
<p>The logistic equation has only one state variable: the size of the population. More interesting dynamics occur when variables interact, and we have seen how the simple SIR model can help us understand the spread of infectious disease. Although the model has only two parameters, it shows that an outbreak occurs only when $R_0 = \beta / \gamma > 1$. Moreover, the stable fixed points always included $I = 0$, implying that the disease always gets eradicated. This is not true for all diseases because recovered people might become reinfected. The SIRS model amends this by introducing a parameter $\mu$ that quantifies how quickly recovered people become susceptible again. As expected, this led to stable states in which the disease becomes established in the population.</p>
<p>On our journey to understand these systems, we have seen how to quantify the stability of a fixed point using linear stability analysis, how to visualize the dynamics of a system using vector fields, how nullclines give structure to such vector fields, and how bifurcations can drastically change the dynamics of a system.</p>
<p>The SIR and the SIRS models discussed here are without a doubt crude approximations of the real dynamics of the spread of infectious diseases. There exist <a href="https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#Elaborations_on_the_basic_SIR_model">several ways to extend them</a>. One way to do so, for example, is to add an <em>exposed</em> compartment of people who are infected but not yet infectious; see <a href="https://gabgoh.github.io/COVID/index.html">here</a> for a visualization of an elaborated version of this model in the context of SARS-CoV-2. These basic compartment models assume spatial homogeneity, which is a substantial simplification. There are various ways to include spatial structure (e.g., Watts, 2005; Riley, 2007), but that is for another blog post.</p>
<hr />
<p>I would like to thank <a href="https://twitter.com/theBonferroni">Adam Finnemann</a>, <a href="https://twitter.com/AnToniPichler">Anton Pichler</a>, and <a href="https://twitter.com/Oisin_Ryan_">Oísin Ryan</a> for very helpful comments on this blog post.</p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Strogatz, S. H. (<a href="http://www.stevenstrogatz.com/books/nonlinear-dynamics-and-chaos-with-applications-to-physics-biology-chemistry-and-engineering">2015</a>). Nonlinear Dynamics and Chaos: With applications to Physics, Biology, Chemistry, and Engineering. Colorado, US: Westview Press.</li>
<li>Hirsch, M. W., Smale, S., & Devaney, R. L. (<a href="https://books.google.nl/books?hl=en&lr=&id=rly1AAmAXh8C&oi=fnd&pg=PP1&dq=differential+equations+hirsch+smale&ots=pbe8hf2vQS&sig=XAweKN9n_n00ph33V7heYNjtjbI#v=onepage&q=differential%20equations%20hirsch%20smale&f=false">2013</a>). Differential equations, dynamical systems, and an introduction to chaos. Boston, US: Academic Press.</li>
<li>Riley, S. (<a href="https://science.sciencemag.org/content/316/5829/1298?casa_token=6o-2ffWgMtoAAAAA:N5r-4nxfob2OhYutIaFKh4n5kxTeTMNkiAxLdipRtmFrlIhkLL69NOYUBXdYcUPG_pT8LCiGXFLpY4DI">2007</a>). Large-scale spatial-transmission models of infectious disease. <em>Science, 316</em>(5829), 1298-1301.</li>
<li>Watts, D. J., Muhamad, R., Medina, D. C., & Dodds, P. S. (<a href="https://www.pnas.org/content/102/32/11157">2005</a>). Multiscale, resurgent epidemics in a hierarchical metapopulation model. <em>Proceedings of the National Academy of Sciences, 102</em>(32), 11157-11162.</li>
</ul>
Reviewing one year of blogging2019-12-27T12:00:00+00:002019-12-27T12:00:00+00:00https://fabiandablander.com/r/Reviewing-2019<p>Writing blog posts has been one of the most rewarding experiences for me over the last year. Some posts turned out quite long, others I could keep more concise. Irrespective of length, however, I have managed to publish one post every month, and you can infer the occasional frenzy that ensued from the distribution of the dates the posts appeared on: nine of them saw the light within the last three days of a month.</p>
<p>Some births were easier than others, yet every post evokes distinct memories: of perusing history books in the library and the Saturday sun; of writing down Gaussian integrals in overcrowded trains; of solving differential equations while singing; of hunting down typos before hurrying to parties. So to end this very productive year of blogging, below I provide a teaser of each previous post, summarizing one or two key takeaways. Let’s go!</p>
<!-- I started this blog last January, aiming to publish one blog post per month. It has been an extremely rewarding experience: every post allowed me to dive into a topic in a playful manner, and I was anew excited every month, wondering what I would write about. Some posts turned out quite lengthy, others were more concise. Far be it for me to suppose you have read every one of them, so to end this very productive year of blogging, this post provides a teaser of each previous post, summarizing one or two key take-aways. I hope you enjoy the show! -->
<!-- Blogging is great. In this post, I review what has happened since the inception of this blog in January. I will briefly summarize each blog post, and stress what I think are some key ideas. I will do so in reverse chronological order, starting with the most recent post. -->
<h1 id="an-introduction-to-causal-inference">An introduction to Causal inference</h1>
<p>Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. It dethrones randomized controlled trials as the only tool to license causal statements, describing the conditions under which this feat is possible even in observational data.</p>
<p>One key takeaway is to think about causal inference in a hierarchy. Association is at the most basic level, merely allowing us to say that two variables are somehow related. Moving upwards, the <em>do</em>-operator allows us to model interventions, answering questions such as “what would happen if we force every patient to take the drug”? Directed Acyclic Graphs (DAGs), as visualized in the figure below, allow us to visualize associations and causal relations.</p>
<center>
<img src="../assets/img/Seeing-vs-Doing-II.png" align="center" style="padding: 00px 00px 00px 00px;" width="750" height="500" />
</center>
<p>On the third and final level we find counterfactual statements. These follow from so-called <em>Structural Causal Models</em> — the building block of this approach to causal inference. Counterfactuals allow us to answer questions such as “would the patient have recovered had she been given the drug, even though she has not received the drug and did not recover”? Needless to say, this requires strong assumptions; yet if we want to endow machines with human-level reasoning or formalize concepts such as fairness, we need to make such strong assumptions.</p>
<p>One key practical takeaway from this blog post is the definition of confounding: an effect is confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$. This means that blindly entering all variables into a regression to “control” for them is misguided; instead, one should carefully think about the underlying causal relations between variables so as to not induce spurious associations. You can read the full blog post <a href="https://fabiandablander.com/r/Causal-Inference.html">here</a>.</p>
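<p>To make the gap between seeing and doing concrete, here is a small simulation sketch. The linear structural causal model, its coefficients, and the variable names are assumptions I made up for illustration: $Z$ causes both $X$ and $Y$, and the true causal effect of $X$ on $Y$ is $1$.</p>

```r
set.seed(1)
n <- 1e5

# Hypothetical SCM: Z -> X, Z -> Y, and X -> Y with causal effect 1
z <- rnorm(n)
x <- z + rnorm(n)
y <- 1 * x + 2 * z + rnorm(n)

# Seeing: the observational association between X and Y
slope_seeing <- unname(coef(lm(y ~ x))["x"])

# Doing: set X by intervention, which removes the arrow Z -> X
x_do <- rnorm(n)
y_do <- 1 * x_do + 2 * z + rnorm(n)
slope_doing <- unname(coef(lm(y_do ~ x_do))["x_do"])

# Adjusting for the common cause Z recovers the causal effect
slope_adjusted <- unname(coef(lm(y ~ x + z))["x"])

round(c(seeing = slope_seeing, doing = slope_doing, adjusted = slope_adjusted), 2)
```

<p>Adjusting works here only because $Z$ is a common cause of $X$ and $Y$. Had $Z$ been a common effect of the two, conditioning on it would have induced a spurious association instead, which is why the causal structure needs to be thought through first.</p>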
<h1 id="a-brief-primer-on-variational-inference">A brief primer on Variational Inference</h1>
<p>Bayesian inference using Markov chain Monte Carlo can be notoriously slow. The key idea behind variational inference is to recast Bayesian inference as an optimization problem. In particular, we try to find a distribution $q^\star(\mathbf{z})$ that best approximates the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$ in terms of the Kullback-Leibler divergence:</p>
\[q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace .\]
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In this blog post, I explain how a particular form of variational inference — <em>coordinate ascent mean-field variational inference</em> — leads to fast computations. Specifically, I walk you through deriving the variational inference scheme for a simple linear regression example. One key takeaway from this post is that Bayesians can use optimization to speed up computation. However, variational inference requires problem-specific, often tedious calculations. Black-box variational inference schemes can alleviate this issue, but Stan’s implementation — <em>automatic differentiation variational inference</em> — seems to work poorly, as detailed in the post (see also Ben Goodrich’s comment). You can read the full blog post <a href="https://fabiandablander.com/r/Variational-Inference.html">here</a>.</p>
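<p>The idea of turning inference into optimization can be illustrated with a toy problem that is much simpler than the regression example in the post. In the sketch below, which is not coordinate ascent mean-field variational inference, the "posterior" is assumed to be a Student-t distribution with 3 degrees of freedom, the variational family is $\mathcal{N}(0, \sigma^2)$, and the Kullback-Leibler divergence is approximated by a Riemann sum on a grid. All of these choices, including the name <code>kl_q_p</code>, are mine.</p>

```r
# Grid over which the KL divergence is approximated
dz <- 0.001
z <- seq(-40, 40, by = dz)

# Toy target: Student-t with 3 degrees of freedom (an assumption for illustration)
log_p <- dt(z, df = 3, log = TRUE)

# KL(q || p) for q = Normal(0, sigma^2), approximated by a Riemann sum
kl_q_p <- function(sigma) {
  log_q <- dnorm(z, mean = 0, sd = sigma, log = TRUE)
  sum(exp(log_q) * (log_q - log_p)) * dz
}

# Optimize over the single variational parameter sigma
fit <- optimize(kl_q_p, interval = c(0.2, 5))

fit$minimum     # the sigma of the best Gaussian approximation
fit$objective   # the remaining KL divergence
```

<p>The remaining divergence is not zero because no Gaussian matches the heavy tails of the target exactly; variational inference trades this approximation error for speed.</p>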
<h1 id="harry-potter-and-the-power-of-bayesian-constrained-inference">Harry Potter and the Power of Bayesian Constrained Inference</h1>
<p>Are you a Gryffindor, Slytherin, Hufflepuff, or Ravenclaw? In this blog post, I explain a <em>prior predictive</em> perspective on model selection by having Harry, Ron, and Hermione, three subjective Bayesians, engage in a small prediction contest. There are two key takeaways. First, the prior does not completely constrain a model’s predictions, as these are made by combining the prior with the likelihood. For example, even though Ron has a point prior on $\theta = 0.50$ in the figure below, his prediction is not that $y = 5$ always; instead, he predicts a distribution that is centered around $y = 5$. Similarly, while Hermione believes that $\theta > 0.50$, she puts probability mass on values $y < 5$.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>The second takeaway is computational. In particular, one can compute the Bayes factor of a <em>constrained</em> model ($\mathcal{M}_r$), in which $\theta$ is order-constrained (e.g., $\theta > 0.50$), against the <em>unconstrained</em> model ($\mathcal{M}_1$), in which the parameter $\theta$ is free to vary, as:</p>
\[\text{BF}_{r1} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace .\]
<p>In words, this Bayes factor is the ratio of the posterior probability that $\theta$ is in line with the restriction to the prior probability that $\theta$ is in line with the restriction. You can read the full blog post <a href="https://fabiandablander.com/r/Bayes-Potter.html">here</a>.</p>
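<p>For a binomial likelihood with a beta prior, this ratio only requires two calls to <code>pbeta</code>. The sketch below assumes a uniform $\text{Beta}(1, 1)$ prior under $\mathcal{M}_1$ and made-up data of $y = 7$ successes in $n = 10$ trials; neither is taken from the original post.</p>

```r
# Hypothetical data and prior (assumptions for illustration)
n <- 10
y <- 7
a <- 1
b <- 1

# Prior and posterior probability that theta lies in [0.50, 1] under M_1
prior_prob <- pbeta(0.5, a, b, lower.tail = FALSE)
post_prob <- pbeta(0.5, a + y, b + n - y, lower.tail = FALSE)

# Bayes factor in favour of the order-constrained model
bf_r1 <- post_prob / prior_prob
bf_r1
```

<p>Because the posterior probability cannot exceed one, this Bayes factor is bounded above by one over the prior probability, which is $2$ in this setup.</p>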
<h1 id="love-affairs-and-linear-differential-equations">Love affairs and linear differential equations</h1>
<blockquote>
When you can fall for chains of silver, you can fall for chains of gold <br />
You can fall for pretty strangers and the promises they hold <br />
You promised me everything, you promised me thick and thin, yeah <br />
Now you just say "Oh, Romeo, yeah, you know I used to have a scene with him"
</blockquote>
<p>Differential equations are the sine qua non of modeling how systems change. This blog post provides an introduction to <em>linear</em> differential equations, which admit closed-form solutions, and analyzes the stability of fixed points.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>The key takeaways are that the natural basis of analysis is the basis spanned by the eigenvectors, and that the stability of fixed points depends directly on the eigenvalues. A system with imaginary eigenvalues can exhibit oscillating behaviour, as shown in the figure above.</p>
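<p>Both takeaways fit in a few lines of R. The system matrix below is a hypothetical example with purely imaginary eigenvalues, and <code>solve_linear</code> is an illustrative helper that writes the solution in the eigenbasis; neither is lifted from the original post.</p>

```r
# Hypothetical linear system: dR/dt = J, dJ/dt = -R
A <- matrix(c( 0, 1,
              -1, 0), nrow = 2, byrow = TRUE)

# The eigenvalues are purely imaginary (+1i and -1i), so the system oscillates
ev <- eigen(A)$values
ev

# Solution x(t) = V exp(Lambda t) V^{-1} x0, computed in the eigenbasis
solve_linear <- function(A, x0, t) {
  E <- eigen(A)
  V <- E$vectors
  Re(V %*% diag(exp(E$values * t)) %*% solve(V) %*% x0)
}

# State after a quarter of a period, starting from (1, 0)
solve_linear(A, c(1, 0), pi / 2)
```

<p>Eigenvalues with negative real part would make the oscillations decay towards the fixed point, and eigenvalues with positive real part would make them grow.</p>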
<p>I think I rarely had more fun writing than when writing this blog post. Inspired by Strogatz (1988), it playfully introduces linear differential equations by classifying the types of relationships Romeo and Juliet might find themselves in. While writing it, I also listened to a lot of Dire Straits, Bob Dylan, Daft Punk, and others, whose lyrics decorate the post’s sections. You can read the full blog post <a href="https://fabiandablander.com/r/Linear-Love.html">here</a>.</p>
<h1 id="the-fibonacci-sequence-and-linear-algebra">The Fibonacci sequence and linear algebra</h1>
<p>1, 1, 2, 3, 5, 8, 13, 21, … The Fibonacci sequence might well be the most widely known mathematical sequence. In this blog post, I discuss how Leonardo Bonacci derived it as a solution to a puzzle about procreating rabbits, and how linear algebra can help us find a closed-form expression of the $n^{\text{th}}$ Fibonacci number.</p>
<div style="text-align:center;">
<img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" />
</div>
<p>The key insight is to realize that the $n^{\text{th}}$ Fibonacci number can be computed by repeatedly performing matrix multiplications. If one <em>diagonalizes</em> this matrix, changing basis to — again! — the eigenbasis, then the repeated application of this matrix can be expressed as a scalar power, yielding a closed-form expression of the $n^{\text{th}}$ Fibonacci number. That’s a mouthful; you can read the blog post which explains things much better <a href="https://fabiandablander.com/r/Fibonacci.html">here</a>.</p>
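<p>A compact sketch of this idea, assuming the standard matrix $\begin{pmatrix} 1 & 1 \\ 1 & 0 \end{pmatrix}$ whose $n^{\text{th}}$ power contains the $n^{\text{th}}$ Fibonacci number in its off-diagonal entries. The function names <code>fib_eigen</code> and <code>fib_closed_form</code> are hypothetical.</p>

```r
# Assumed step matrix: A^n = [[F_{n+1}, F_n], [F_n, F_{n-1}]]
A <- matrix(c(1, 1,
              1, 0), nrow = 2, byrow = TRUE)

# Diagonalize A, so that A^n = V Lambda^n V^{-1} only needs scalar powers
fib_eigen <- function(n) {
  E <- eigen(A)
  A_n <- E$vectors %*% diag(E$values^n) %*% solve(E$vectors)
  round(A_n[1, 2])
}

# The resulting closed-form expression in terms of the two eigenvalues
fib_closed_form <- function(n) {
  phi <- (1 + sqrt(5)) / 2
  psi <- (1 - sqrt(5)) / 2
  round((phi^n - psi^n) / sqrt(5))
}

sapply(1:10, fib_eigen)
sapply(1:10, fib_closed_form)
```

<p>Rounding is only there to remove floating-point error; for large $n$ the floating-point result eventually stops being exact.</p>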
<h1 id="spurious-correlations-and-random-walks">Spurious correlations and random walks</h1>
<p>I was at the Santa Fe Complex Systems Summer School — the experience of a lifetime — when Anton Pichler and Andrea Bacilieri, two economists, told me that two independent random walks can be correlated substantially. I was quite shocked, to be honest. This blog post investigates this issue, concluding that regressing one random walk onto another is <em>nonsensical</em>, that is, leads to an inconsistent parameter estimate.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>As the figure above shows, such spurious correlation also occurs for independent AR(1) processes with increasing autocorrelation $\phi$, even though the resulting estimate is consistent. The key takeaway is therefore to be careful when correlating time-series. You can read the full blog post <a href="https://fabiandablander.com/r/Spurious-Correlation.html">here</a>.</p>
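<p>A small simulation can make this tangible. The sketch below is an illustration under my own assumptions (standard normal innovations, series of length 100, 200 repetitions), not the code from the original post. It averages the absolute correlation between two independent AR(1) series for several values of $\phi$, where $\phi = 1$ corresponds to a random walk.</p>

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ar1(n, phi, rng):
    """Simulate x_t = phi * x_{t-1} + e_t with standard normal e_t."""
    x, series = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0, 1)
        series.append(x)
    return series


def mean_abs_correlation(phi, n=100, reps=200, seed=1):
    """Average |correlation| between two independent AR(1) series."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += abs(correlation(ar1(n, phi, rng), ar1(n, phi, rng)))
    return total / reps


for phi in (0.0, 0.5, 0.9, 1.0):
    print(phi, round(mean_abs_correlation(phi), 2))
```

<p>Although the two series never influence each other, the typical size of their sample correlation should grow as $\phi$ approaches one.</p>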
<h1 id="bayesian-modeling-using-stan-a-case-study">Bayesian modeling using Stan: A case study</h1>
<p>Model selection is a difficult problem. In Bayesian inference, we may distinguish between two approaches to model selection: a <em>(prior) predictive</em> perspective based on marginal likelihoods, and a <em>(posterior) predictive</em> perspective based on leave-one-out cross-validation.</p>
<p><img src="../assets/img/prediction-perspectives.png" align="center" style="padding: 10px 10px 10px 10px;" /></p>
<p>A prior predictive perspective, illustrated in the left panel of the figure above, evaluates models based on their predictions about the data actually observed. These predictions are made by combining likelihood and prior. In contrast, a posterior predictive perspective, illustrated in the right panel of the figure above, evaluates models based on their predictions about data that we have not observed. These predictions cannot be directly computed, but can be approximated by combining likelihood and posterior in a leave-one-out cross-validation scheme. The key takeaway of this blog post is to appreciate this distinction, noting that not all Bayesians agree on how to select among models.</p>
<p>The post illustrates these two perspectives with a case study: does the relation between practice and reaction time follow a power law or an exponential function? You can read the full blog post <a href="https://fabiandablander.com/r/Law-of-Practice.html">here</a>.</p>
<h1 id="two-perspectives-on-regularization">Two perspectives on regularization</h1>
<p>Regularization is the process of adding information to an estimation problem so as to avoid extreme estimates. This blog post explores regularization both from a Bayesian and from a classical perspective, using the simplest example possible: estimating the bias of a coin.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>The key takeaway is the observation that Bayesians have a natural tool for regularization at their disposal: the prior. In contrast to the left panel in the figure above, which shows a flat prior, the right panel illustrates that using a weakly informative prior that peaks at $\theta = 0.50$ shifts the resulting posterior distribution towards that value. In classical statistics, one usually uses penalized maximum likelihood approaches — think lasso and ridge regression — to achieve regularization. You can read the full blog post <a href="https://fabiandablander.com/r/Regularization.html">here</a>.</p>
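<p>To illustrate the shrinkage numerically, here is a minimal sketch that assumes a conjugate Beta$(a, b)$ prior on the coin’s bias $\theta$, so that the posterior after observing the flips is again a Beta distribution. The function name, the data (9 heads in 10 flips), and the Beta$(5, 5)$ prior are illustrative choices of mine and are not taken from the original post.</p>

```python
def posterior_mode(heads, flips, a=1.0, b=1.0):
    """Mode of the Beta(a + heads, b + flips - heads) posterior for a coin's bias.

    Assumes the posterior has a single interior mode, that is, both
    posterior parameters are larger than 1.
    """
    post_a = a + heads
    post_b = b + flips - heads
    return (post_a - 1) / (post_a + post_b - 2)


heads, flips = 9, 10
# Flat Beta(1, 1) prior: the mode equals the maximum likelihood estimate 9 / 10.
print(posterior_mode(heads, flips, a=1, b=1))
# Beta(5, 5) prior, peaked at 0.50: the mode is pulled towards 0.50.
print(posterior_mode(heads, flips, a=5, b=5))
```

<p>Under the flat prior the posterior mode coincides with the maximum likelihood estimate, while the peaked prior acts like a penalty that discourages extreme estimates.</p>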
<h1 id="variable-selection-using-gibbs-sampling">Variable selection using Gibbs sampling</h1>
<p>“Which variables are important?” is a key question in science and statistics. In this blog post, I focus on linear models and discuss a Bayesian solution to this problem using spike-and-slab priors and the Gibbs sampler, a computational method to sample from a joint distribution using only conditional distributions.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>Parameter estimation is almost always conditional on a specific model. One key takeaway from this blog post is that there is uncertainty associated with the model itself. The approach outlined in the post accounts for this uncertainty by using spike-and-slab priors, yielding posterior distributions not only for parameters but also for models. To incorporate this model uncertainty into parameter estimation, one can average across models; the figure above shows the <em>model-averaged</em> posterior distribution for six variables discussed in the post. You can read the full blog post <a href="https://fabiandablander.com/r/Spike-and-Slab.html">here</a>.</p>
<h1 id="two-properties-of-the-gaussian-distribution">Two properties of the Gaussian distribution</h1>
<p>The Gaussian distribution is special for a number of reasons. In this blog post, I focus on two such reasons, namely the fact that it is closed under marginalization and conditioning. This means that if you start out with a <em>p</em>-dimensional Gaussian distribution, and you either <em>marginalize over</em> or <em>condition on</em> one of its components, the resulting distribution will again be Gaussian.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>The figure above illustrates the difference between marginalization and conditioning in the two-dimensional case. The left panel shows a bivariate Gaussian distribution with a high correlation $\rho = 0.80$ (blue contour lines). Conditioning means incorporating information, and observing that $X_2 = 2$ shifts the distribution of $X_1$ towards this value (purple line). If we do not observe $X_2$, we can incorporate our uncertainty about its likely values by marginalizing it out. This results in a Gaussian distribution that is centered on zero (black line). The right panel shows that conditioning on $X_2 = 2$ does not change the distribution of $X_1$ in the case of no correlation $\rho = 0$. You can read the full blog post <a href="https://fabiandablander.com/statistics/Two-Properties.html">here</a>.</p>
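<p>The conditional distribution in this example can be computed directly. The sketch below uses the standard formulas for conditioning a bivariate Gaussian; the function name is illustrative, and the defaults (zero means and unit variances) are my assumption about the setup shown in the figure.</p>

```python
def condition_bivariate_gaussian(x2, rho, mu1=0.0, mu2=0.0, sd1=1.0, sd2=1.0):
    """Mean and variance of X1 given X2 = x2 for a bivariate Gaussian."""
    mean = mu1 + rho * (sd1 / sd2) * (x2 - mu2)
    variance = sd1 ** 2 * (1 - rho ** 2)
    return mean, variance


# High correlation: observing X2 = 2 shifts X1 towards 2 and reduces its variance.
print(condition_bivariate_gaussian(x2=2, rho=0.80))
# No correlation: the conditional distribution equals the marginal one.
print(condition_bivariate_gaussian(x2=2, rho=0.0))
```

<p>With $\rho = 0.80$ the conditional mean is $0.80 \times 2 = 1.6$ and the conditional variance is $1 - 0.80^2 = 0.36$; with $\rho = 0$ both stay at their marginal values of $0$ and $1$.</p>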
<h1 id="curve-fitting-and-the-gaussian-distribution">Curve fitting and the Gaussian distribution</h1>
<p>In this blog post, we take a look at the mother of all curve fitting problems — fitting a straight line to a number of points. The figure below shows that one point in the Euclidean plane is insufficient to define a line (left), two points constrain it perfectly (middle), and three is too much (right). In science we usually deal with more than two data points which are corrupted by noise. How do we fit a line to such noisy observations?</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto 0 auto auto;" /></p>
<p>The method of least squares provides an answer. In addition to an explanation of least squares, a key takeaway of this post is an understanding of the historical context in which least squares arose. Statistics is fascinating in part because of its rich history. On our journey through time we meet Legendre, Gauss, Laplace, and Galton. The latter describes the central limit theorem, one of the most stunning theorems in statistics, in beautifully poetic words:</p>
<blockquote>
<p>“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66)</p>
</blockquote>
<p>You can read the full blog post <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">here</a>.</p>
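<p>For readers who want to see the least squares solution for a straight line in action, here is a minimal sketch using the textbook closed-form expressions for slope and intercept. The function name and the example points are illustrative and not taken from the original post.</p>

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for the line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Two points constrain the line perfectly.
print(fit_line([0, 1], [1, 3]))
# Three points that do not lie on one line: least squares finds a compromise.
print(fit_line([0, 1, 2], [0, 1, 1]))
```

<p>With two points the fitted line passes through both of them; with three non-collinear points no line does, and least squares picks the one that minimizes the sum of squared vertical distances.</p>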
<p>I hope that you enjoyed reading some of these posts at least a quarter as much as I enjoyed writing them. I am committed to making 2020 a successful year of blogging, too. However, I will most likely decrease the output frequency by half, aiming to publish one post every two months. It is a truth universally acknowledged that a person in want of a PhD must be in possession of publications, and so I will have to shift my focus accordingly (at least a little bit). At the same time, I also want to further increase my involvement in the “data for the social good” scene. Life certainly is one complicated optimization problem. I wish you all the best for the new year!</p>
<hr />
<p><em>I would like to thank Don van den Bergh, Sophia Crüwell, Jonas Haslbeck, Oisín Ryan, Lea Jakob, Quentin Gronau, Nathan Evans, Andrea Bacilieri, and Anton Pichler for helpful comments on (some of) these blog posts.</em></p>Fabian DablanderAn introduction to Causal inference2019-11-30T12:00:00+00:002019-11-30T12:00:00+00:00https://fabiandablander.com/r/Causal-Inference<p><em>An extended version of this blog post is available from <a href="https://psyarxiv.com/b3fkw">here</a>.</em></p>
<p>Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewell Wright, Judea Pearl, and others.</p>
<p>We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the “ladder of causal inference”, from association (<em>seeing</em>) to intervention (<em>doing</em>) to counterfactuals (<em>imagining</em>). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the <em>do</em>-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let’s dive in!<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup></p>
<h1 id="correlation-and-causation">Correlation and Causation</h1>
<p>Messerli (2012) published a paper entitled “Chocolate Consumption, Cognitive Function, and Nobel Laureates” in <em>The New England Journal of Medicine</em> showing a strong positive relationship between chocolate consumption and the number of Nobel Laureates. I have found an even stronger relationship using updated data<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup>, as visualized in the figure below.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<!-- <center> -->
<!-- <img src="../assets/img/Chocolate.png" align="center" style="padding: 10px 10px 10px 10px;" width="450" height="150"/> -->
<!-- </center> -->
<!-- Similarly, this great website tells us that US spending on science, space, and technology correlates strongly with suicides by hanging, strangulation, and suffocation: -->
<!-- <center> -->
<!-- <img src="../assets/img/US-Spending.png" align="center" style="padding: 10px 10px 10px 10px;"/> -->
<!-- </center> -->
<p>Now, except for <a href="https://www.confectionerynews.com/Article/2012/10/11/Chocolate-creates-Nobel-prize-winners-says-study">people in the chocolate business</a>, it would be quite a stretch to suggest that increasing chocolate consumption would increase the number of Nobel Laureates. Correlation does not imply causation because it does not constrain the possible causal relations enough. Hans Reichenbach (1956) formulated the <em>common cause principle</em> which speaks to this fact:</p>
<blockquote>
<p>If two random variables $X$ and $Y$ are statistically dependent ($X \not \perp Y$), then either (a) $X$ causes $Y$, (b) $Y$ causes $X$, or (c) there exists a third variable $Z$ that causes both $X$ and $Y$. Further, $X$ and $Y$ become independent given $Z$, i.e., $X \perp Y \mid Z$.</p>
</blockquote>
<p>A straightforward way, in principle, to resolve this ambiguity is to conduct an experiment: we could, for example, force the citizens of Austria to consume more chocolate, and study whether this increases the number of Nobel laureates in the following years. Such experiments are clearly infeasible, but even in less extreme settings it is frequently unethical, impractical, or impossible (think of smoking and lung cancer) to study an association experimentally.</p>
<p>Causal inference provides us with tools that license causal statements even in the absence of a true experiment. This comes with strong assumptions. In the next section, we discuss the “causal hierarchy”.</p>
<h1 id="the-causal-hierarchy">The Causal Hierarchy</h1>
<p>Pearl (2019a) introduces a causal hierarchy with three levels — association, intervention, and counterfactuals — as well as three prototypical actions corresponding to each level — <em>seeing</em>, <em>doing</em>, and <em>imagining</em>. In the remainder of this blog post, we will tackle each level in turn.</p>
<h1 id="seeing">Seeing</h1>
<p>Association is on the most basic level; it makes us see that two or more things are somehow related. Importantly, we need to distinguish between <em>marginal</em> associations and <em>conditional</em> associations. The latter are the key building block of causal inference. The figure below illustrates these two concepts.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>If we look at the whole, aggregated data on the left we see that the continuous variables $X$ and $Y$ are positively correlated: an increase in values for $X$ co-occurs with an increase in values for $Y$. This relation describes the <em>marginal</em> association of $X$ and $Y$ because we do not care whether $Z = 0$ or $Z = 1$. On the other hand, if we condition on the binary variable $Z$, we find that there is no relation: $X \perp Y \mid Z$. (For more on marginal and conditional associations in the case of Gaussian distributions, see <a href="https://fabiandablander.com/statistics/Two-Properties.html">this</a> blog post). In the next section, we discuss a powerful tool that allows us to visualize such dependencies.</p>
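<p>The pattern described above is easy to reproduce in a simulation. The following sketch is an illustration under my own assumptions (a binary $Z$ that shifts both $X$ and $Y$ by two units, standard normal noise); it is not the code that generated the figure.</p>

```python
import math
import random


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_fork(n=5000, seed=1):
    """Draw (x, y, z) where the binary z shifts both x and y, which are otherwise unrelated."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.randint(0, 1)
        x = 2 * z + rng.gauss(0, 1)
        y = 2 * z + rng.gauss(0, 1)
        data.append((x, y, z))
    return data


def correlation_where(data, keep):
    """Correlation of x and y among the rows whose z satisfies keep(z)."""
    rows = [(x, y) for x, y, z in data if keep(z)]
    return correlation([x for x, _ in rows], [y for _, y in rows])


data = simulate_fork()
print("marginal:   ", round(correlation_where(data, lambda z: True), 2))
print("given Z = 0:", round(correlation_where(data, lambda z: z == 0), 2))
print("given Z = 1:", round(correlation_where(data, lambda z: z == 1), 2))
```

<p>The marginal correlation should be clearly positive, while the correlations within each level of $Z$ should be close to zero.</p>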
<h2 id="directed-acyclic-graphs">Directed Acyclic Graphs</h2>
<p>We can visualize the statistical dependencies between the three variables using a graph. A graph is a mathematical object that consists of nodes and edges. In the case of <em>Directed Acyclic Graphs</em> (DAGs), these edges are directed and do not form cycles. We take our variables $(X, Y, Z)$ to be nodes in such a DAG and we draw (or omit) edges between these nodes so that the conditional (in)dependence structure in the data is reflected in the graph. For now, let’s focus on the relationship between the three variables. We have seen that $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. It turns out that we can draw <em>three</em> DAGs that encode this fact; these are the first three DAGs in the figure below. $X$ and $Y$ are dependent through $Z$ in these graphs, and conditioning on $Z$ <em>blocks</em> the path between $X$ and $Y$. We state this more formally shortly.</p>
<center>
<img src="../assets/img/Seeing-II.png" align="center" style="padding: 0px 0px 0px 0px;" width="750" height="375" />
</center>
<p>While it is natural to interpret the arrows causally, we do not do so here. For now, the arrows are merely tools that help us describe associations between variables.</p>
<p>The figure above also shows a fourth DAG, which encodes a different set of conditional (in)dependence relations between $X$, $Y$, and $Z$. The figure below illustrates this: looking at the aggregated data we do not find a relation between $X$ and $Y$ — they are <em>marginally independent</em> — but we do find one when looking at the disaggregated data — $X$ and $Y$ are <em>conditionally dependent</em> given $Z$.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>A real-world example might help build intuition: Looking at all people together, regardless of whether they are single or in a relationship, being attractive ($X$) and being intelligent ($Y$) are two independent traits. This is what we see in the left panel in the figure above. Let’s make the reasonable assumption that both being attractive and being intelligent are positively related with being in a relationship. What does this imply? First, it implies that, on average, single people are less attractive and less intelligent (see red data points). Second, and perhaps counter-intuitively, it implies that in the population of single people (and people in a relationship, respectively), being attractive and being intelligent are <em>negatively correlated</em>. After all, if the handsome person you met at the bar were also intelligent, then he would most likely be in a relationship!</p>
<p>In this example, visualized in the fourth DAG, $Z$ is commonly called a <em>collider</em>. Suppose we want to estimate the association between $X$ and $Y$ in the whole population. Conditioning on a collider (for example, by only analyzing data from people who are not in a relationship) while computing the association between $X$ and $Y$ will lead to a different estimate, and the induced bias is known as <em>collider bias</em>. It is a serious issue not only in dating, but also for example in medicine.</p>
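<p>A small simulation can make collider bias concrete. The sketch below is not part of the original analysis: the variable names, the sample size, and the rule that the half of the sample with the highest (noisy) sum of attractiveness and intelligence is in a relationship are assumptions chosen purely for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 10000
attractive  <- rnorm(n)   # X
intelligent <- rnorm(n)   # Y, generated independently of X

# Hypothetical collider Z: a higher sum of X and Y (plus noise)
# puts a person into the "in a relationship" half of the sample
score <- attractive + intelligent + rnorm(n)
relationship <- score > median(score)

cor_all    <- cor(attractive, intelligent)
cor_single <- cor(attractive[!relationship], intelligent[!relationship])
cor_taken  <- cor(attractive[relationship], intelligent[relationship])
round(c(all = cor_all, single = cor_single, taken = cor_taken), 2)</code></pre></figure>
<p>Under these assumptions, the correlation in the full sample is close to zero, while it is clearly negative within both the single group and the group in a relationship. Conditioning on the collider is what creates the association.</p>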
<p>The simple graphs shown above are the building blocks of more complicated graphs. In the next section, we describe a tool that can help us find (conditional) independencies between sets of variables.<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote">3</a></sup></p>
<!-- The conditional independence relations are easily glanced from these simple graphs. For *chains* and *forks*, $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. For *colliders*, we have that they are marginally independent, but conditionally dependent given $Z$ --- think of our dating example. For larger graphs, it is more difficult to see this. -->
<h2 id="d-separation">$d$-separation</h2>
<p>For large graphs, it is not obvious how to conclude that two nodes are (conditionally) independent. <em>d</em>-separation is a tool that allows us to check this algorithmically. We need to define some concepts:</p>
<ul>
<li>A <em>path</em> from $X$ to $Y$ is a sequence of nodes and edges such that the start and end nodes are $X$ and $Y$, respectively.</li>
<li>A conditioning set $\mathcal{L}$ is the set of nodes we condition on (it can be empty).</li>
<li>A non-collider along a path leaves that path open; conditioning on it blocks that path.</li>
<li>A collider along a path blocks that path. However, conditioning on a collider (or any of its descendants) unblocks that path.</li>
</ul>
<p>With these definitions out of the way, we call two nodes $X$ and $Y$ $d$-separated by $\mathcal{L}$ if conditioning on all members in $\mathcal{L}$ blocks all paths between the two nodes.</p>
<p>If this is your first encounter with $d$-separation, then this is a lot to wrap your head around. To get some practice, look at the graph on the left side. First, note that there are no <em>marginal</em> independencies; this means that without conditioning or blocking nodes, any two nodes are connected by a path. For example, there is a path going from $X$ to $Y$ through $Z$, and there is a path from $V$ to $U$ going through $Y$ and $W$.</p>
<div style="float: left;">
<center>
<img src="../assets/img/Large-DAG.png" align="center" style="margin-right: 10px;" width="350" height="125" />
</center>
</div>
<p>However, there are a number of <em>conditional</em> independencies. For example, $X$ and $Y$ are conditionally independent given $Z$. Why? There are two paths from $X$ to $Y$: one through $Z$ and one through $W$. However, since $W$ is a collider on the second path, that path is already blocked. The only unblocked path from $X$ to $Y$ is through $Z$, and conditioning on $Z$ therefore blocks all remaining open paths. Additionally conditioning on $W$ would unblock the path through $W$, and $X$ and $Y$ would again be associated.</p>
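<p>These statements can be checked numerically. The following is a minimal sketch, assuming the edges $X \rightarrow Z$, $X \rightarrow W$, $Z \rightarrow Y$, $V \rightarrow Y$, $Y \rightarrow W$, and $W \rightarrow U$ as read off the description of the graph, and assuming a linear Gaussian model with unit coefficients (both are illustrative choices). The helper <code>partial_cor</code> is hypothetical; in a linear Gaussian model, a partial correlation of zero corresponds to conditional independence.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 10000
# Assumed edges: X -> Z, X -> W, Z -> Y, V -> Y, Y -> W, W -> U
x <- rnorm(n)
v <- rnorm(n)
z <- x + rnorm(n)
y <- z + v + rnorm(n)
w <- x + y + rnorm(n)
u <- w + rnorm(n)

# Partial correlation of a and b given the columns of `given`,
# computed from the residuals of two linear regressions
partial_cor <- function(a, b, given) {
  cor(resid(lm(a ~ given)), resid(lm(b ~ given)))
}

res <- c(
  marginal = cor(x, y),
  given_z  = partial_cor(x, y, cbind(z)),
  given_zw = partial_cor(x, y, cbind(z, w))
)
round(res, 2)</code></pre></figure>
<p>With these assumed coefficients, $X$ and $Y$ are marginally correlated, the partial correlation given $Z$ is close to zero, and additionally conditioning on the collider $W$ makes it clearly negative.</p>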
<p>So far, we have implicitly assumed that conditional (in)dependencies in the graph correspond to conditional (in)dependencies between variables. We make this assumption explicit now. In particular, note that <em>d</em>-separation provides us with an independence model $\perp_{\mathcal{G}}$ defined on graphs. To connect this to our standard probabilistic independence model $\perp_{\mathcal{P}}$ defined on random variables, we assume the following <em>Markov property</em>:</p>
\[X \perp_{\mathcal{G}} Y \mid Z \implies X \perp_{\mathcal{P}} Y \mid Z \enspace .\]
<p>In words, we assume that if the nodes $X$ and $Y$ are <em>d</em>-separated by $Z$ in the graph $\mathcal{G}$, the corresponding random variables $X$ and $Y$ are conditionally independent given $Z$. This implies that all conditional independencies encoded in the graph hold in the data.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote">4</a></sup> Moreover, the statement above implies (and is implied by) the following factorization:</p>
\[p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,\]
<p>where $\text{pa}^{\mathcal{G}}(X_i)$ denotes the parents of the node $X_i$ in graph $\mathcal{G}$ (see Peters, Janzing, & Schölkopf, 2017, p. 101). A node is a parent of another node if it has an outgoing arrow to that node; for example, $X$ is a parent of $Z$ and $W$ in the graph above. The above factorization implies that a node $X$ is independent of its non-descendants given its parents.</p>
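<p>As a worked instance, here is how the factorization reads for the four three-node structures discussed earlier. This is simply the formula above written out for each graph:</p>
\[\begin{aligned}
X \rightarrow Z \rightarrow Y: \quad p(X, Y, Z) &= p(X) \, p(Z \mid X) \, p(Y \mid Z) \\[.5em]
X \leftarrow Z \leftarrow Y: \quad p(X, Y, Z) &= p(Y) \, p(Z \mid Y) \, p(X \mid Z) \\[.5em]
X \leftarrow Z \rightarrow Y: \quad p(X, Y, Z) &= p(Z) \, p(X \mid Z) \, p(Y \mid Z) \\[.5em]
X \rightarrow Z \leftarrow Y: \quad p(X, Y, Z) &= p(X) \, p(Y) \, p(Z \mid X, Y) \enspace .
\end{aligned}\]
<p>In the collider case, summing over $Z$ gives $p(X, Y) = p(X) \, p(Y)$, which is the marginal independence of $X$ and $Y$ we saw in the dating example.</p>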
<p><em>d</em>-separation is an extremely powerful tool. Until now, however, we have only looked at DAGs to visualize (conditional) independencies. In the next section, we go beyond <em>seeing</em> to <em>doing</em>.</p>
<h1 id="doing">Doing</h1>
<p>We do not merely want to see the world, but also change it. From this section on, we are willing to interpret DAGs causally. As Dawid (2009a) warns, this is a serious step. In merely describing conditional independencies — <em>seeing</em> — the arrows in the DAG played a somewhat minor role, being nothing but “incidental construction features supporting the $d$-separation semantics” (Dawid, 2009a, p. 66). In this section, we endow the DAG with a causal meaning and interpret the arrows as denoting <em>direct causal effects</em>.</p>
<p>What is a causal effect? Following Pearl and others, we take an <em>interventionist</em> position and say that a variable $X$ has a causal influence on $Y$ if changing $X$ leads to changes in $Y$. This position is a very useful one in practice, but not everybody agrees with it (e.g., Cartwright, 2007).</p>
<p>The figure below shows the observational DAGs from above (top row) as well as the manipulated DAGs (bottom row) where we have intervened on the variable $X$, that is, set the value of the random variable $X$ to a constant $x$. Note that setting the value of $X$ cuts all incoming causal arrows since its value is thereby determined only by the intervention, not by any other factors.</p>
<center>
<img src="../assets/img/Seeing-vs-Doing-II.png" align="center" style="padding: 00px 00px 00px 00px;" width="750" height="500" />
</center>
<p>As is easily verified with $d$-separation, the first three graphs in the top row encode the same conditional independence structure. This implies that we cannot distinguish them using only observational data. Interpreting the edges causally, however, we see that the DAGs have a starkly different interpretation. The bottom row makes this apparent by showing the result of an intervention on $X$. In the leftmost causal DAG, $Z$ is on the causal path from $X$ to $Y$, and intervening on $X$ therefore influences $Y$ through $Z$. In the DAG next to it, $Z$ is on the causal path from $Y$ to $X$, and so intervening on $X$ does not influence $Y$. In the third DAG, $Z$ is a common cause and, since there is no other path from $X$ to $Y$, intervening on $X$ does not influence $Y$. For the collider structure in the rightmost DAG, intervening on $X$ does not influence $Y$ because there is no unblocked path from $X$ to $Y$.</p>
<p>To make the distinction between seeing and doing, Pearl introduced the <em>do</em>-operator. While $p(Y \mid X = x)$ denotes the <em>observational</em> distribution, which corresponds to the process of seeing, $p(Y \mid do(X = x))$ corresponds to the <em>interventional</em> distribution, which corresponds to the process of doing. The former describes what values $Y$ would likely take on when $X$ <em>happened to be</em> $x$, while the latter describes what values $Y$ would likely take on when $X$ <em>would be set to</em> $x$.</p>
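<p>The difference between seeing and doing can be sketched in a simulation. The example below assumes the second DAG from the figure, $Y \rightarrow Z \rightarrow X$, with linear Gaussian relations and unit coefficients; these choices, the value $x = 2$, and the window width used to approximate conditioning are illustrative assumptions.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 10000
# Observational regime for the assumed DAG Y -> Z -> X
y_obs <- rnorm(n)
z_obs <- y_obs + rnorm(n)
x_obs <- z_obs + rnorm(n)

# Seeing: mean of Y among observations where X happened to be close to 2
seeing <- mean(y_obs[abs(x_obs - 2) < 0.25])

# Doing: setting X := 2 cuts the arrow Z -> X, so Y and Z are
# generated exactly as before and X no longer carries information about Y
y_do <- rnorm(n)
z_do <- y_do + rnorm(n)
x_do <- rep(2, n)
doing <- mean(y_do)

round(c(seeing = seeing, doing = doing), 2)</code></pre></figure>
<p>Under these assumptions, observing $X \approx 2$ shifts our expectation of $Y$ upwards, whereas setting $X$ to $2$ leaves the mean of $Y$ at about zero.</p>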
<h2 id="computing-causal-effects">Computing causal effects</h2>
<p>$p(Y \mid do(X = x))$ describes the causal effect of $X$ on $Y$, but how do we compute it? Actually <em>doing</em> the intervention might be infeasible or unethical; side-stepping actual interventions and still getting at causal effects is the whole point of this approach to causal inference. We want to learn causal effects from observational data, and so all we have is the observational DAG. The causal quantity, however, is defined on the manipulated DAG. We need to build a bridge between the observational DAG and the manipulated DAG, and we do this by making two assumptions.</p>
<p>First, we assume that <em>interventions are local</em>. This means that if I set $X = x$, then this only influences the variable $X$, with no other direct influence on any other variable. Of course, intervening on $X$ will influence other variables, but only through $X$, not directly through us intervening. In colloquial terms, we do not have a “fat hand”, but act like a surgeon precisely targeting only a very specific part of the DAG; we say that the DAG is composed of <em>modular</em> parts. We can encode this using the factorization property above:</p>
\[p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,\]
<p>which we now interpret causally. The factors in the product are sometimes called <em>causal Markov kernels</em>; they constitute the modular parts of the system.</p>
<p>Second, we assume that the mechanisms by which variables interact do not change through interventions; that is, the mechanism by which a cause brings about its effects does not change whether this occurs naturally or by intervention (see e.g., Pearl, Glymour, & Jewell, 2016, p. 56).</p>
<p>With these two assumptions in hand, further note that $p(Y \mid do(X = x))$ can be understood as the <em>observational</em> distribution in the manipulated DAG — $p_m(Y \mid X = x)$ — that is, the DAG where we set $X = x$. This is because after <em>doing</em> the intervention (which catapults us into the manipulated DAG), all that is left for us to do is to <em>see</em> its effect. Observe that the leftmost and rightmost DAG above remain the same under intervention on $X$, and so the interventional distribution $p(Y \mid do(X = x))$ is just the conditional distribution $p(Y \mid X = x)$. The middle DAGs require a bit more work:</p>
\[\begin{aligned}
p(Y = y \mid do(X = x)) &= p_{m}(Y = y \mid X = x) \\[.5em]
&= \sum_{z} p_{m}(Y = y, Z = z \mid X = x) \\[.5em]
&= \sum_{z} p_{m}(Y = y \mid X = x, Z = z) \, p_m(Z = z) \\[.5em]
&= \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \enspace .
\end{aligned}\]
<p>The first equality follows by definition. The second and third equality follow from the <em>sum</em> and <em>product</em> rule of probability; the third additionally uses that $Z$ is independent of $X$ in the manipulated DAG, where $X$ has no incoming arrows, so that $p_m(Z = z \mid X = x) = p_m(Z = z)$. The last line follows from the assumption that the mechanism through which $X$ influences $Y$ is independent of whether we set $X$ or whether $X$ naturally occurs, that is, $p_{m}(Y = y \mid X = x, Z = z) = p(Y = y \mid X = x, Z = z)$, and the assumption that interventions are local, that is, $p_m(Z = z) = p(Z = z)$. Thus, the interventional distribution we care about is equal to the conditional distribution of $Y$ given $X$ when we adjust for $Z$. Graphically speaking, this blocks the path $X \leftarrow Z \leftarrow Y$ in the left middle graph and the path $X \leftarrow Z \rightarrow Y$ in the right middle graph. If there were a path $X \rightarrow Y$ in these two latter graphs, and if we did not adjust for $Z$, then the causal effect of $X$ on $Y$ would be <em>confounded</em>. For these simple DAGs, however, it is already clear from the fact that $X$ is independent of $Y$ given $Z$ that $X$ cannot have a causal effect on $Y$. In the next section, we study a more complicated graph and look at confounding more closely.</p>
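<p>The adjustment formula can be applied directly to data. The following sketch assumes a binary common-cause structure $X \leftarrow Z \rightarrow Y$ without an arrow from $X$ to $Y$; the probabilities and the helper function <code>adjust</code> are hypothetical and only serve to illustrate the computation.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100000
# Hypothetical binary structure X <- Z -> Y (X has no effect on Y)
z <- rbinom(n, 1, 0.5)
x <- rbinom(n, 1, ifelse(z == 1, 0.8, 0.2))
y <- rbinom(n, 1, ifelse(z == 1, 0.7, 0.3))

# Adjustment formula: sum over z of p(Y = 1 | X = x, Z = z) * p(Z = z)
adjust <- function(x_val) {
  sum(sapply(c(0, 1), function(z_val) {
    mean(y[x == x_val & z == z_val]) * mean(z == z_val)
  }))
}

naive    <- mean(y[x == 1]) - mean(y[x == 0])  # p(Y=1 | X=1) - p(Y=1 | X=0)
adjusted <- adjust(1) - adjust(0)              # p(Y=1 | do(X=1)) - p(Y=1 | do(X=0))
round(c(naive = naive, adjusted = adjusted), 2)</code></pre></figure>
<p>With these assumed probabilities, the naive difference is clearly positive even though $X$ has no effect on $Y$, while the adjusted difference is close to zero.</p>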
<h2 id="confounding">Confounding</h2>
<p>Confounding has been given various definitions over the decades, but usually denotes the situation where a (possibly unobserved) common cause obscures the causal relationship between two or more variables. Here, we are slightly more precise and call a causal effect of $X$ on $Y$ confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$, which also implies that collider bias is a type of confounding. This occurred in the middle two DAGs in the example above, as well as in the chocolate consumption and Nobel Laureates example at the beginning of the blog post. Confounding is the bane of observational data analysis. Helpfully, causal DAGs provide us with a tool to describe multivariate relations between variables. Once we have stated our assumptions clearly, the <em>do</em>-calculus further provides us with a means to know what variables we need to adjust for so that causal effects are unconfounded.</p>
<p>We follow Pearl, Glymour, & Jewell (2016, p. 61) and define the <em>backdoor criterion</em>:</p>
<blockquote>
<p>Given two nodes $X$ and $Y$, an adjustment set $\mathcal{L}$ fulfills the backdoor criterion if no member in $\mathcal{L}$ is a descendant of $X$ and members in $\mathcal{L}$ block all paths between $X$ and $Y$ that contain an arrow into $X$. Adjusting for $\mathcal{L}$ thus yields the causal effect of $X$ on $Y$.</p>
</blockquote>
<p>The key observation is that this (a) blocks all spurious, that is, non-causal paths between $X$ and $Y$, (b) leaves all directed paths from $X$ to $Y$ unblocked, and (c) creates no spurious paths.</p>
<div style="float: left;">
<center>
<img src="../assets/img/Large-DAG.png" align="center" style="margin-right: 10px;" width="350" height="125" />
</center>
</div>
<p>To see this in action, let’s again look at the DAG on the left. The causal effect of $Z$ on $U$ is confounded by $X$, because in addition to the legitimate causal path $Z \rightarrow Y \rightarrow W \rightarrow U$, there is also an unblocked path $Z \leftarrow X \rightarrow W \rightarrow U$ which confounds the causal effect. The backdoor criterion would have us condition on $X$, which blocks the spurious path and renders the causal effect of $Z$ on $U$ unconfounded. Note that conditioning on $W$ would also block this spurious path; however, it would also block the causal path $Z \rightarrow Y \rightarrow W \rightarrow U$.</p>
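<p>A regression-based sketch illustrates the three options. It assumes the edges $X \rightarrow Z$, $X \rightarrow W$, $Z \rightarrow Y$, $V \rightarrow Y$, $Y \rightarrow W$, and $W \rightarrow U$ as read off the text, and a linear Gaussian model with unit coefficients; under these illustrative assumptions, the true total effect of $Z$ on $U$ is $1 \times 1 \times 1 = 1$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 10000
# Assumed edges: X -> Z, X -> W, Z -> Y, V -> Y, Y -> W, W -> U
x <- rnorm(n)
v <- rnorm(n)
z <- x + rnorm(n)
y <- z + v + rnorm(n)
w <- x + y + rnorm(n)
u <- w + rnorm(n)

# Coefficient of Z when regressing U on Z under three adjustment sets
unadjusted <- coef(lm(u ~ z))["z"]      # backdoor path via X left open
backdoor   <- coef(lm(u ~ z + x))["z"]  # adjusts for X, as the criterion suggests
blocked    <- coef(lm(u ~ z + w))["z"]  # adjusts for W, which blocks the causal path
round(c(unadjusted = unname(unadjusted),
        backdoor   = unname(backdoor),
        blocked    = unname(blocked)), 2)</code></pre></figure>
<p>With the assumed unit coefficients, the population values of these three coefficients are $1.5$, $1$, and $0$: not adjusting overestimates the effect, adjusting for $X$ recovers it, and adjusting for $W$ removes it altogether.</p>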
<p>Before moving on, let’s catch a quick breath. We have already discussed a number of very important concepts. At the lowest level of the causal hierarchy — association — we have discovered DAGs and $d$-separation as a powerful tool to reason about conditional (in)dependencies between variables. Moving to intervention, the second level of the causal hierarchy, we have satisfied our need to interpret the arrows in a DAG causally. Doing so required strong assumptions, but it allowed us to go beyond <em>seeing</em> and model the outcome of interventions. This hopefully clarified the notion of confounding. In particular, collider bias is a type of confounding, which has important practical implications: we should not blindly enter all variables into a regression in order to “control” for them, but think carefully about what the underlying causal DAG could look like. Otherwise, we might induce spurious associations.</p>
<p>The concepts from causal inference can help us understand methodological phenomena that have been discussed for decades. In the next section, we apply the concepts we have seen so far to make sense of one such phenomenon: <em>Simpson’s Paradox</em>.</p>
<h1 id="example-application-simpsons-paradox">Example Application: Simpson’s Paradox</h1>
<p>This section follows the example given in Pearl, Glymour, & Jewell (2016, Ch. 1) with slight modifications. Suppose you observe $N = 700$ patients who either <em>choose</em> to take a drug or not; note that this is not a randomized controlled trial. The table below shows the number of recovered patients split across sex.</p>
<center>
<img src="../assets/img/Simpsons-Data-I.png" align="center" style="margin-top: 20px; margin-bottom: 20px;" width="600" height="400" />
</center>
<p>We observe that more men as well as more women recover when taking the drug (93% and 73%) compared to when not taking the drug (87% and 69%). And yet, when taken together, <em>fewer</em> patients who took the drug recovered (78%) compared to patients who did not take the drug (83%). This is puzzling — should a doctor prescribe the drug or not?</p>
<p>To answer this question, we need to compute the causal effect that taking the drug has on the probability of recovery. As a first step, we draw the causal DAG. Suppose we know that women are more likely to take the drug, that being a woman has an effect on recovery more generally, and that the drug has an effect on recovery. Moreover, we know that the <em>treatment cannot cause sex</em>. This is a trivial yet crucial observation — it is impossible to express this in purely statistical language. Causal DAGs provide us with a tool to make such an assumption explicit; the graph below makes explicit that sex ($S$) is a common cause of both drug taking ($D$) and recovery ($R$). We denote $S = 1$ as being female, $D = 1$ as having chosen the drug, and $R = 1$ as having recovered. The left DAG is observational while the right DAG indicates the intervention $do(D = d)$, that is, forcing every patient to either take the drug ($d = 1$) or to not take the drug ($d = 0$).</p>
<center>
<img src="../assets/img/Simpsons-DAG-I.png" align="center" style="margin-right: 10px;" width="600" height="300" />
</center>
<p>We are interested in the probability of recovery if we would force everybody to take, or not take, the drug; we call the difference between these two probabilities the <em>average causal effect</em>. This is key: the <em>do</em>-operator is about populations, not individuals. Using it, we cannot make statements that pertain to the recovery of an individual patient; we can only refer to the probability of recovery as defined on populations of patients. We will discuss <em>individual causal effects</em> in the section on counterfactuals at the end of the blog post.</p>
<p>Computing the average causal effect requires knowledge about the interventional distributions $p(R \mid do(D = 0))$ and $p(R \mid do(D = 1))$. As discussed above, these correspond to the conditional distribution in the manipulated DAG which is shown above on the right. The backdoor criterion tells us that the conditional distribution in the observational DAG will correspond to the interventional distribution when blocking the spurious path $D \leftarrow S \rightarrow R$. Using the adjustment formula we have derived above, we expand:</p>
\[\begin{aligned}
p(R = 1 \mid do(D = 1)) &= \sum_{s} p(R = 1\mid D = 1, S = s) \, p(S = s) \\[.5em]
&= p(R = 1\mid D = 1, S = 0) \, p(S = 0) + p(R = 1\mid D = 1, S = 1) \, p(S = 1) \\[.5em]
&= \frac{81}{87} \times \frac{87 + 270}{700} + \frac{192}{263} \times \frac{263 + 80}{700} \\[.5em]
&\approx 0.833 \enspace .
\end{aligned}\]
<p>In words, we first compute the benefit of taking the drug separately for men and women, and then we average the result by weighting it with the fraction of men and women in the population. This tells us that, if we force everybody to take the drug, about $83\%$ of people will recover. We can similarly compute the probability of recovery given we force all patients to not take the drug:</p>
\[\begin{aligned}
p(R = 1\mid do(D = 0)) &= \sum_{s} p(R = 1\mid D = 0, S = s) \, p(S = s) \\[.5em]
&= p(R = 1\mid D = 0, S = 0) \, p(S = 0) + p(R = 1\mid D = 0, S = 1) \, p(S = 1) \\[.5em]
&= \frac{234}{270} \times \frac{87 + 270}{700} + \frac{55}{80} \times \frac{263 + 80}{700} \\[.5em]
&\approx 0.779 \enspace .
\end{aligned}\]
<p>Therefore, taking the drug does indeed have a positive effect on recovery on average, and the doctor should prescribe the drug.</p>
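<p>The same computation can be sketched in R. The counts below are taken from the fractions and percentages quoted in this section (for men who did not take the drug, 234 of 270 corresponds to the 87% reported above); the object names are illustrative.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Recovered and total counts by treatment (rows) and sex (columns)
recovered <- matrix(
  c(81, 234, 192, 55), nrow = 2,
  dimnames = list(drug = c("yes", "no"), sex = c("male", "female"))
)
total <- matrix(c(87, 270, 263, 80), nrow = 2, dimnames = dimnames(recovered))

p_recover <- recovered / total        # p(R = 1 | D = d, S = s)
p_sex <- colSums(total) / sum(total)  # p(S = s)

# Doing: adjustment formula, weighting sex-specific rates by p(S = s)
p_do <- drop(p_recover %*% p_sex)
# Seeing: aggregated recovery rates p(R = 1 | D = d)
p_see <- rowSums(recovered) / rowSums(total)

round(rbind(doing = p_do, seeing = p_see), 3)</code></pre></figure>
<p>The adjusted probabilities favor the drug (about 0.83 versus about 0.78), whereas the aggregated recovery rates point in the opposite direction (0.78 versus about 0.83), which is exactly the reversal described above.</p>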
<p>Note that this conclusion heavily depended on the causal graph. While graphs are wonderful tools in that they make our assumptions explicit, these assumptions are — of course — not at all guaranteed to be correct. These assumptions are strong, stating that the graph must encode all causal relations between variables, and that there is no unmeasured confounding, something we can never guarantee in observational data.</p>
<p>Let’s look at a different example but with the exact same data. In particular, instead of the variable sex we look at the <em>post-treatment</em> variable blood pressure. This means we have measured blood pressure after the patients have taken the drug. Should a doctor prescribe the drug or not?</p>
<center>
<img src="../assets/img/Simpsons-Data-II.png" align="center" style="margin-top: 20px; margin-bottom: 20px;" width="700" height="550" />
</center>
<p>Since blood pressure is a post-treatment variable, it cannot influence a patient’s decision to take the drug or not. We draw the following causal DAG, which makes clear that the drug has an indirect effect on recovery through blood pressure, in addition to having a direct causal effect.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote">5</a></sup></p>
<center>
<img src="../assets/img/Simpsons-DAG-II.png" align="center" style="margin-right: 10px;" width="600" height="300" />
</center>
<p>From this DAG, we find that the causal effect of $D$ on $R$ is unconfounded. Therefore, the two causal quantities of interest are given by:</p>
\[\begin{aligned}
p(R = 1 \mid do(D = 1)) &= p(R = 1 \mid D = 1) = 0.78 \\[.5em]
p(R = 1 \mid do(D = 0)) &= p(R = 1 \mid D = 0) = 0.83 \enspace .
\end{aligned}\]
<p>This means that the drug is indeed harmful. In the general population (combined data), the drug has a negative effect. Suppose that the drug has a direct positive effect on recovery, but an indirect negative effect through blood pressure. If we only look at patients with a particular blood pressure, then only the drug’s positive effect on recovery remains. However, since the drug does influence recovery negatively through blood pressure, it would be misleading to take the association between $D$ and $R$ conditional on blood pressure ($Z$) as our estimate for the causal effect. In contrast to the previous example, using the aggregate data is the correct way to analyze these data in order to estimate the average causal effect.</p>
<p>So far, our treatment has been entirely model-agnostic. In the next section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. This will unify the previous two levels of the causal hierarchy — <em>seeing</em> and <em>doing</em> — as well as open up the third and final level: counterfactuals.</p>
<h1 id="structural-causal-models">Structural Causal Models</h1>
<p>In this section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. SCMs relate causal and probabilistic statements. As an example, we specify:</p>
\[\begin{aligned}
X &:= \epsilon_X \\[.5em]
Y &:= f(X, \epsilon_Y) \enspace .
\end{aligned}\]
<p>$X$ is a direct cause of $Y$ which it influences through the function $f()$, and the noise variables $\epsilon_X$ and $\epsilon_Y$ are assumed to be independent. In an SCM, we take each equation to be a causal statement, and we stress this by using the assignment symbol $:=$ instead of the equality sign $=$. Note that this is in stark contrast to standard regression models; here, we explicitly state our causal assumptions.</p>
<p>As we will see below, Structural Causal Models imply observational distributions (<em>seeing</em>), interventional distributions (<em>doing</em>), as well as counterfactuals (<em>imagining</em>). Thus, they can be seen as the fundamental building block of this approach to causal inference. In the following, we restrict the class of Structural Causal Models by allowing only linear relationships between variables and assuming independent Gaussian error terms.<sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote">6</a></sup> As an example, take the following SCM (Peters, Janzing, & Schölkopf, 2017, p. 90):</p>
\[\begin{aligned}
X &:= \epsilon_X \\[.5em]
Y &:= X + \epsilon_Y \\[.5em]
Z &:= Y + \epsilon_Z \enspace ,
\end{aligned}\]
<p>where $\epsilon_X, \epsilon_Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1)$ and $\epsilon_Z \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 0.1)$ (the second parameter being the standard deviation). Again, each line explicates the causal link between variables. For example, we assume that $X$ has a direct causal effect on $Y$, that this effect is linear, and that it is obscured by independent Gaussian noise.</p>
<p>The assumption of Gaussian errors induces a multivariate Gaussian distribution on $(X, Y, Z)$ whose independence structure is visualized in the leftmost DAG below. The middle DAG shows an intervention on $Z$, while the rightmost DAG shows an intervention on $X$. Recall that, as discussed above, intervening on a variable cuts all incoming arrows.</p>
<center>
<img src="../assets/img/Prediction-vs-Intervention.png" align="center" style="margin-right: 10px;" width="700" height="400" />
</center>
<p>At the first level of the causal hierarchy — association — we might ask ourselves: does $X$ or $Z$ predict $Y$ better? To illustrate the answer for our example, we simulate $n = 1000$ observations from the Structural Causal Model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre></figure>
<p>The figure below shows that $Y$ has a much stronger association with $Z$ than with $X$; this is because the standard deviation of the error $\epsilon_Z$ is only a tenth of the standard deviation of the error $\epsilon_Y$. For prediction, therefore, $Z$ is the more relevant variable.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
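<p>The strength of these associations can also be quantified. The sketch below repeats the simulation from above with the same seed, so that it is self-contained, and computes the sample correlations and the $R^2$ of the two simple regressions; the object names are illustrative.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 1000
x <- rnorm(n, 0, 1)
y <- x + rnorm(n, 0, 1)
z <- y + rnorm(n, 0, 0.1)

# Sample correlations of Y with X and with Z
cor_xy <- cor(x, y)
cor_zy <- cor(z, y)

# Proportion of variance in Y explained by each predictor on its own
r2_x <- summary(lm(y ~ x))$r.squared
r2_z <- summary(lm(y ~ z))$r.squared

round(c(cor_xy = cor_xy, cor_zy = cor_zy, r2_x = r2_x, r2_z = r2_z), 3)</code></pre></figure>
<p>In the population, the correlation between $X$ and $Y$ is $1 / \sqrt{2} \approx 0.71$, while the correlation between $Z$ and $Y$ is $\sqrt{2 / 2.01} \approx 0.998$; the sample values should be close to these.</p>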
<p>But does $Z$ actually have a causal effect on $Y$? This is a question about intervention, which is squarely located at the second level of the causal hierarchy. With the knowledge of the underlying Structural Causal Model, we can easily simulate interventions in R and visualize their outcomes:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Simulate data from the SCM where do(Z = z)</span><span class="w">
</span><span class="n">intervene_z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Simulate data from the SCM where do(X = x)</span><span class="w">
</span><span class="n">intervene_x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">datz</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">intervene_z</span><span class="p">(</span><span class="n">z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">datx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">intervene_x</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y)'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">datz</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y | do(Z = 2))'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">datx</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y | do(X = 2))'</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>The leftmost histogram above shows the marginal distribution of $Y$ when no intervention takes place. The histogram in the middle shows the marginal distribution of $Y$ in the manipulated DAG where we set $Z = 2$. Observe that, as indicated by the causal graph, $Z$ does not have a causal effect on $Y$, so that $p(Y \mid do(Z = 2)) = p(Y)$. The histogram on the right shows the marginal distribution of $Y$ in the manipulated DAG where we set $X = 2$; this distribution is now centred around $2$.</p>
<p>Clearly, then, $X$ has a causal effect on $Y$. While we have touched on it already when discussing Simpson’s paradox, we now formally define the <em>Average Causal Effect</em>:</p>
\[\text{ACE}(X \rightarrow Y) = \mathbb{E}\left[Y \mid do(X = x + 1)\right] - \mathbb{E}\left[Y \mid do(X = x)\right] \enspace ,\]
<p>which in our case equals one, as can also be seen from the Structural Causal Model. Thus, SCMs allow us to model the outcome of interventions.<sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote">7</a></sup> However, note again that this is strictly about populations, not individuals. In the next section, we see how SCMs can allow us to climb up to the final level of the causal hierarchy, moving beyond the average to define individual causal effects.</p>
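<p>As a rough check of this definition, the average causal effect can be approximated by simulation. The sketch below assumes the structural equations $Y := X + \epsilon_Y$ with $\epsilon_Y \sim \mathcal{N}(0, 1)$ and $Z := Y + \epsilon_Z$ with $\epsilon_Z \sim \mathcal{N}(0, 0.1)$; the helper <code>do_x</code> and the choice $x = 2$ are introduced here purely for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 1e5

# Hypothetical helper: draw from the SCM under the intervention do(X = x).
# Assumed equations: Y := X + e_Y, e_Y ~ N(0, 1); Z := Y + e_Z, e_Z ~ N(0, 0.1)
do_x <- function(x, n) {
  y <- x + rnorm(n, 0, 1)
  z <- y + rnorm(n, 0, 0.1)
  cbind(x, y, z)
}

# Monte Carlo estimate of E[Y | do(X = x + 1)] - E[Y | do(X = x)] at x = 2
ace_hat <- mean(do_x(3, n)[, 'y']) - mean(do_x(2, n)[, 'y'])
ace_hat  # close to 1, up to simulation error</code></pre></figure>
<p>Because the assumed model is linear, the estimate should not depend on the particular value of $x$ at which the two interventions are compared.</p>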
<h1 id="counterfactuals">Counterfactuals</h1>
<p>In the <em>Unbearable Lightness of Being</em>, Milan Kundera has Tomáš ask himself:</p>
<blockquote>
<p>“Was it better to be with Tereza or to remain alone?”</p>
</blockquote>
<p>To which he answers:</p>
<blockquote>
<p>“There is no means of testing which decision is better, because there is no basis for comparison. We live everything as it comes, without warning, like an actor going on cold. And what can life be worth if the first rehearsal for life is life itself?”</p>
</blockquote>
<p>Kundera is describing, as Holland (1986, p. 947) put it, the “fundamental problem of causal inference”, namely that we only ever observe one realization. If Tomáš chooses to stay with Tereza, then he cannot also choose not to stay with Tereza. He cannot go back in time and reverse his decision, living instead “everything as it comes, without warning”. This does not mean, however, that Tomáš cannot assess afterwards whether his choice has been wise. As a matter of fact, humans constantly evaluate mutually exclusive options, only one of which ever comes true; that is, humans reason <em>counterfactually</em>.</p>
<p>To do this formally requires strong assumptions. The <em>do</em>-operator, introduced above, is too weak to model counterfactuals. This is because it operates on distributions that are defined on populations, not on individuals. We can define an average causal effect using the <em>do</em>-operator, but — unsurprisingly — it only ever refers to averages. Structural Causal Models allow counterfactual reasoning on the level of the individual. To see this, we use a simple example.</p>
<p>Suppose we want to study the causal effect of grandma’s treatment for the common cold ($T$) on the speed of recovery ($R$). Usually, people recover from the common cold in <a href="https://en.wikipedia.org/wiki/Common_cold">seven to ten days</a>, but grandma swears she can do better with a simple intervention — we agree on doing an experiment. Assume we have the following SCM:</p>
\[\begin{aligned}
T &:= \epsilon_T \\[.5em]
R &:= \mu + \beta T + \epsilon \enspace ,
\end{aligned}\]
<p>where $\mu$ is an intercept, $\epsilon_T \sim \text{Bern}(0.50)$ indicates random assignment to either receive the treatment ($T = 1$) or not receive it ($T = 0$), and $\epsilon \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma)$. The SCM tells us that the direct causal effect of the treatment on how quickly patients recover from the common cold is $\beta$. This causal effect is obscured by individual error terms for each patient $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_N)$, which are aggregate terms for all the things left unmodelled (see <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">this</a> blog post for some history). In particular, $\epsilon_k$ summarizes all the things that have an effect on the speed of recovery for patient $k$.</p>
<p>Once we have collected the data, suppose we find that $\mu = 5$, $\beta = -2$, and $\sigma = 2$. This does speak for grandma’s treatment, since it shortens the recovery time by 2 days on average:</p>
\[\begin{aligned}
\text{ACE}(T \rightarrow R) &= \mathbb{E}\left[R \mid do(T = 1)\right] - \mathbb{E}\left[R \mid do(T = 0)\right] \\[.5em]
&= \mathbb{E}\left[\mu + \beta + \epsilon\right] - \mathbb{E}\left[\mu + \epsilon\right] \\[.5em]
&= \left(\mu + \beta\right) - \mu \\[.5em]
&= \beta \enspace .
\end{aligned}\]
<p>Given the value for $\epsilon_k$, the Structural Causal Model is fully determined, and we may write $R(\epsilon_k)$ for the speed of recovery for patient $k$. To make this example more concrete, we simulate some data in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="c1"># Structural Causal Model</span><span class="w">
</span><span class="n">e_T</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w">
</span><span class="n">e</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="nb">T</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">e_T</span><span class="w">
</span><span class="n">R</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="nb">T</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">e</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">,</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## T R e
## [1,] 0 5.7962118 0.7962118
## [2,] 0 3.7759472 -1.2240528
## [3,] 1 3.6822394 0.6822394
## [4,] 1 0.7412738 -2.2587262
## [5,] 0 7.8660474 2.8660474
## [6,] 1 6.9607998 3.9607998</code></pre></figure>
<p>We see that the first patient did not receive the treatment ($T = 0$), took about $R = 5.80$ days to recover from the common cold, and has a unique value $\epsilon_1 = 0.80$. Would this particular patient have recovered more quickly if we had given him grandma’s treatment even though we did not? We denote this quantity of interest as $R_{T = 1}(\epsilon_1)$ to contrast it with the actually observed $R_{T = 0}(\epsilon_1)$. To compute this seemingly otherworldly quantity, we simply plug the values $T = 1$ and $\epsilon_1 = 0.80$ into our Structural Causal Model, which yields:</p>
\[R_{T = 1}(\epsilon_1) = 5 - 2 + 0.80 = 3.80 \enspace .\]
<!-- There is one remaining complication. Since $\epsilon_1 \sim \mathcal{N}(0, \sigma)$, that is, there is uncertainty as to the effect of unmodelled factors, the counterfactual quantity $R_{T = 1}(\epsilon_1)$ is not deterministic but stochastic. We can average over this uncertainty by taking the expectation, which yields the expected duration of the recovery for patient $k = 1$ when given the treatment, even though the patient did not receive the treatment and had an actual recovery speed of $5.80$. Formally, this is: -->
<!-- $$ -->
<!-- \mathbb{E}\left[R_{T = 1} \mid T = 0, R = 5.80\right] = \mathbb{E}\left[5 - 2 + \epsilon_1\right] = 3 \enspace . -->
<!-- $$ -->
<p>Using this, we can define the <em>individual causal effect</em> as:</p>
\[\begin{aligned}
\text{ICE}(T \rightarrow R) &= R_{T = 1}(\epsilon_1) - R_{T = 0}(\epsilon_1) \\[.5em]
&= 3.80 - 5.80 \\[.5em]
&= -2 \enspace ,
\end{aligned}\]
<p>which in this example is equal to the average causal effect due to the <a href="https://stats.stackexchange.com/a/385558">linearity of the underlying SCM</a> (Pearl, Glymour, & Jewell, 2016, p. 106). In general, individual causal effects are not identified, and we have to resort to average causal effects.<sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote">8</a></sup></p>
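<p>The counterfactual computation can be sketched in R for all patients at once. The code below regenerates data from the same SCM as in the simulation above; the variable names (<code>treat</code>, <code>R_obs</code>, <code>R_treated</code>, <code>R_untreated</code>) are chosen here for illustration only. Note that this only works because we simulated the noise terms ourselves and therefore know them.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100

# Same SCM as in the simulation above: R := 5 - 2 * T + e
treat <- rbinom(n, 1, .5)       # random treatment assignment
e <- rnorm(n, 0, 2)             # patient-specific noise terms
R_obs <- 5 - 2 * treat + e      # observed recovery times

# Counterfactual outcomes: hold e fixed and set the treatment by hand
R_treated <- 5 - 2 * 1 + e
R_untreated <- 5 - 2 * 0 + e

# Individual causal effects; every entry equals beta = -2 in this linear SCM
ice <- R_treated - R_untreated

# The average causal effect, estimated from the observed data only
ace_hat <- mean(R_obs[treat == 1]) - mean(R_obs[treat == 0])

round(c(range(ice), ace_hat), 2)</code></pre></figure>
<p>The individual effects are identical for every patient, whereas the difference in group means only approximates $\beta$ because of sampling variability. With real data the noise terms are unobserved, which is why the individual causal effect is in general not identified.</p>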
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{ICE}(T \rightarrow R) &= \mathbb{E}\left[R_{T = 1} \mid T = 1, R = 5.80\right] - \mathbb{E}\left[R_{T = 1} \mid T = 0, R = 5.80\right] \\[.5em] -->
<!-- &= \mathbb{E}\left[5 - 2 + \epsilon_1\right] - \mathbb{E}\left[5 + \epsilon_1\right] \\[.5em] -->
<!-- &= 3 \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<p>Answering the question of whether a particular patient would have recovered more quickly had we given him the treatment even though we did not give him the treatment seems almost fantastical. It is a <em>cross-world</em> statement: given what we have observed, we ask about what would have been if things had turned out differently. It may strike you as a bit eerie to speak about different worlds. Peters, Janzing, & Schölkopf (2017, p. 106) state that it is “debatable whether this additional [counterfactual] information [encoded in the SCM] is useful.” It certainly requires strong assumptions. More broadly, Dawid (2000) argues in favour of causal inference without counterfactuals, and he does not seem to have shifted his position in <a href="https://twitter.com/fdabl/status/1110944752571158528">recent years</a>. Yet if we want to design machines that can achieve human-level reasoning, we need to endow them with counterfactual thinking (Pearl, 2019a). Moreover, many concepts that are relevant in legal and ethical domains, such as fairness (Kusner et al., 2017), require counterfactuals.</p>
<p>Before we end, note that the graphical approach to causal inference outlined in this blog post is not the only game in town. The <em>potential outcome</em> framework for causal inference developed by <a href="https://en.wikipedia.org/wiki/Rubin_causal_model">Donald Rubin</a> and others avoids graphical models and takes counterfactual quantities as primary. However, although starting from counterfactual statements that are defined at the individual level, it is my understanding that most work that uses potential outcomes focuses on <em>average causal effects</em>. As outlined above, these only require the second level of the causal hierarchy (<em>doing</em>) and are therefore much less contentious than <em>individual causal effects</em>, which sit at the top of the causal hierarchy.</p>
<p>The graphical approach outlined in this blog post and the potential outcome framework are logically equivalent (Peters, Janzing, & Schölkopf, 2017, p. 125), and although there is quite some debate surrounding the two approaches, it is probably wise to be pragmatic and simply choose the tool that works best for a particular application. As Lauritzen (2004, p. 189) put it, he sees the</p>
<blockquote>
<p>“different formalisms as different ‘languages’. The French language may be best for making love whereas the Italian may be more suitable for singing, but both are indeed possible, and I have no difficulty accepting that potential responses, structural equations, and graphical models coexist as languages expressing causal concepts each with their virtues and vices.<sup id="fnref:9" role="doc-noteref"><a href="#fn:9" class="footnote">9</a></sup>”</p>
</blockquote>
<p>For further reading, I wholeheartedly recommend the textbooks by Pearl, Glymour, & Jewell (<a href="http://bayes.cs.ucla.edu/PRIMER/">2016</a>) as well as Peters, Janzing, & Schölkopf (<a href="https://mitpress.mit.edu/books/elements-causal-inference">2017</a>). For bedtime reading, I can recommend Pearl & McKenzie (<a href="https://www.goodreads.com/book/show/36204378-the-book-of-why">2018</a>). Miguel Hernán teaches an excellent introductory online course on causal diagrams <a href="https://www.edx.org/course/causal-diagrams-draw-your-assumptions-before-your">here</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have touched on several key concepts of causal inference. We have started with the puzzling observation that chocolate consumption and the number of Nobel Laureates are strongly positively related. At the lowest level of the causal ladder — association — we have seen how directed acyclic graphs can help us visualize conditional independencies, and how <em>d</em>-separation provides us with an algorithmic tool to check such independencies.</p>
<p>Moving up to the second level (intervention), we have seen how the <em>do</em>-operator models populations under interventions. This helped us define <em>confounding</em>, the bane of observational data analysis, as occurring when $p(Y \mid X = x) \neq p(Y \mid do(X = x))$. This comes with the important observation that entering all variables into a regression in order to “control” for them is misguided; rather, we need to think carefully about the underlying causal relations lest we introduce bias, for example by conditioning on a collider. The <em>backdoor criterion</em> provided us with a graphical way to assess whether an effect is confounded or not.</p>
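<p>The warning about “controlling” for everything can be made concrete with a small simulation. The sketch below assumes a hypothetical collider structure in which $X$ and $Y$ are independent and both cause $Z$; the coefficients and noise levels are arbitrary choices made for this illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 1e4

# Hypothetical collider structure X -> Z <- Y with independent X and Y
x <- rnorm(n)
y <- rnorm(n)
z <- x + y + rnorm(n)

# Without conditioning on Z, the coefficient of X is close to zero
b_marginal <- coef(lm(y ~ x))['x']

# "Controlling" for the collider Z induces a spurious negative association
b_collider <- coef(lm(y ~ x + z))['x']

round(c(b_marginal, b_collider), 2)</code></pre></figure>
<p>Under these assumptions, the coefficient of $X$ is approximately zero in the first regression and approximately $-0.5$ in the second, although $X$ has no causal effect on $Y$ at all.</p>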
<p>Finally, we have seen that Structural Causal Models (SCMs) provide the building block from which observational and interventional distributions follow. SCMs further imply counterfactual statements, which sit at the top of the causal hierarchy. These allow us to move beyond the <em>do</em>-operator and average causal effects: they enable us to answer questions about what would have been if things had been different.</p>
<hr />
<p><em>I would like to thank <a href="https://ryanoisin.github.io/">Oisín Ryan</a> and <a href="https://cruwell.com/">Sophia Crüwell</a> for very helpful comments on this blog.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Bollen, K. A., & Pearl, J. (<a href="https://link.springer.com/chapter/10.1007/978-94-007-6094-3_15">2013</a>). Eight myths about causality and structural equation models. In <em>Handbook of Causal Analysis for Social Research</em> (pp. 301-328). Springer, Dordrecht.</li>
<li>Cartwright, N. (2007). <em>Hunting Causes and Using them: Approaches in Philosophy and Economics</em>. Cambridge University Press.</li>
<li>Dawid, A. P. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2000.10474210">2000</a>). Causal inference without counterfactuals. <em>Journal of the American Statistical Association, 95</em>(450), 407-424.</li>
<li>Hernán, M.A., & Robins J.M. (<a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/">2020</a>). <em>Causal Inference: What If</em>. Boca Raton: Chapman & Hall/CRC.</li>
<li>Holland, P. W. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1986.10478354">1986</a>). Statistics and causal inference. <em>Journal of the American Statistical Association, 81</em>(396), 945-960.</li>
<li>Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (<a href="https://papers.nips.cc/paper/6995-counterfactual-fairness">2017</a>). Counterfactual fairness. In <em>Advances in Neural Information Processing Systems</em> (pp. 4066-4076).</li>
<li>Lauritzen, S. L., Aalen, O. O., Rubin, D. B., & Arjas, E. (<a href="https://www.jstor.org/stable/4616823?casa_token=aseDj2RNgjcAAAAA:iJpo1EhqcVN_89UT2AMMR0FynAC9mnake3YgBbFUoG81rNn8jbVQcQTs6NJdt3l3XDOQDRBreeILUOpvNrRglQ8CR6HQuHbg7x_F6CIIdaK_rTVfFfZMUg&seq=1#metadata_info_tab_contents">2004</a>). Discussion on Causality [with Reply]. <em>Scandinavian Journal of Statistics, 31</em>(2), 189-201.</li>
<li>Pearl, J. (<a href="https://dl.acm.org/citation.cfm?id=3241036">2019a</a>). The seven tools of causal inference, with reflections on machine learning. <em>Communications of the ACM, 62</em>(3), 54-60.</li>
<li>Pearl, J. (<a href="https://www.degruyter.com/view/j/jci.2019.7.issue-1/jci-2019-2002/jci-2019-2002.xml">2019b</a>). On the interpretation of do(x). <em>Journal of Causal Inference, 7</em>(1).</li>
<li>Pearl, J. (<a href="https://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf">2012</a>). The Causal Foundations of Structural Equation Modeling.</li>
<li>Pearl, J., Glymour, M., & Jewell, N. P. (<a href="http://bayes.cs.ucla.edu/PRIMER/">2016</a>). <em>Causal Inference in Statistics: A Primer</em>. John Wiley & Sons.</li>
<li>Peters, J., Janzing, D., & Schölkopf, B. (<a href="https://mitpress.mit.edu/books/elements-causal-inference">2017</a>). <em>Elements of Causal Inference: Foundations and Learning Algorithms</em>. MIT Press.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The content of this blog post is an extended write-up of a one-hour lecture I gave to $3^{\text{rd}}$ year psychology undergraduate students at the University of Amsterdam. You can view the presentation, which includes exercises at the end, <a href="https://fabiandablander.com/assets/talks/Causal-Lecture">here</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Messerli (2012) was the first to look at this relationship. The data I present here are somewhat different. I include Nobel Laureates up to 2019, and I use the 2017 chocolate consumption data as reported <a href="https://www.statista.com/statistics/819288/worldwide-chocolate-consumption-by-country/">here</a>. You can download the data set <a href="https://fabiandablander.com/assets/data/nobel-chocolate.csv">here</a>. To get the data reported by Messerli (2012) into R, you can follow <a href="http://gforge.se/2012/12/chocolate-and-nobel-prize/">this</a> blogpost. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>I can recommend <a href="https://www.edx.org/course/causal-diagrams-draw-your-assumptions-before-your">this</a> course on causal diagrams by Miguel Hernán to get more intuition for causal graphical models. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>If the converse implication, that is, the implication from the distribution to the graph, holds, we say that the graph is <em>faithful</em> to the distribution. This is an important assumption in causal learning because it allows one to estimate causal relations from conditional independencies in the data. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>A causal effect is <em>direct</em> only at a particular level of abstraction. The drug works by inducing certain biochemical reactions that might themselves be described by DAGs. On a finer scale, then, the direct effect ceases to be direct. <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:6" role="doc-endnote">
<p>Structural Causal Models are closely related to Structural Equation Models. The latter allow latent variables, but their causal content has been debated throughout the last century. For more information, see for example Pearl (2012) and Bollen & Pearl (2013). <a href="#fnref:6" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:7" role="doc-endnote">
<p>For the interpretation of the <em>do</em>-operator for non-manipulable causes, see Pearl (2019b). <a href="#fnref:7" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:8" role="doc-endnote">
<p>Here, we have focused on <em>deterministic</em> counterfactuals, assigning a single value to the counterfactual $R_{T = 1}(\epsilon_1)$. This is in contrast to <em>stochastic</em> or <em>non-deterministic</em> counterfactuals, which follow a distribution. This distinction does not matter for average causal effects, but it does for individual ones (Hernán & Robins, 2020, p. 10). <a href="#fnref:8" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:9" role="doc-endnote">
<p>One can only hope that Bayesians and Frequentists become inspired by the pragmatism expressed here so poetically by Lauritzen. <a href="#fnref:9" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Fabian DablanderAn extended version of this blog post is available from here. Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewell Wright, Judea Pearl, and others. We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the “ladder of causal inference”, from association (seeing) to intervention (doing) to counterfactuals (imagining). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the do-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let’s dive in!1 Correlation and Causation Messerli (2012) published a paper entitled “Chocolate Consumption, Cognitive Function, and Nobel Laureates” in The New England Journal of Medicine showing a strong positive relationship between chocolate consumption and the number of Nobel Laureates. I have found an even stronger relationship using updated data2, as visualized in the figure below. Now, except for people in the chocolate business, it would be quite a stretch to suggest that increasing chocolate consumption would increase the number Nobel Laureates. Correlation does not imply causation because it does not constrain the possible causal relations enough. Hans Reichenbach (1956) formulated the common cause principle which speaks to this fact: If two random variables $X$ and $Y$ are statistically dependent ($X \not \perp Y$), then either (a) $X$ causes $Y$, (b) $Y$ causes $X$, or (c) there exists a third variable $Z$ that causes both $X$ and $Y$. 
Further, $X$ and $Y$ become independent given $Z$, i.e., $X \perp Y \mid Z$. An in principle straightforward way to break this uncertainty is to conduct an experiment: we could, for example, force the citizens of Austria to consume more chocolate, and study whether this increases the number of Nobel laureates in the following years. Such experiments are clearly unfeasible, but even in less extreme settings it is frequently unethical, impractical, or impossible — think of smoking and lung cancer — to study an association experimentally. Causal inference provides us with tools that license causal statements even in the absence of a true experiment. This comes with strong assumptions. In the next section, we discuss the “causal hierarchy”. The Causal Hierarchy Pearl (2019a) introduces a causal hierarchy with three levels — association, intervention, and counterfactuals — as well as three prototypical actions corresponding to each level — seeing, doing, and imagining. In the remainder of this blog post, we will tackle each level in turn. Seeing Association is on the most basic level; it makes us see that two or more things are somehow related. Importantly, we need to distinguish between marginal associations and conditional associations. The latter are the key building block of causal inference. The figure below illustrates these two concepts. If we look at the whole, aggregated data on the left we see that the continuous variables $X$ and $Y$ are positively correlated: an increase in values for $X$ co-occurs with an increase in values for $Y$. This relation describes the marginal association of $X$ and $Y$ because we do not care whether $Z = 0$ or $Z = 1$. On the other hand, if we condition on the binary variable $Z$, we find that there is no relation: $X \perp Y \mid Z$. (For more on marginal and conditional associations in the case of Gaussian distributions, see this blog post). 
In the next section, we discuss a powerful tool that allows us to visualize such dependencies. Directed Acyclic Graphs We can visualize the statistical dependencies between the three variables using a graph. A graph is a mathematical object that consists of nodes and edges. In the case of Directed Acyclic Graphs (DAGs), these edges are directed. We take our variables $(X, Y, Z$) to be nodes in such DAG and we draw (or omit) edges between these nodes so that the conditional (in)dependence structure in the data is reflected in the graph. We will explain this more formally shortly. For now, let’s focus on the relationship between the three variables. We have seen that $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. It turns out that we can draw three DAGs that encode this fact; these are the first three DAGs in the figure below. $X$ and $Y$ are dependent through $Z$ in these graphs, and conditioning on $Z$ blocks the path between $X$ and $Y$. We state this more formally shortly. While it is natural to interpret the arrows causally, we do not do so here. For now, the arrows are merely tools that help us describe associations between variables. The figure above also shows a fourth DAG, which encodes a different set of conditional (in)dependence relations between $X$, $Y$, and $Z$. The figure below illustrates this: looking at the aggregated data we do not find a relation between $X$ and $Y$ — they are marginally independent — but we do find one when looking at the disaggregated data — $X$ and $Y$ are conditionally dependent given $Z$. A real-world example might help build intuition: Looking at people who are single and who are in a relationship as a separate group, being attractive ($X$) and being intelligent ($Y$) are two independent traits. This is what we see in the left panel in the figure above. Let’s make the reasonable assumption that both being attractive and being intelligent are positively related with being in a relationship. 
What does this imply? First, it implies that, on average, single people are less attractive and less intelligent (see red data points). Second, and perhaps counter-intuitively, it implies that in the population of single people (and people in a relationship, respectively), being attractive and being intelligent are negatively correlated. After all, if the handsome person you met at the bar were also intelligent, then he would most likely be in a relationship! In this example, visualized in the fourth DAG, $Z$ is commonly called a collider. Suppose we want to estimate the association between $X$ and $Y$ in the whole population. Conditioning on a collider (for example, by only analyzing data from people who are not in a relationship) while computing the association between $X$ and $Y$ will lead to a different estimate, and the induced bias is known as collider bias. It is a serious issue not only in dating, but also for example in medicine. The simple graphs shown above are the building blocks of more complicated graphs. In the next section, we describe a tool that can help us find (conditional) independencies between sets of variables.3 $d$-separation For large graphs, it is not obvious how to conclude that two nodes are (conditionally) independent. d-separation is a tool that allows us to check this algorithmically. We need to define some concepts: A path from $X$ to $Y$ is a sequence of nodes and edges such that the start and end nodes are $X$ and $Y$, respectively. A conditioning set $\mathcal{L}$ is the set of nodes we condition on (it can be empty). A collider along a path blocks that path. However, conditioning on a collider (or any of its descendants) unblocks that path. With these definitions out of the way, we call two nodes $X$ and $Y$ $d$-separated by $\mathcal{L}$ if conditioning on all members in $\mathcal{L}$ blocks all paths between the two nodes. If this is your first encounter with $d$-separation, then this is a lot to wrap your head around. 
To get some practice, look at the graph on the left side. First, note that there are no marginal independencies; this means that without conditioning or blocking nodes, any two nodes are connected by an unblocked path. For example, there is a path going from $X$ to $Y$ through $Z$, and there is a path from $V$ to $U$ going through $Y$ and $W$. However, there are a number of conditional independencies. For example, $X$ and $Y$ are conditionally independent given $Z$. Why? There are two paths from $X$ to $Y$: one through $Z$ and one through $W$. However, since $W$ is a collider on the path from $X$ to $Y$, that path is already blocked. The only unblocked path from $X$ to $Y$ is through $Z$, and conditioning on $Z$ therefore blocks all remaining open paths. Additionally conditioning on $W$ would unblock one path, and $X$ and $Y$ would again be associated. So far, we have implicitly assumed that conditional (in)dependencies in the graph correspond to conditional (in)dependencies between variables. We make this assumption explicit now. In particular, note that $d$-separation provides us with an independence model $\perp_{\mathcal{G}}$ defined on graphs. To connect this to our standard probabilistic independence model $\perp_{\mathcal{P}}$ defined on random variables, we assume the following Markov property: \[X \perp_{\mathcal{G}} Y \mid Z \implies X \perp_{\mathcal{P}} Y \mid Z \enspace .\] In words, we assume that if the nodes $X$ and $Y$ are $d$-separated by $Z$ in the graph $\mathcal{G}$, the corresponding random variables $X$ and $Y$ are conditionally independent given $Z$. This implies that every conditional independence encoded in the graph also holds in the data.[4] Moreover, the statement above implies (and is implied by) the following factorization: \[p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,\] where $\text{pa}^{\mathcal{G}}(X_i)$ denotes the parents of the node $X_i$ in graph $\mathcal{G}$ (see Peters, Janzing, & Schölkopf, p. 101). 
A node is a parent of another node if it has an outgoing arrow to that node; for example, $X$ is a parent of $Z$ and $W$ in the graph above. The above factorization implies that a node $X$ is independent of its non-descendants given its parents. $d$-separation is an extremely powerful tool. Until now, however, we have only looked at DAGs to visualize (conditional) independencies. In the next section, we go beyond seeing to doing.

Doing

We do not merely want to see the world, but also change it. From this section on, we are willing to interpret DAGs causally. As Dawid (2009a) warns, this is a serious step. In merely describing conditional independencies (seeing), the arrows in the DAG played a somewhat minor role, being nothing but “incidental construction features supporting the $d$-separation semantics” (Dawid, 2009a, p. 66). In this section, we endow the DAG with a causal meaning and interpret the arrows as denoting direct causal effects. What is a causal effect? Following Pearl and others, we take an interventionist position and say that a variable $X$ has a causal influence on $Y$ if changing $X$ leads to changes in $Y$. This position is a very useful one in practice, but not everybody agrees with it (e.g., Cartwright, 2007). The figure below shows the observational DAGs from above (top row) as well as the manipulated DAGs (bottom row) where we have intervened on the variable $X$, that is, set the value of the random variable $X$ to a constant $x$. Note that setting the value of $X$ cuts all incoming causal arrows since its value is thereby determined only by the intervention, not by any other factors. As is easily verified with $d$-separation, the first three graphs in the top row encode the same conditional independence structure. This implies that we cannot distinguish them using only observational data. Interpreting the edges causally, however, we see that the DAGs have a starkly different interpretation. 
The bottom row makes this apparent by showing the result of an intervention on $X$. In the leftmost causal DAG, $Z$ is on the causal path from $X$ to $Y$, and intervening on $X$ therefore influences $Y$ through $Z$. In the DAG next to it, $Z$ is on the causal path from $Y$ to $X$, and so intervening on $X$ does not influence $Y$. In the third DAG, $Z$ is a common cause and, since there is no other path from $X$ to $Y$, intervening on $X$ does not influence $Y$. For the collider structure in the rightmost DAG, intervening on $X$ does not influence $Y$ because there is no unblocked path from $X$ to $Y$. To make the distinction between seeing and doing, Pearl introduced the do-operator. While $p(Y \mid X = x)$ denotes the observational distribution, which corresponds to the process of seeing, $p(Y \mid do(X = x))$ denotes the interventional distribution, which corresponds to the process of doing. The former describes what values $Y$ would likely take on when $X$ happened to be $x$, while the latter describes what values $Y$ would likely take on if $X$ were set to $x$.

Computing causal effects

$p(Y \mid do(X = x))$ describes the causal effect of $X$ on $Y$, but how do we compute it? Actually doing the intervention might be infeasible or unethical; side-stepping actual interventions and still getting at causal effects is the whole point of this approach to causal inference. We want to learn causal effects from observational data, and so all we have is the observational DAG. The causal quantity, however, is defined on the manipulated DAG. We need to build a bridge between the observational DAG and the manipulated DAG, and we do this by making two assumptions. First, we assume that interventions are local. This means that if I set $X = x$, then this only influences the variable $X$, with no other direct influence on any other variable. Of course, intervening on $X$ will influence other variables, but only through $X$, not directly through us intervening. 
In colloquial terms, we do not have a “fat hand”, but act like a surgeon precisely targeting only a very specific part of the DAG; we say that the DAG is composed of modular parts. We can encode this using the factorization property above: \[p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,\] which we now interpret causally. The factors in the product are sometimes called causal Markov kernels; they constitute the modular parts of the system. Second, we assume that the mechanisms by which variables interact do not change through interventions; that is, the mechanism by which a cause brings about its effects does not change whether this occurs naturally or by intervention (see e.g., Pearl, Glymour, & Jewell, p. 56). With these two assumptions in hand, further note that $p(Y \mid do(X = x))$ can be understood as the observational distribution in the manipulated DAG, $p_m(Y \mid X = x)$, that is, the DAG where we set $X = x$. This is because after doing the intervention (which catapults us into the manipulated DAG), all that is left for us to do is to see its effect. Observe that the leftmost and rightmost DAGs above remain the same under intervention on $X$, and so the interventional distribution $p(Y \mid do(X = x))$ is just the conditional distribution $p(Y \mid X = x)$. The middle DAGs require a bit more work: \[\begin{aligned} p(Y = y \mid do(X = x)) &= p_{m}(Y = y \mid X = x) \\[.5em] &= \sum_{z} p_{m}(Y = y, Z = z \mid X = x) \\[.5em] &= \sum_{z} p_{m}(Y = y \mid X = x, Z = z) \, p_m(Z = z) \\[.5em] &= \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \enspace . \end{aligned}\] The first equality follows by definition. The second and third equalities follow from the sum and product rule of probability, together with the fact that $Z$ is independent of $X$ in the manipulated DAG. 
The last line follows from the assumption that the mechanism through which $X$ influences $Y$ is independent of whether we set $X$ or whether $X$ naturally occurs, that is, $p_{m}(Y = y \mid X = x, Z = z) = p(Y = y \mid X = x, Z = z)$, and the assumption that interventions are local, that is, $p_m(Z = z) = p(Z = z)$. Thus, the interventional distribution we care about is equal to the conditional distribution of $Y$ given $X$ when we adjust for $Z$. Graphically speaking, this blocks the path $X \leftarrow Z \leftarrow Y$ in the left middle graph and the path $X \leftarrow Z \rightarrow Y$ in the right middle graph. If there were a path $X \rightarrow Y$ in these two latter graphs, and if we did not adjust for $Z$, then the causal effect of $X$ on $Y$ would be confounded. For these simple DAGs, however, it is already clear from the fact that $X$ is independent of $Y$ given $Z$ that $X$ cannot have a causal effect on $Y$. In the next section, we study a more complicated graph and look at confounding more closely.

Confounding

Confounding has been given various definitions over the decades, but usually denotes the situation where a (possibly unobserved) common cause obscures the causal relationship between two or more variables. Here, we are slightly more precise and call a causal effect of $X$ on $Y$ confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$, which also implies that collider bias is a type of confounding. This occurred in the middle two DAGs in the example above, as well as in the chocolate consumption and Nobel Laureates example at the beginning of the blog post. Confounding is the bane of observational data analysis. Helpfully, causal DAGs provide us with a tool to describe multivariate relations between variables. Once we have stated our assumptions clearly, the do-calculus further provides us with a means to know what variables we need to adjust for so that causal effects are unconfounded. We follow Pearl, Glymour, & Jewell (2016, p. 
61) and define the backdoor criterion: Given two nodes $X$ and $Y$, an adjustment set $\mathcal{L}$ fulfills the backdoor criterion if no member in $\mathcal{L}$ is a descendant of $X$ and members in $\mathcal{L}$ block all paths between $X$ and $Y$ that contain an arrow into $X$. Adjusting for $\mathcal{L}$ thus yields the causal effect of $X$ on $Y$. The key observation is that this (a) blocks all spurious, that is, non-causal paths between $X$ and $Y$, (b) leaves all directed paths from $X$ to $Y$ unblocked, and (c) creates no spurious paths. To see this in action, let’s again look at the DAG on the left. The causal effect of $Z$ on $U$ is confounded by $X$, because in addition to the legitimate causal path $Z \rightarrow Y \rightarrow W \rightarrow U$, there is also an unblocked path $Z \leftarrow X \rightarrow W \rightarrow U$ which confounds the causal effect. The backdoor criterion would have us condition on $X$, which blocks the spurious path and renders the causal effect of $Z$ on $U$ unconfounded. Note that conditioning on $W$ would also block this spurious path; however, it would also block the causal path $Z \rightarrow Y \rightarrow W \rightarrow U$. Before moving on, let’s catch a quick breath. We have already discussed a number of very important concepts. At the lowest level of the causal hierarchy (association), we have discovered DAGs and $d$-separation as a powerful tool to reason about conditional (in)dependencies between variables. Moving to intervention, the second level of the causal hierarchy, we have satisfied our need to interpret the arrows in a DAG causally. Doing so required strong assumptions, but it allowed us to go beyond seeing and model the outcome of interventions. This hopefully clarified the notion of confounding. 
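
The adjustment formula $p(Y \mid do(X = x)) = \sum_z p(Y \mid X = x, Z = z) \, p(Z = z)$ derived earlier can be checked on a toy example. The Python sketch below assumes a hypothetical binary model in which $Z$ is a common cause of $X$ and $Y$ and $X$ also affects $Y$; all probabilities are made-up numbers and the function names are illustrative only.

```python
# Hypothetical binary model with a confounder Z (all numbers are made up):
# Z -> X, Z -> Y, and X -> Y.
p_z = {0: 0.5, 1: 0.5}            # p(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}   # p(X = 1 | Z = z)

def p_y1_given_xz(x, z):
    """p(Y = 1 | X = x, Z = z)."""
    return 0.1 + 0.2 * x + 0.5 * z

def p_x_given_z(x, z):
    """p(X = x | Z = z)."""
    return p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z]

def observational(x):
    """p(Y = 1 | X = x): weights Z by p(Z = z | X = x)."""
    p_x = sum(p_x_given_z(x, z) * p_z[z] for z in p_z)
    return sum(p_y1_given_xz(x, z) * p_x_given_z(x, z) * p_z[z] / p_x for z in p_z)

def interventional(x):
    """p(Y = 1 | do(X = x)): weights Z by its marginal p(Z = z)."""
    return sum(p_y1_given_xz(x, z) * p_z[z] for z in p_z)

print("seeing:", round(observational(1) - observational(0), 3))    # 0.5
print("doing: ", round(interventional(1) - interventional(0), 3))  # 0.2
```

In this toy model the observational contrast (0.5) overstates the causal contrast (0.2), because people with $X = 1$ also tend to have $Z = 1$; adjusting for $Z$ removes that distortion.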
In particular, collider bias is a type of confounding, which has important practical implications: we should not blindly enter all variables into a regression in order to “control” for them, but think carefully about what the underlying causal DAG could look like. Otherwise, we might induce spurious associations. The concepts from causal inference can help us understand methodological phenomena that have been discussed for decades. In the next section, we apply the concepts we have seen so far to make sense of one such phenomenon: Simpson’s Paradox.

Example Application: Simpson’s Paradox

This section follows the example given in Pearl, Glymour, & Jewell (2016, Ch. 1) with slight modifications. Suppose you observe $N = 700$ patients who either choose to take a drug or not; note that this is not a randomized controlled trial. The table below shows the number of recovered patients split across sex.

1. The content of this blog post is an extended write-up of a one-hour lecture I gave to $3^{\text{rd}}$ year psychology undergraduate students at the University of Amsterdam. You can view the presentation, which includes exercises at the end, here.
2. Messerli (2012) was the first to look at this relationship. The data I present here are somewhat different. I include Nobel Laureates up to 2019, and I use the 2017 chocolate consumption data as reported here. You can download the data set here. To get the data reported by Messerli (2012) into R, you can follow this blogpost.
3. I can recommend this course on causal diagrams by Miguel Hernán to get more intuition for causal graphical models.
4. If the converse implication, that is, the implication from the distribution to the graph holds, we say that the graph is faithful to the distribution. This is an important assumption in causal learning because it allows one to estimate causal relations from conditional independencies in the data. 
↩A brief primer on Variational Inference2019-10-30T12:00:00+00:002019-10-30T12:00:00+00:00https://fabiandablander.com/r/Variational-Inference<link rel="stylesheet" href="../highlight/styles/default.css" />
<script src="../highlight/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script>$('pre.stan code').each(function(i, block) {hljs.highlightBlock(block);});</script>
<p>Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let’s start unpacking!</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>Bayes’ rule states that</p>
\[\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,\]
<p>where $\mathbf{z}$ denotes latent parameters we want to infer and $\mathbf{x}$ denotes data.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup> Bayes’ rule is, in general, difficult to apply because it requires dealing with a potentially high-dimensional integral, the marginal likelihood. Optimization, which involves taking derivatives instead of integrating, is much <a href="https://xkcd.com/2117/">easier</a> and generally faster, and so our goal will be to reframe this integration problem as one of optimization.</p>
<h1 id="variational-objective">Variational objective</h1>
<p>We want to get at the posterior distribution, but instead of sampling we simply try to find a density $q^\star(\mathbf{z})$ from a family of densities $\mathrm{Q}$ that best approximates the posterior distribution:</p>
\[q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,\]
<p>where $\text{KL}(. \lvert \lvert.)$ denotes the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>:</p>
\[\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \mathrm{d}\mathbf{z} \enspace .\]
<p>We cannot compute this Kullback-Leibler divergence because it still depends on the nasty integral $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$. To see this dependency, observe that:</p>
\[\begin{aligned}
\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace ,
\end{aligned}\]
<p>where we have expanded the expectation to more clearly behold our nemesis. In doing so, we have seen that $\text{log } p(\mathbf{x})$ is actually a constant with respect to $q(\mathbf{z})$; this means that we can ignore it in our optimization problem. Moreover, minimizing a quantity means maximizing its negative, and so we maximize the following quantity:</p>
\[\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x}) \right) \\[.5em]
&= -\left(\mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x}) - \text{log } p(\mathbf{x})}_{\text{Nemesis perishes}}\right) \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \enspace .
\end{aligned}\]
<p>We can expand the joint probability to get more insight into this equation:</p>
\[\begin{aligned}
\text{ELBO}(q) &= \underbrace{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z})\right]}_{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]} - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z})}{q(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z})\right) \enspace .
\end{aligned}\]
<p>This is cool. It says that maximizing the ELBO finds an approximate distribution $q(\mathbf{z})$ for latent quantities $\mathbf{z}$ that allows the data to be predicted well, i.e., leads to a high expected log likelihood, but that a penalty is incurred if $q(\mathbf{z})$ strays far away from the prior $p(\mathbf{z})$. This mirrors the usual balance in Bayesian inference between likelihood and prior (Blei, Kucukelbir, & McAuliffe, 2017).</p>
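<p>To see that the two ways of writing the ELBO agree, here is a minimal Python sketch for a latent variable with three possible values, so that all expectations become finite sums. The prior, the likelihood values, and the variational distribution are made-up numbers, and the function names are only illustrative.</p>

```python
import math

prior = [0.5, 0.3, 0.2]           # p(z), hypothetical numbers
likelihood = [0.10, 0.40, 0.70]   # p(x | z) for the one observed x, hypothetical
q = [0.2, 0.3, 0.5]               # some variational distribution q(z)

def expectation(q, values):
    """E_q[values] for a discrete distribution q."""
    return sum(qz * v for qz, v in zip(q, values))

def kl(q, p):
    """KL(q || p) for two discrete distributions with full support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

def elbo_joint_form(q):
    """E_q[log p(z, x)] - E_q[log q(z)]."""
    log_joint = [math.log(lik * pz) for lik, pz in zip(likelihood, prior)]
    log_q = [math.log(qz) for qz in q]
    return expectation(q, log_joint) - expectation(q, log_q)

def elbo_penalty_form(q):
    """E_q[log p(x | z)] - KL(q(z) || p(z))."""
    log_lik = [math.log(lik) for lik in likelihood]
    return expectation(q, log_lik) - kl(q, prior)

print(elbo_joint_form(q), elbo_penalty_form(q))  # the two values coincide
```

<p>Moving probability mass in <code>q</code> towards values of $z$ with a high likelihood raises the first term, while moving it away from the prior raises the penalty.</p>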
<p>ELBO stands for <em>evidence lower bound</em>. The marginal likelihood is sometimes called evidence, and we see that ELBO is indeed a lower bound for the evidence:</p>
\[\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x})\right) \\[.5em]
\text{log } p(\mathbf{x}) &= \text{ELBO}(q) + \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \\[.5em]
\text{log } p(\mathbf{x}) &\geq \text{ELBO}(q) \enspace ,
\end{aligned}\]
<p>since the Kullback-Leibler divergence is non-negative. Heuristically, one might then use the ELBO as a way to select between models. For more on predictive model selection, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this</a> and <a href="https://fabiandablander.com/r/Bayes-Potter.html">this</a> blog post.</p>
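<p>The lower bound can also be verified numerically. The following Python sketch uses a discrete latent variable with made-up prior and likelihood values, so that the evidence is a simple sum; it checks that the gap between $\text{log } p(\mathbf{x})$ and the ELBO equals the Kullback-Leibler divergence to the exact posterior. All names and numbers are assumptions for illustration.</p>

```python
import math
import random

prior = [0.5, 0.3, 0.2]           # p(z), hypothetical numbers
likelihood = [0.10, 0.40, 0.70]   # p(x | z) for the observed x, hypothetical

joint = [lik * pz for lik, pz in zip(likelihood, prior)]   # p(z, x)
evidence = sum(joint)                                      # p(x), a sum instead of an integral
posterior = [j / evidence for j in joint]                  # p(z | x)

def elbo(q):
    """E_q[log p(z, x)] - E_q[log q(z)]."""
    return sum(qz * (math.log(j) - math.log(qz)) for qz, j in zip(q, joint))

def kl(q, p):
    """KL(q || p) for two discrete distributions with full support."""
    return sum(qz * math.log(qz / pz) for qz, pz in zip(q, p))

rng = random.Random(0)
for _ in range(3):
    weights = [rng.random() + 0.01 for _ in joint]
    q = [w / sum(weights) for w in weights]       # a random variational distribution
    gap = math.log(evidence) - elbo(q)            # should match KL(q || posterior)
    print(f"ELBO = {elbo(q):.4f}, gap = {gap:.4f}, KL = {kl(q, posterior):.4f}")

print(f"log p(x) = {math.log(evidence):.4f}, ELBO at posterior = {elbo(posterior):.4f}")
```

<p>The bound is tight exactly when $q(\mathbf{z})$ equals the posterior, which is why maximizing the ELBO and minimizing the divergence are the same problem.</p>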
<h1 id="why-variational">Why variational?</h1>
<p>Our optimization problem is about finding $q^\star(\mathbf{z})$ that best approximates the posterior distribution. This is in contrast to more familiar optimization problems such as maximum likelihood estimation where one wants to find, for example, the <em>single best value</em> that maximizes the log likelihood. For such a problem, one can use standard calculus (see for example <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">this</a> blog post). In our setting, we do not want to find a single best value but rather a <em>single best function</em>. To do this, we can use <em>variational calculus</em> from which variational inference derives its name (Bishop, 2006, p. 462).</p>
<p>A function takes an input value and returns an output value. We can define a <em>functional</em> which takes a whole function and returns an output value. The <em>entropy</em> of a probability distribution is a widely used functional:</p>
\[\text{H}[p] = -\int p(x) \, \text{log } p(x) \, \mathrm{d} x \enspace ,\]
<p>which takes as input the probability distribution $p(x)$ and returns a single value, its entropy. In variational inference, we want to find the function that maximizes the ELBO, which is a functional.</p>
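<p>In code, a functional is simply a function that takes another function as its argument. The Python sketch below approximates the entropy, using the usual convention $\text{H}[p] = -\int p(x) \, \text{log } p(x) \, \mathrm{d}x$, with a midpoint rule and compares it to the known closed form for a Gaussian, $\frac{1}{2} \text{log}(2 \pi e \sigma^2)$. The integration limits and step count are arbitrary choices for illustration.</p>

```python
import math

def entropy(density, lower, upper, steps=20000):
    """Approximate H[p] = -int p(x) log p(x) dx with the midpoint rule."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        p = density(lower + (i + 0.5) * width)
        if p > 0:                      # treat 0 * log 0 as 0
            total -= p * math.log(p) * width
    return total

def gaussian_density(mu, sigma):
    """Return the density function of a Gaussian with mean mu and sd sigma."""
    def density(x):
        return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    return density

for sigma in (1.0, 2.0):
    numeric = entropy(gaussian_density(0.0, sigma), -10 * sigma, 10 * sigma)
    closed_form = 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)
    print(f"sigma = {sigma}: numeric = {numeric:.6f}, closed form = {closed_form:.6f}")
```

<p>Note that <code>entropy</code> receives a whole density function and returns one number, which is the sense in which the ELBO is a functional of $q(\mathbf{z})$.</p>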
<p>In order to make this optimization problem more manageable, we need to constrain the functions in some way. One could, for example, assume that $q(\mathbf{z})$ is a Gaussian distribution with parameter vector $\omega$. The ELBO then becomes a function of $\omega$, and we employ standard optimization methods to solve this problem. Instead of restricting the parametric form of the variational distribution $q(\mathbf{z})$, in the next section we use an independence assumption to manage the inference problem.</p>
<h1 id="mean-field-variational-family">Mean-field variational family</h1>
<p>A frequently used approximation is to assume that the latent variables $z_j$ for $j \in \{1, \ldots, m\}$ are mutually independent, each governed by its own variational density:</p>
\[q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j) \enspace .\]
<p>Note that this <em>mean-field variational family</em> cannot model correlations in the posterior distribution; by construction, the latent parameters are mutually independent. Observe that we do not make any parametric assumption about the individual $q_j(z_j)$. Instead, their parametric form is derived for every particular inference problem.</p>
<p>We start from our definition of the ELBO and apply the mean-field assumption:</p>
\[\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z}\enspace .
\end{aligned}\]
<p>In the following, we optimize the ELBO with respect to a single variational density $q_j(z_j)$ and assume that all others are fixed:</p>
\[\begin{aligned}
\text{ELBO}(q_j) &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j - \underbrace{\int \prod_{i\neq j}^m q_i(z_i) \, \text{log} \prod_{i\neq j}^m q_i(z_i) \, \mathrm{d}\mathbf{z}_{-j}}_{\text{Constant with respect to } q_j(z_j)} \\[.5em]
&\propto \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \left(\int \prod_{i\neq j}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z}_{-j}\right)\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \enspace .
\end{aligned}\]
<p>One could use variational calculus to derive the optimal variational density $q_j^\star(z_j)$; instead, we follow Bishop (2006, p. 465) and define the distribution</p>
\[\text{log } \tilde{p}{(\mathbf{x}, z_j)} = \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathcal{Z} \enspace ,\]
<p>where we need to make sure that it integrates to one by subtracting the (log) normalizing constant $\mathcal{Z}$. With this in mind, observe that:</p>
\[\begin{aligned}
\text{ELBO}(q_j) &\propto \int q_j(z_j) \, \text{log } \tilde{p}{(\mathbf{x}, z_j)} \, \mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \text{log } \frac{\tilde{p}{(\mathbf{x}, z_j)}}{q_j(z_j)} \, \mathrm{d}z_j \\[.5em]
&= -\int q_j(z_j) \, \text{log } \frac{q_j(z_j)}{\tilde{p}{(\mathbf{x}, z_j)}} \, \mathrm{d}z_j \\[.5em]
&= -\text{KL}\left(q_j(z_j) \, \lvert\lvert \, \tilde{p}(\mathbf{x}, z_j)\right) \enspace .
\end{aligned}\]
<p>Thus, maximizing the ELBO with respect to $q_j(z_j)$ amounts to minimizing the Kullback-Leibler divergence between $q_j(z_j)$ and $\tilde{p}(\mathbf{x}, z_j)$; this divergence is zero when the two distributions are equal. Therefore, under the mean-field assumption, the optimal variational density $q_j^\star(z_j)$ is given by:</p>
\[\begin{aligned}
q_j^\star(z_j) &= \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right] - \mathcal{Z}\right) \\[.5em]
&= \frac{\text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right)}{\int \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right) \mathrm{d}z_j} \enspace ,
\end{aligned}\]
<p>see also Bishop (2006, p. 466). This is not an explicit solution, however, since each optimal variational density depends on all others. This calls for an iterative solution in which we first initialize all factors $q_j(z_j)$ and then cycle through them, updating each one conditional on the current values of the others. Such a procedure is known as <em>Coordinate Ascent Variational Inference</em> (CAVI). Further, note that</p>
\[p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) = \frac{p(z_j, \mathbf{z}_{-j}, \mathbf{x})}{p(\mathbf{z}_{-j}, \mathbf{x})} \propto p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \enspace ,\]
<p>which allows us to write the updates in terms of the conditional posterior distribution of $z_j$ given all other factors $\mathbf{z}_{-j}$. This looks <em>a lot</em> like Gibbs sampling, which we discussed in detail in a <a href="https://fabiandablander.com/r/Spike-and-Slab.html">previous</a> blog post. In the next section, we implement CAVI for a simple linear regression problem.</p>
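<p>Before turning to regression, a tiny example may help to see the coordinate-wise updates at work. The Python sketch below runs CAVI on a bivariate Gaussian target with unit variances and correlation $\rho$. The closed-form updates used here are not derived in this post; they are taken from the factorized Gaussian example in Bishop (2006, Ch. 10), where each factor is Gaussian with variance $1 / \Lambda_{jj}$ and $\Lambda$ is the precision matrix of the target. The function name and the chosen numbers are assumptions.</p>

```python
def cavi_bivariate_gaussian(mu, rho, sweeps=100):
    """Mean-field CAVI for a bivariate Gaussian target with unit variances.

    The target has mean `mu` and correlation `rho`; its precision matrix is
    Lambda = 1 / (1 - rho^2) * [[1, -rho], [-rho, 1]].
    Each factor q_j is Gaussian with variance 1 / Lambda_jj.
    """
    lam_diag = 1.0 / (1.0 - rho ** 2)
    lam_off = -rho / (1.0 - rho ** 2)
    m1, m2 = 0.0, 0.0   # arbitrary initialisation of the factor means
    for _ in range(sweeps):
        m1 = mu[0] - lam_off / lam_diag * (m2 - mu[1])   # update q_1 given q_2
        m2 = mu[1] - lam_off / lam_diag * (m1 - mu[0])   # update q_2 given q_1
    variance = 1.0 / lam_diag   # equals 1 - rho^2 for both factors
    return (m1, m2), variance

means, variance = cavi_bivariate_gaussian(mu=(1.0, -2.0), rho=0.8)
print("factor means:   ", means)      # approaches the true mean (1, -2)
print("factor variance:", variance)   # 1 - 0.8^2 = 0.36, below the true marginal variance 1
```

<p>The factor means converge to the true means, but the factor variance $1 - \rho^2$ is smaller than the true marginal variance of one. This illustrates the price of the independence assumption: the mean-field family cannot represent the correlation and ends up too narrow.</p>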
<h1 id="application-linear-regression">Application: Linear regression</h1>
<p>In a <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">previous</a> blog post, we traced the history of least squares and applied it to the most basic problem: fitting a straight line to a number of points. Here, we study the same problem but swap optimization procedure: instead of least squares or maximum likelihood, we use variational inference. Our linear regression setup is:</p>
\[\begin{aligned}
y &\sim \mathcal{N}(\beta x, \sigma^2) \\[.5em]
\beta &\sim \mathcal{N}(0, \sigma^2 \tau^2) \\[.5em]
p(\sigma^2) &\propto \frac{1}{\sigma^2} \enspace ,
\end{aligned}\]
<p>where we assume that the population mean of $y$ is zero (i.e., $\beta_0 = 0$); and we assign the error variance $\sigma^2$ an improper Jeffreys’ prior and $\beta$ a Gaussian prior with variance $\sigma^2\tau^2$. We scale the prior of $\beta$ by the error variance to reason in terms of a standardized effect size $\beta / \sigma$ since with this specification:</p>
\[\text{Var}\left[\frac{\beta}{\sigma}\right] = \frac{1}{\sigma^2} \text{Var}[\beta] = \frac{\sigma^2 \tau^2}{\sigma^2} = \tau^2 \enspace .\]
<p>As a heads up, we have to do a surprising number of calculations to implement variational inference even for this simple problem. In the next section, we start our journey by deriving the variational density for $\sigma^2$.</p>
<h2 id="variational-density-for-sigma2">Variational density for $\sigma^2$</h2>
<p>Our optimal variational density $q^\star(\sigma^2)$ is given by:</p>
\[q^\star(\sigma^2) \propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \enspace .\]
<p>To get started, we need to derive the conditional posterior distribution $p(\sigma^2 \mid \mathbf{y}, \beta)$. We write:</p>
\[\begin{aligned}
p(\sigma^2 \mid \mathbf{y}, \beta) &\propto p(\mathbf{y} \mid \sigma^2, \beta) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= \prod_{i=1}^n (2\pi)^{-\frac{1}{2}} \left(\sigma^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2} \left(y_i - \beta x_i\right)^2\right) \underbrace{(2\pi)^{-\frac{1}{2}} \left(\sigma^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}_{p(\beta)} \underbrace{\left(\sigma^2\right)^{-1}}_{p(\sigma^2)} \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{2\sigma^2} \underbrace{\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)}_{A}\right) \enspace ,
\end{aligned}\]
<p>which is proportional to an inverse Gamma distribution with shape $\frac{n + 1}{2}$ and scale $\frac{1}{2}A$. Moving on, we exploit the linearity of the expectation and write:</p>
\[\begin{aligned}
q^\star(\sigma^2) &\propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1}\right] - \mathbb{E}_{q(\beta)}\left[\frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \\[.5em]
&= \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \enspace .
\end{aligned}\]
<p>This, too, looks like an inverse Gamma distribution! Plugging in the normalizing constant, we arrive at:</p>
\[q^\star(\sigma^2)= \frac{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \underbrace{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]}_{\nu}\right) \enspace .\]
<p>Note that this quantity depends on the variational density $q(\beta)$ through the expectation $\nu$. In the next section, we derive the variational density for $\beta$.</p>
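<p>As a sketch of how this update could look in code, the Python function below computes the shape $\frac{n + 1}{2}$ and the scale $\nu = \mathbb{E}_{q(\beta)}\left[\frac{1}{2}A\right]$ of $q^\star(\sigma^2)$. It only assumes that $q(\beta)$ has a known mean and variance, so that $\mathbb{E}[\beta^2]$ is the squared mean plus the variance; expanding $A$ then gives $\mathbb{E}[A] = \sum_i y_i^2 - 2 \, \mathbb{E}[\beta] \sum_i x_i y_i + \mathbb{E}[\beta^2] \left(\sum_i x_i^2 + \frac{1}{\tau^2}\right)$. The function name, argument names, and the small data set are hypothetical.</p>

```python
def q_sigma2_parameters(x, y, tau2, beta_mean, beta_var):
    """Shape and scale of the inverse Gamma density q*(sigma^2).

    Assumes q(beta) has mean `beta_mean` and variance `beta_var`, so that
    E[beta] = beta_mean and E[beta^2] = beta_mean^2 + beta_var.
    """
    n = len(y)
    sum_yy = sum(yi * yi for yi in y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    e_beta2 = beta_mean ** 2 + beta_var
    # E_q[A] with A = sum_i (y_i - beta x_i)^2 + beta^2 / tau^2
    expected_a = sum_yy - 2 * beta_mean * sum_xy + e_beta2 * (sum_xx + 1 / tau2)
    shape = (n + 1) / 2
    scale = expected_a / 2   # this is nu in the text
    return shape, scale

# Made-up data for illustration
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 2.0]
print(q_sigma2_parameters(x, y, tau2=1.0, beta_mean=0.5, beta_var=0.1))  # (2.0, 1.625)
```

<p>Uncertainty about $\beta$ enters only through <code>beta_var</code>: a larger variance of $q(\beta)$ increases $\nu$ and thus shifts $q^\star(\sigma^2)$ towards larger error variances.</p>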
<h2 id="variational-density-for-beta">Variational density for $\beta$</h2>
<p>Our optimal variational density $q^\star(\beta)$ is given by:</p>
\[q^\star(\beta) \propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \enspace ,\]
<p>and so we again have to derive the conditional posterior distribution $p(\beta \mid \mathbf{y}, \sigma^2)$. We write:</p>
\[\begin{aligned}
p(\beta \mid \mathbf{y}, \sigma^2) &\propto p(\mathbf{y} \mid \beta, \sigma^2) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^ny_i^2- 2 \beta \sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{1}{2\sigma^2} \left( \beta^2 \left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) - 2 \beta \sum_{i=1}^n y_i x_i\right)\right) \\[.5em]
&=\text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta^2 - 2 \beta \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned}\]
<p>where we have “completed the square” (see also <a href="https://fabiandablander.com/statistics/Two-Properties.html">this</a> blog post) and realized that the conditional posterior is Gaussian. We continue by taking expectations:</p>
\[\begin{aligned}
q^\star(\beta) &\propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right]\right) \\[.5em]
&= \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned}\]
<p>which is again proportional to a Gaussian distribution! Plugging in the normalizing constant yields:</p>
\[q^\star(\beta) = \left(2\pi\underbrace{\frac{\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}}_{\sigma^2_{\beta}}\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left(\beta - \underbrace{\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}}_{\mu_{\beta}}\right)^2\right) \enspace ,\]
<p>Note that while the variance of this distribution, $\sigma^2_\beta$, depends on $q(\sigma^2)$, its mean $\mu_\beta$ does not.</p>
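<p>As a quick sanity check on the algebra, the following sketch compares the unnormalized log density in $\beta$ with the Gaussian log kernel implied by $\mu_\beta$ and $\sigma^2_\beta$. The data are simulated, and the value plugged in for $\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]$ is an arbitrary placeholder; neither comes from the analysis in this post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Simulated data and a placeholder expectation (assumptions for illustration only)
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
tau2 <- 1
E_inv_sigma2 <- 1.2  # stands in for E_q[1 / sigma^2]

# Parameters of the Gaussian variational density q(beta)
S <- sum(x^2) + 1 / tau2
beta_mu <- sum(x * y) / S
beta_var <- 1 / (E_inv_sigma2 * S)

# Unnormalized log density in beta before completing the square
log_kernel <- function(beta) {
  -0.5 * E_inv_sigma2 * (sum((y - beta * x)^2) + beta^2 / tau2)
}

# Up to an additive constant, this should equal the Gaussian log kernel
grid <- seq(-1, 2, length.out = 7)
lhs <- sapply(grid, log_kernel) - log_kernel(beta_mu)
rhs <- -(grid - beta_mu)^2 / (2 * beta_var)
max_abs_diff <- max(abs(lhs - rhs))
</code></pre></figure>
<p>If the square has been completed correctly, <code>max_abs_diff</code> is zero up to floating point error.</p>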
<p>To recap, instead of assuming a parametric form for the variational densities, we have derived the optimal densities under the mean-field assumption, that is, under the assumption that the parameters are independent: $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$. Assigning $\beta$ a Gaussian prior and $\sigma^2$ Jeffreys’s prior, we have found that the variational density for $\sigma^2$ is an inverse Gamma distribution and that the variational density for $\beta$ is a Gaussian distribution. We noted that these variational densities depend on each other. However, this is not the end of the manipulation of symbols; both distributions still feature an expectation we need to remove. In the next section, we expand the remaining expectations.</p>
<h2 id="removing-expectations">Removing expectations</h2>
<p>Now that we know the parametric form of both variational densities, we can expand the terms that involve an expectation. In particular, to remove the expectation in the variational density for $\sigma^2$, we write:</p>
\[\begin{aligned}
\mathbb{E}_{q(\beta)}\left[A \right] &= \mathbb{E}_{q(\beta)}\left[\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right] \\[.5em]
&= \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mathbb{E}_{q(\beta)}\left[\beta\right] + \sum_{i=1}^n x_i^2 \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] + \frac{1}{\tau^2} \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] \enspace .
\end{aligned}\]
<p>Noting that $\mathbb{E}_{q(\beta)}[\beta] = \mu_{\beta}$ and using the fact that:</p>
\[\mathbb{E}_{q(\beta)}[\beta^2] = \text{Var}_{q(\beta)}\left[\beta\right] + \mathbb{E}_{q(\beta)}[\beta]^2
= \sigma^2_{\beta} + \mu_{\beta}^2 \enspace ,\]
<p>the expectation becomes:</p>
\[\mathbb{E}_{q(\beta)}\left[A\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) \enspace .\]
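<p>The closed form can be checked numerically. The sketch below integrates $A$ against a Gaussian $q(\beta)$ and compares the result with the expression above; the data and the values of $\mu_\beta$ and $\sigma_\beta$ are hypothetical and chosen only for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Simulated data and hypothetical variational parameters
set.seed(2)
n <- 30
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)
tau2 <- 1
beta_mu <- 0.25
beta_sd <- 0.15

# A as a function of beta, vectorized so that integrate() can use it
A <- function(beta) sapply(beta, function(b) sum((y - b * x)^2) + b^2 / tau2)

# Numerical expectation over a range that covers essentially all mass of q(beta)
numeric_EA <- integrate(
  function(b) dnorm(b, beta_mu, beta_sd) * A(b),
  beta_mu - 10 * beta_sd, beta_mu + 10 * beta_sd
)$value

# Closed-form expectation from the derivation
closed_EA <- sum(y^2) - 2 * sum(x * y) * beta_mu +
  (beta_sd^2 + beta_mu^2) * (sum(x^2) + 1 / tau2)
</code></pre></figure>
<p>The two numbers should agree up to the tolerance of the numerical integration.</p>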
<p>For the expectation which features in the variational distribution for $\beta$, things are slightly less elaborate, although the result also looks unwieldy. We write:</p>
\[\begin{aligned}
\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] &= \int \frac{1}{\sigma^2}\frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2\\[0.50em]
&= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \int \left(\sigma^2\right)^{-\left(\frac{n + 1}{2} + 1\right) - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2 \\[0.50em]
&= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \frac{\Gamma\left(\frac{n + 1}{2} + 1\right)}{\nu^{\frac{n + 1}{2} + 1}} \\[0.50em]
&= \frac{n + 1}{2} \left(\frac{1}{2}\mathbb{E}_{q(\beta)}\left[A \right]\right)^{-1} \\[.5em]
&= \frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \enspace .
\end{aligned}\]
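<p>The fourth line of this derivation states that $\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] = \frac{n + 1}{2} \nu^{-1}$. A minimal numerical check, with hypothetical values for $n$ and $\nu$ and the inverse Gamma density written out by hand in base R, could look as follows.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical values, chosen only for illustration
n <- 20
nu <- 7.5
a <- (n + 1) / 2  # shape of the inverse Gamma density

# q(sigma^2), evaluated on the log scale for numerical stability
q_sigma2 <- function(s2) exp(a * log(nu) - lgamma(a) - (a + 1) * log(s2) - nu / s2)

# The density integrates to one, and E[1 / sigma^2] equals a / nu
total_mass <- integrate(q_sigma2, 0, Inf)$value
numeric_E_inv <- integrate(function(s2) q_sigma2(s2) / s2, 0, Inf)$value
closed_E_inv <- a / nu
</code></pre></figure>
<p>Note that the integration here runs over $\sigma^2$ itself, so no change of variables is needed.</p>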
<h2 id="monitoring-convergence">Monitoring convergence</h2>
<p>The algorithm works by first specifying initial values for the parameters of the variational densities and then iteratively updating them until the ELBO does not change anymore. This requires us to compute the ELBO, which we still need to derive, on each update. We write:</p>
\[\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y}, \beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2) \right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2)\right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \underbrace{\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right]}_{-\text{KL}\left(q(\beta, \sigma^2) \, \lvert\lvert \, p(\beta, \sigma^2)\right)}\enspace .
\end{aligned}\]
<p>Let’s take a deep breath and tackle the second term first:</p>
\[\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{p(\beta \mid \sigma^2)}{q(\beta)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\left(2\pi\sigma^2\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}{\left(2\pi\sigma^2_\beta\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2_\beta} (\beta - \mu_\beta)^2\right)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[-\frac{1}{2}\text{log } \frac{\sigma^2\tau^2}{\sigma^2_\beta} - \frac{\beta^2}{2\sigma^2\tau^2} + \frac{(\beta - \mu_\beta)^2}{2\sigma^2_\beta}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[-\frac{1}{2}\text{log }\frac{\sigma^2\tau^2}{\sigma^2_\beta} - \frac{\sigma^2_\beta + \mu_\beta^2}{2\sigma^2\tau^2} + \frac{1}{2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= -\frac{1}{2}\text{log }\frac{\tau^2}{\sigma^2_\beta} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= -\frac{1}{2}\text{log }\frac{\tau^2}{\sigma^2_\beta} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right] - \mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]\enspace .
\end{aligned}\]
<p>Note that there are three expectations left that we have not yet evaluated. However, we really deserve a break, and so instead of analytically deriving the expectations we compute $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$ and $\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right]$ numerically using Gaussian quadrature. This fails for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$, which we compute using Monte Carlo integration:</p>
<!-- We proceed by expanding the last expectation: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{\sigma^{-2}}{\frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right)}\right] \\[.5em] -->
<!-- &= \text{log }\frac{\Gamma\left(\frac{n + 1}{2}\right)}{\nu^{\frac{n + 1}{2}}} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{1}{\left(\sigma^2\right)^{-\frac{n + 1}{2}}} - \frac{\sigma^2}{\nu}\right] \\[.5em] -->
<!-- &= \text{log }\frac{\Gamma\left(\frac{n + 1}{2}\right)}{\nu^{\frac{n + 1}{2}}} + \left(\frac{n + 1}{2}\right)\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right] - \frac{1}{\nu}\mathbb{E}_{q(\sigma^2)}\left[\sigma^2\right] \\[.5em] -->
<!-- &= \text{log }\frac{\Gamma\left(\frac{n + 1}{2}\right)}{\nu^{\frac{n + 1}{2}}} + \left(\frac{n + 1}{2}\right)\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right] - \frac{1}{\nu} \frac{\nu}{\frac{n + 1}{2} - 1} \\[.5em] -->
<!-- &= \text{log }\frac{\Gamma\left(\frac{n + 1}{2}\right)}{\nu^{\frac{n + 1}{2}}} + \left(\frac{n + 1}{2}\right)\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right] - \frac{1}{\frac{n + 1}{2} - 1} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
\[\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right] = \int q(\sigma^2) \, \text{log } q(\sigma^2) \, \mathrm{d}\sigma^2 \approx \frac{1}{N} \sum_{i=1}^N \text{log } q(\sigma^2_i) \enspace , \quad \sigma^2_i \, \sim \, q(\sigma^2) \enspace .\]
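<p>To illustrate the Monte Carlo approximation, the sketch below draws from an inverse Gamma density using base R and averages the log density over the draws. As a cross-check it uses the standard expression for the entropy of an inverse Gamma distribution, which is not derived in this post; the values of $n$ and $\nu$ are hypothetical.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical values, chosen only for illustration
set.seed(3)
n <- 20
nu <- 7.5
a <- (n + 1) / 2

# Log density of the inverse Gamma distribution with shape a and scale nu
log_q <- function(s2) a * log(nu) - lgamma(a) - (a + 1) * log(s2) - nu / s2

# If G ~ Gamma(shape = a, rate = nu), then 1 / G follows q(sigma^2)
N <- 1e5
sigma2 <- 1 / rgamma(N, shape = a, rate = nu)
mc_estimate <- mean(log_q(sigma2))

# Negative entropy of the inverse Gamma distribution (standard result)
closed_form <- -(a + log(nu) + lgamma(a) - (1 + a) * digamma(a))
</code></pre></figure>
<p>With $N = 10^5$ draws the two values should differ only by Monte Carlo error.</p>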
<p>We are left with the expected log likelihood. Instead of filling this blog post with more equations, we again resort to numerical methods. However, we refactor the expression so that numerical integration is more efficient:</p>
\[\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] &= \int \int q(\beta) \, q(\sigma^2) \, \text{log } p(\mathbf{y} \mid \beta, \sigma^2) \, \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em]
&=\int q(\beta) \int q(\sigma^2) \, \text{log} \left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\text{exp}\left(-\frac{1}{2\sigma^2}
\sum_{i=1}^n (y_i - x_i\beta)^2\right)\right) \, \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em]
&= -\frac{n}{2} \text{log}\left(2\pi\right) - \frac{n}{2} \mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right] - \frac{1}{2} \mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] \mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] \enspace .
\end{aligned}\]
<p>Since we have solved a similar problem already above, we evaluate the expectation with respect to $q(\beta)$ analytically:</p>
\[\mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right) \enspace .\]
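<p>Since $\text{log } p(\mathbf{y} \mid \beta, \sigma^2) = -\frac{n}{2}\text{log}\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i\beta)^2$ and $q$ factorizes, the expected log likelihood combines $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$, $\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]$, and the expectation just computed. The sketch below checks this by simulation. The data and variational parameters are hypothetical, and the expression $\text{log } \nu - \psi\left(\frac{n + 1}{2}\right)$ for the expected log of an inverse Gamma variable (with $\psi$ the digamma function) is a standard result rather than something derived in this post.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Simulated data and hypothetical variational parameters
set.seed(4)
n <- 30
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)
beta_mu <- 0.25
beta_sd <- 0.15
nu <- 14
a <- (n + 1) / 2

# Pieces of the expected log likelihood
E_rss <- sum(y^2) - 2 * sum(x * y) * beta_mu + (beta_sd^2 + beta_mu^2) * sum(x^2)
E_log_sigma2 <- log(nu) - digamma(a)
E_inv_sigma2 <- a / nu
closed_form <- -n / 2 * log(2 * pi) - n / 2 * E_log_sigma2 - 0.5 * E_inv_sigma2 * E_rss

# Monte Carlo estimate using independent draws from q(beta) and q(sigma^2)
N <- 1e5
beta <- rnorm(N, beta_mu, beta_sd)
sigma2 <- 1 / rgamma(N, shape = a, rate = nu)
rss <- sum(y^2) - 2 * beta * sum(x * y) + beta^2 * sum(x^2)
mc_estimate <- mean(-n / 2 * log(2 * pi * sigma2) - rss / (2 * sigma2))
</code></pre></figure>
<p>Drawing $\beta$ and $\sigma^2$ independently mirrors the mean-field factorization $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$.</p>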
<!-- Piecing it all together, the ELBO is given by: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{ELBO}(\mu_\beta, \sigma_\beta^2, \tau^2, \tau^2) &= \frac{n}{4} \text{log}\left(2\pi\right)\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right)\right)\mathbb{E}_{q(\sigma^2)}\left[\text{log} \left(\sigma^2\right)\frac{1}{\sigma^2}\right]\\[.5em] -->
<!-- &+ \text{log}\frac{\tau^2}{\sigma^2_\beta}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] + \frac{\sigma^2_\beta + \mu_\beta^2}{\tau^2}\frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \\[.5em] -->
<!-- &+ \text{log }\frac{\Gamma\left(\frac{n + 1}{2}\right)}{\nu^{\frac{n + 1}{2}}} + \left(\frac{n + 1}{2}\right)\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right] - \frac{1}{\frac{n + 1}{2} - 1} \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<p>In the next section, we implement the algorithm for our linear regression problem in R.</p>
<h1 id="implementation-in-r">Implementation in R</h1>
<p>Now that we have derived the optimal densities, we know how they are parameterized. Therefore, the ELBO is a function of these variational parameters and the parameters of the priors, which in our case is just $\tau^2$. We write a function that computes the ELBO:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'MCMCpack'</span><span class="p">)</span><span class="w">
</span><span class="cd">#' Computes the ELBO for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
</span><span class="cd">#' @param beta_mu mean of the variational density for \beta</span><span class="w">
</span><span class="cd">#' @param beta_sd standard deviation of the variational density for \beta</span><span class="w">
</span><span class="cd">#' @param nu parameter of the variational density for \sigma^2</span><span class="w">
</span><span class="cd">#' @param tau2 prior variance for the standardized effect size</span><span class="w">
</span><span class="cd">#' @param nr_samples number of samples for the Monte Carlo integration</span><span class="w">
</span><span class="cd">#' @returns ELBO</span><span class="w">
</span><span class="n">compute_elbo</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e4</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\beta)</span><span class="w">
</span><span class="n">E_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="o">-</span><span class="kc">Inf</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\sigma^2)</span><span class="w">
</span><span class="n">E_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># q is a density over sigma^2, so integrating over sigma requires the Jacobian 2 * sigma</span><span class="w">
</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\sigma^2)</span><span class="w">
</span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\beta \mid \sigma^2)</span><span class="w">
</span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="o">-</span><span class="m">0.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="o">*</span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">0.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">-</span><span class="w">
</span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># Compute expectations of the log variational densities q(\beta)</span><span class="w">
</span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_beta</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">log</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="c1"># E_log_q_sigma2 <- E_q_sigma2(function(x) log(dinvgamma(x, (n + 1)/2, nu))) # fails</span><span class="w">
</span><span class="c1"># Compute expectations of the log variational densities q(\sigma^2)</span><span class="w">
</span><span class="n">sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rinvgamma</span><span class="p">(</span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)))</span><span class="w">
</span><span class="c1"># Compute the expected log likelihood</span><span class="w">
</span><span class="n">E_log_y_b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="n">sum_x2</span><span class="w">
</span><span class="n">E_log_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">E_inv_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="o">-</span><span class="n">n</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_sigma2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">0.5</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_inv_sigma2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_y_b</span><span class="w">
</span><span class="c1"># Compute and return the ELBO</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_log_y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_sigma2</span><span class="w">
</span><span class="n">ELBO</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>We now implement coordinate ascent mean-field variational inference (CAVI) for our simple linear regression problem. Recall that the variational parameters are:</p>
\[\begin{aligned}
\nu &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right) \\[.5em]
\mu_\beta &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\sigma^2_\beta &= \frac{\left(\frac{n + 1}{2}\right) \nu^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \enspace .
\end{aligned}\]
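<p>Because $\mu_\beta$ is fixed, only $\nu$ and $\sigma^2_\beta$ need to be iterated. The sketch below runs these two updates on simulated data and stops when $\nu$ no longer changes, which is a simplification: it monitors the parameter rather than the ELBO. Substituting the update for $\sigma^2_\beta$ into the update for $\nu$ gives $\nu = c + \frac{a}{2\nu}$ with $a = \frac{n + 1}{2}$ and $c = \frac{1}{2}\left(\sum_{i=1}^n y_i^2 - \mu_\beta \sum_{i=1}^n y_i x_i\right)$, so the fixed point can also be written down directly; $a$, $c$, and $S$ are shorthand introduced here, and all numerical values are hypothetical.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Simulated data and a hypothetical prior variance
set.seed(5)
n <- 100
x <- rnorm(n)
y <- 0.4 * x + rnorm(n)
tau2 <- 0.25

a <- (n + 1) / 2
S <- sum(x^2) + 1 / tau2
beta_mu <- sum(x * y) / S  # closed form, not updated below

# Alternate the updates for sigma^2_beta and nu until nu stabilizes
nu <- 5  # arbitrary starting value
for (iter in 1:100) {
  nu_old <- nu
  beta_var <- (a / nu) / S
  nu <- 0.5 * (sum(y^2) - 2 * sum(x * y) * beta_mu + (beta_var + beta_mu^2) * S)
  if (abs(nu - nu_old) < 1e-10) break
}
beta_var <- (a / nu) / S

# Positive root of the quadratic nu^2 - c * nu - a / 2 = 0
c0 <- 0.5 * (sum(y^2) - beta_mu * sum(x * y))
nu_closed <- (c0 + sqrt(c0^2 + 2 * a)) / 2
</code></pre></figure>
<p>On data like these the loop settles within a handful of iterations, and <code>nu</code> agrees with <code>nu_closed</code>.</p>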
<p>The following function implements the iterative updating of these variational parameters until the ELBO has converged.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Implements CAVI for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
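</span><span class="cd">#' @param tau2 prior variance for the standardized effect size</span><span class="w">
</span><span class="cd">#' @param nr_samples number of samples for the Monte Carlo integration in the ELBO</span><span class="w">
</span><span class="cd">#' @param epsilon convergence tolerance</span><span class="w">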
</span><span class="cd">#' @returns parameters for the variational densities and ELBO</span><span class="w">
</span><span class="n">lmcavi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e5</span><span class="p">,</span><span class="w"> </span><span class="n">epsilon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e-2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># beta_mu has a closed form that does not depend on q(\sigma^2), so it is not updated iteratively</span><span class="w">
</span><span class="n">beta_mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_yx</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_mu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">beta_mu</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">has_converged</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">epsilon</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="c1"># while the ELBO has not converged</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">has_converged</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]][</span><span class="n">j</span><span class="p">],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">nu_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="n">beta_sd_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="c1"># used in the update of beta_sd and nu</span><span class="w">
</span><span class="n">E_qA</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd_prev</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="c1"># update the variational parameters for sigma2 and beta</span><span class="w">
</span><span class="n">nu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_qA</span><span class="w">
</span><span class="n">beta_sd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(((</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">E_qA</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">))</span><span class="w">
</span><span class="c1"># update results object</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]],</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]],</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute new ELBO</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Let’s run this on a simulated data set of size $n = 100$ with a true coefficient of $\beta = 0.30$ and a true error variance of $\sigma^2 = 1$. We assign $\beta$ a Gaussian prior with variance $\tau^2 = 0.25$ so that values of $\lvert \beta \rvert$ smaller than one prior standard deviation ($0.50$) <a href="https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule">receive about $0.68$</a> prior probability.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gen_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="o">*</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gen_dat</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmcavi</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## $nu
## [1] 5.00000 88.17995 45.93875 46.20205 46.19892 46.19895
##
## $beta_mu
## [1] 0.2800556
##
## $beta_sd
## [1] 1.00000000 0.08205605 0.11368572 0.11336132 0.11336517 0.11336512
##
## $ELBO
## [1] 0.0000 -297980.0495 493.4807 -281.4578 -265.1289 -265.3197</code></pre></figure>
<p>From the output, we see that the ELBO and the variational parameters have converged. In the next section, we compare these results to results obtained with Stan.</p>
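<p>Two quick checks of the numbers above can be sketched in R. The snippet is illustrative and not part of the original analysis: it copies the traces of <code>nu</code> and <code>beta_sd</code> from the printed output and looks at the absolute change between successive iterations, and it computes the mass that a Gaussian prior with standard deviation $0.50$ (taking $\sigma = 1$) places on values of $\lvert \beta \rvert$ below $0.50$. The names <code>nu_trace</code>, <code>beta_sd_trace</code>, and <code>prior_mass</code> are ours.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Traces copied from the printed output of lmcavi() above
nu_trace <- c(5.00000, 88.17995, 45.93875, 46.20205, 46.19892, 46.19895)
beta_sd_trace <- c(1.00000000, 0.08205605, 0.11368572,
                   0.11336132, 0.11336517, 0.11336512)

# Absolute change between successive iterations; small final values
# suggest that the coordinate updates have settled
nu_change <- abs(diff(nu_trace))
beta_sd_change <- abs(diff(beta_sd_trace))
print(nu_change)
print(beta_sd_change)

# Prior mass on |beta| below one prior standard deviation (0.50)
prior_mass <- pnorm(0.50, mean = 0, sd = 0.50) - pnorm(-0.50, mean = 0, sd = 0.50)
print(round(prior_mass, 2))</code></pre></figure>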
<h2 id="comparison-with-stan">Comparison with Stan</h2>
<p>Whenever one goes down a rabbit hole of calculations, it is good to sanity check one’s results. Here, we use Stan’s variational inference scheme to check whether our results are comparable. It assumes a Gaussian variational density for each parameter after transforming them to the real line and automates inference in a “black-box” way so that no problem-specific calculations are required (see Kucukelbir, Ranganath, Gelman, & Blei, 2015). Subsequently, we compare our results to the exact posteriors arrived at by Markov chain Monte Carlo. The simple linear regression model in Stan is:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">data {
int<lower=0> n;
vector[n] y;
vector[n] x;
real tau;
}
parameters {
real b;
real<lower=0> sigma;
}
model {
target += -log(sigma);
target += normal_lpdf(b | 0, sigma*tau);
target += normal_lpdf(y | b*x, sigma);
}</code></pre></figure>
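<p>To make explicit what the three <code>target +=</code> statements add up to, here is a minimal R sketch of the same unnormalized log posterior. The function name <code>log_joint</code> is hypothetical and does not appear in the original post; the sketch assumes, as in the Stan code, that <code>tau</code> is a standard deviation multiplier rather than a variance.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Unnormalized log posterior mirroring the Stan model block:
#   -log(sigma)                      prior on sigma, proportional to 1 / sigma
#   normal_lpdf(b | 0, sigma * tau)  Gaussian prior on b
#   normal_lpdf(y | b * x, sigma)    Gaussian likelihood
log_joint <- function(b, sigma, y, x, tau) {
  stopifnot(sigma > 0, tau > 0, length(y) == length(x))
  -log(sigma) +
    dnorm(b, mean = 0, sd = sigma * tau, log = TRUE) +
    sum(dnorm(y, mean = b * x, sd = sigma, log = TRUE))
}

# Evaluate at the data-generating values on data simulated as in the post
set.seed(1)
x <- rnorm(100)
y <- 0.30 * x + rnorm(100)
log_joint(b = 0.30, sigma = 1, y = y, x = x, tau = 0.50)</code></pre></figure>
<p>Both variational inference and Markov chain Monte Carlo in Stan work with this quantity; they differ only in how they turn it into an approximation of the posterior.</p>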
<p>We use Stan’s black-box variational inference scheme:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rstan'</span><span class="p">)</span><span class="w">
</span><span class="c1"># save the above model to a file and compile it</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stan_model</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'stan-compiled/variational-regression.stan'</span><span class="p">)</span><span class="w">
</span><span class="n">stan_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">dat</span><span class="p">),</span><span class="w"> </span><span class="s1">'x'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vb</span><span class="p">(</span><span class="w">
</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">output_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20000</span><span class="p">,</span><span class="w"> </span><span class="n">adapt_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w">
</span><span class="n">init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'b'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="s1">'sigma'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>This gives estimates similar to ours:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Inference for Stan model: variational-regression.
## 1 chains, each with iter=20000; warmup=0; thin=1;
## post-warmup draws per chain=20000, total post-warmup draws=20000.
##
## mean sd 2.5% 25% 50% 75% 97.5%
## b 0.28 0.13 0.02 0.19 0.28 0.37 0.54
## sigma 0.99 0.09 0.82 0.92 0.99 1.05 1.18
## lp__ 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## Approximate samples were drawn using VB(meanfield) at Thu Mar 19 10:45:28 2020.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## We recommend genuine 'sampling' from the posterior distribution for final inferences!</code></pre></figure>
<p>Their recommendation is prudent. If you run the code with different seeds, you can get quite different results. For example, the posterior mean of $\beta$ can range from $0.12$ to $0.45$, and the posterior standard deviation can be as low as $0.03$; in all these settings, Stan indicates that the ELBO has converged, but it seems that it has converged to a different local optimum for each run. (For seed = 3, Stan gives completely nonsensical results). Stan warns that the algorithm is experimental and may be unstable, and it is probably wise to not use it in production.</p>
<p><em>Update</em>: As Ben Goodrich points out in the comments, there is some cool work on providing diagnostics for variational inference; see <a href="https://statmodeling.stat.columbia.edu/2018/06/27/yes-work-evaluating-variational-inference/">this</a> blog post and the paper by Yao, Vehtari, Simpson, & Gelman (<a href="https://arxiv.org/abs/1802.02538">2018</a>) as well as the paper by Huggins, Kasprzak, Campbell, & Broderick (<a href="https://arxiv.org/abs/1910.04102">2019</a>).</p>
<p>Although the posterior distribution for $\beta$ and $\sigma^2$ is available in closed-form (see the <em>Post Scriptum</em>), we check our results against exact inference using Markov chain Monte Carlo by visual inspection.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<p>The figure below overlays our closed-form results on the histogram of posterior samples obtained using Stan.</p>
<p><img src="/assets/img/2019-10-30-Variational-Inference.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>Note that the posterior variance of $\beta$ is slightly <em>overestimated</em> when using our variational scheme. This is in contrast to the fact that variational inference generally <em>underestimates</em> variances. Note also that Bayesian inference using Markov chain Monte Carlo is very fast on this simple problem. However, the comparative advantage of variational inference becomes clear by increasing the sample size: for sample sizes as large as $n = 100000$, our variational inference scheme takes less than a tenth of a second!</p>
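<p>One reason the scheme scales well is visible in <code>lmcavi</code>: the data enter the parameter updates only through three sums that are computed once. The sketch below times that single pass for $n = 100000$. It is an illustration under assumptions, not a benchmark of the full function: it does not time the ELBO evaluation in <code>compute_elbo</code>, the variable names are ours, and timings differ across machines, so no output is shown.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 1e5
x <- rnorm(n)
y <- 0.30 * x + rnorm(n)
tau2 <- 0.25

# One pass over the data yields everything the parameter updates need
timing <- system.time({
  sum_y2 <- sum(y^2)
  sum_x2 <- sum(x^2)
  sum_yx <- sum(x * y)
})

# Same expression as beta_mu in lmcavi
beta_mu <- sum_yx / (sum_x2 + 1 / tau2)
print(timing[["elapsed"]])
print(beta_mu)</code></pre></figure>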
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen how to turn an integration problem into an optimization problem using variational inference. Assuming that the variational densities are independent, we have derived the optimal variational densities for a simple linear regression problem with one predictor. While using variational inference for this problem is unnecessary since everything is available in closed-form, I have focused on such a simple problem so as to not confound this introduction to variational inference by the complexity of the model. Still, the derivations were quite lengthy. They were also entirely specific to our particular problem, and thus generic “black-box” algorithms which avoid problem-specific calculations hold great promise.</p>
<p>We also implemented coordinate ascent mean-field variational inference (CAVI) in R and compared our results to results obtained via variational and exact inference using Stan. We have found that one probably should not trust Stan’s variational inference implementation, and that our results closely correspond to the exact procedure. For more on variational inference, I recommend the excellent review article by Blei, Kucukelbir, and McAuliffe (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>).</p>
<hr />
<p><em>I would like to thank Don van den Bergh for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="normal-inverse-gamma-distribution">Normal-inverse-gamma Distribution</h3>
<p>The posterior distribution is a <a href="https://en.wikipedia.org/wiki/Normal-inverse-gamma_distribution">Normal-inverse-gamma distribution</a>:</p>
\[p(\beta, \sigma^2 \mid \mathbf{y}) = \frac{\gamma^{\alpha}}{\Gamma\left(\alpha\right)} \left(\sigma^2\right)^{-\alpha - 1} \text{exp}\left(-\frac{2\gamma + \lambda\left(\beta - \mu\right)^2}{2\sigma^2}\right) \enspace ,\]
<p>where</p>
\[\begin{aligned}
\mu &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\lambda &= \sum_{i=1}^n x_i^2 + \frac{1}{\tau^2} \\[.5em]
\alpha &= \frac{n + 1}{2} \\[.5em]
\gamma &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i x_i\right)^2}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}\right) \enspace .
\end{aligned}\]
<p>Note that the marginal posterior distribution for $\beta$ is actually a Student-t distribution, contrary to what we assume in our variational inference scheme.</p>
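<p>The four parameters are simple functions of the data and can be computed directly. The following is a sketch under assumptions rather than the code used for the figure: it regenerates the simulated data with the same seed as in the post and uses <code>sum(x^2)</code> for the sum over squared predictors, matching <code>sum_x2</code> in <code>lmcavi</code>. The function name <code>nig_params</code> is hypothetical.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Closed-form parameters of the Normal-inverse-gamma posterior
nig_params <- function(y, x, tau2) {
  n <- length(y)
  lambda <- sum(x^2) + 1 / tau2
  mu <- sum(x * y) / lambda
  alpha <- (n + 1) / 2
  gam <- 0.5 * (sum(y^2) - sum(x * y)^2 / lambda)
  list(mu = mu, lambda = lambda, alpha = alpha, gamma = gam)
}

# Simulate data as in gen_dat(100, 0.30, 1) with the same seed
set.seed(1)
x <- rnorm(100)
y <- 0 + 0.30 * x + rnorm(100, 0, 1)
post <- nig_params(y, x, tau2 = 0.50^2)
str(post)</code></pre></figure>
<p>Because <code>mu</code> uses the same expression as <code>beta_mu</code> in <code>lmcavi</code>, the two coincide for the same data.</p>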
<h2 id="references">References</h2>
<ul>
<li>Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>). Variational inference: A review for statisticians. <em>Journal of the American Statistical Association, 112</em>(518), 859-877.</li>
<li>Huggins, J. H., Kasprzak, M., Campbell, T., & Broderick, T. (<a href="https://arxiv.org/abs/1910.04102">2019</a>). Practical Posterior Error Bounds from Variational Objectives. <em>arXiv preprint</em> arXiv:1910.04102.</li>
<li>Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (<a href="http://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan">2015</a>). Automatic variational inference in Stan. In <em>Advances in Neural Information Processing Systems</em> (pp. 568-576).</li>
<li>Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (<a href="http://www.jmlr.org/papers/volume18/16-107/16-107.pdf">2017</a>). Automatic differentiation variational inference. <em>The Journal of Machine Learning Research, 18</em>(1), 430-474.</li>
<li>Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (<a href="https://arxiv.org/abs/1802.02538">2018</a>). Yes, but did it work?: Evaluating variational inference. <em>arXiv preprint</em> arXiv:1802.02538.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The first part of this blog post draws heavily on the excellent review article by Blei, Kucukelbir, and McAuliffe (2017), and so I use their (machine learning) notation. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Fabian DablanderBayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let’s start unpacking! Preliminaries Bayes’ rule states that \[\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,\] where $\mathbf{z}$ denotes latent parameters we want to infer and $\mathbf{x}$ denotes data.1 Bayes’ rule is, in general, difficult to apply because it requires dealing with a potentially high-dimensional integral — the marginal likelihood. Optimization, which involves taking derivatives instead of integrating, is much easier and generally faster than the latter, and so our goal will be to reframe this integration problem as one of optimization. Variational objective We want to get at the posterior distribution, but instead of sampling we simply try to find a density $q^\star(\mathbf{z})$ from a family of densities $\mathrm{Q}$ that best approximates the posterior distribution: \[q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,\] where $\text{KL}(. 
\lvert \lvert.)$ denotes the Kullback-Leibler divergence: \[\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \mathrm{d}\mathbf{z} \enspace .\] We cannot compute this Kullback-Leibler divergence because it still depends on the nasty integral $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$. To see this dependency, observe that: \[\begin{aligned} \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace , \end{aligned}\] where we have expanded the expectation to more clearly behold our nemesis. In doing so, we have seen that $\text{log } p(\mathbf{x})$ is actually a constant with respect to $q(\mathbf{z})$; this means that we can ignore it in our optimization problem. 
Moreover, minimizing a quantity means maximizing its negative, and so we maximize the following quantity: \[\begin{aligned} \text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x}) \right) \\[.5em] &= -\left(\mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x}) - \text{log } p(\mathbf{x})}_{\text{Nemesis perishes}}\right) \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \enspace . \end{aligned}\] We can expand the joint probability to get more insight into this equation: \[\begin{aligned} \text{ELBO}(q) &= \underbrace{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z})\right]}_{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]} - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z})}{q(\mathbf{z})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z})}\right] \\[.5em] &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z})\right) \enspace . \end{aligned}\] This is cool. 
It says that maximizing the ELBO finds an approximate distribution $q(\mathbf{z})$ for latent quantities $\mathbf{z}$ that allows the data to be predicted well, i.e., leads to a high expected log likelihood, but that a penalty is incurred if $q(\mathbf{z})$ strays far away from the prior $p(\mathbf{z})$. This mirrors the usual balance in Bayesian inference between likelihood and prior (Blei, Kucukelbier, & McAuliffe, 2017). ELBO stands for evidence lower bound. The marginal likelihood is sometimes called evidence, and we see that ELBO is indeed a lower bound for the evidence: \[\begin{aligned} \text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x})\right) \\[.5em] \text{log } p(\mathbf{x}) &= \text{ELBO}(q) + \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \\[.5em] \text{log } p(\mathbf{x}) &\geq \text{ELBO}(q) \enspace , \end{aligned}\] since the Kullback-Leibler divergence is non-negative. Heuristically, one might then use the ELBO as a way to select between models. For more on predictive model selection, see this and this blog post. Why variational? Our optimization problem is about finding $q^\star(\mathbf{z})$ that best approximates the posterior distribution. This is in contrast to more familiar optimization problems such as maximum likelihood estimation where one wants to find, for example, the single best value that maximizes the log likelihood. For such a problem, one can use standard calculus (see for example this blog post). In our setting, we do not want to find a single best value but rather a single best function. To do this, we can use variational calculus from which variational inference derives its name (Bishop, 2006, p. 462). A function takes an input value and returns an output value. We can define a functional which takes a whole function and returns an output value. 
The entropy of a probability distribution is a widely used functional: \[\text{H}[p] = -\int p(x) \, \text{log } p(x) \mathrm{d} x \enspace ,\] which takes as input the probability distribution $p(x)$ and returns a single value, its entropy. In variational inference, we want to find the function that maximizes the ELBO, which is a functional. In order to make this optimization problem more manageable, we need to constrain the functions in some way. One could, for example, assume that $q(\mathbf{z})$ is a Gaussian distribution with parameter vector $\omega$. The ELBO then becomes a function of $\omega$, and we employ standard optimization methods to solve this problem. Instead of restricting the parametric form of the variational distribution $q(\mathbf{z})$, in the next section we use an independence assumption to manage the inference problem. Mean-field variational family A frequently used approximation is to assume that the latent variables $z_j$ for $j \in \{1, \ldots, m\}$ are mutually independent, each governed by its own variational density: \[q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j) \enspace .\] Note that this mean-field variational family cannot model correlations in the posterior distribution; by construction, the latent parameters are mutually independent. Observe that we do not make any parametric assumption about the individual $q_j(z_j)$. Instead, their parametric form is derived for every particular inference problem. We start from our definition of the ELBO and apply the mean-field assumption: \[\begin{aligned} \text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em] &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z}\enspace . 
\end{aligned}\] In the following, we optimize the ELBO with respect to a single variational density $q_j(z_j)$ and assume that all others are fixed: \[\begin{aligned} \text{ELBO}(q_j) &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z} \\[.5em] &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j - \underbrace{\int \prod_{i\neq j}^m q_i(z_i) \, \text{log} \prod_{i\neq j}^m q_i(z_i) \, \mathrm{d}\mathbf{z}_{-j}}_{\text{Constant with respect to } q_j(z_j)} \\[.5em] &\propto \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em] &= \int q_j(z_j) \left(\int \prod_{i\neq j}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z}_{-j}\right)\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em] &= \int q_j(z_j) \, \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \enspace . \end{aligned}\] One could use variational calculus to derive the optimal variational density $q_j^\star(z_j)$; instead, we follow Bishop (2006, p. 465) and define the distribution \[\text{log } \tilde{p}{(\mathbf{x}, z_j)} = \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathcal{Z} \enspace ,\] where we need to make sure that it integrates to one by subtracting the (log) normalizing constant $\mathcal{Z}$. 
With this in mind, observe that: \[\begin{aligned} \text{ELBO}(q_j) &\propto \int q_j(z_j) \, \text{log } \tilde{p}{(\mathbf{x}, z_j)} \, \mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em] &= \int q_j(z_j) \, \text{log } \frac{\tilde{p}{(\mathbf{x}, z_j)}}{q_j(z_j)} \, \mathrm{d}z_j \\[.5em] &= -\int q_j(z_j) \, \text{log } \frac{q_j(z_j)}{\tilde{p}{(\mathbf{x}, z_j)}} \, \mathrm{d}z_j \\[.5em] &= -\text{KL}\left(q_j(z_j) \, \lvert\lvert \, \tilde{p}(\mathbf{x}, z_j)\right) \enspace . \end{aligned}\] Thus, maximizing the ELBO with respect to $q_j(z_j)$ is minimizing the Kullback-Leibler divergence between $q_j(z_j)$ and $\tilde{p}(\mathbf{x}, z_j)$; it is zero when the two distributions are equal. Therefore, under the mean-field assumption, the optimal variational density $q_j^\star(z_j)$ is given by: \[\begin{aligned} q_j^\star(z_j) &= \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right] - \mathcal{Z}\right) \\[.5em] &= \frac{\text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right)}{\int \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right) \mathrm{d}z_j} \enspace , \end{aligned}\] see also Bishop (2006, p. 466). This is not an explicit solution, however, since each optimal variational density depends on all others. This calls for an iterative solution in which we first initialize all factors $q_j(z_j)$ and then cycle through them, updating each one conditional on the current values of the others. Such a procedure is known as Coordinate Ascent Variational Inference (CAVI). 
Further, note that \[p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) = \frac{p(z_j, \mathbf{z}_{-j}, \mathbf{x})}{p(\mathbf{z}_{-j}, \mathbf{x})} \propto p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \enspace ,\] which allows us to write the updates in terms of the conditional posterior distribution of $z_j$ given all other factors $\mathbf{z}_{-j}$. This looks a lot like Gibbs sampling, which we discussed in detail in a previous blog post. In the next section, we implement CAVI for a simple linear regression problem. Application: Linear regression In a previous blog post, we traced the history of least squares and applied it to the most basic problem: fitting a straight line to a number of points. Here, we study the same problem but swap the optimization procedure: instead of least squares or maximum likelihood, we use variational inference. Our linear regression setup is: \[\begin{aligned} y &\sim \mathcal{N}(\beta x, \sigma^2) \\[.5em] \beta &\sim \mathcal{N}(0, \sigma^2 \tau^2) \\[.5em] p(\sigma^2) &\propto \frac{1}{\sigma^2} \enspace , \end{aligned}\] where we assume that the population mean of $y$ is zero (i.e., $\beta_0 = 0$); and we assign the error variance $\sigma^2$ an improper Jeffreys’ prior and $\beta$ a Gaussian prior with variance $\sigma^2\tau^2$. We scale the prior of $\beta$ by the error variance to reason in terms of a standardized effect size $\beta / \sigma$ since with this specification: \[\text{Var}\left[\frac{\beta}{\sigma}\right] = \frac{1}{\sigma^2} \text{Var}[\beta] = \frac{\sigma^2 \tau^2}{\sigma^2} = \tau^2 \enspace .\] As a heads up, we have to do a surprising number of calculations to implement variational inference even for this simple problem. In the next section, we start our journey by deriving the variational density for $\sigma^2$. 
Variational density for $\sigma^2$ Our optimal variational density $q^\star(\sigma^2)$ is given by: \[q^\star(\sigma^2) \propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \enspace .\] To get started, we need to derive the conditional posterior distribution $p(\sigma^2 \mid \mathbf{y}, \beta)$. We write: \[\begin{aligned} p(\sigma^2 \mid \mathbf{y}, \beta) &\propto p(\mathbf{y} \mid \sigma^2, \beta) \, p(\beta) \, p(\sigma^2) \\[.5em] &= \prod_{i=1}^n (2\pi)^{-\frac{1}{2}} \left(\sigma^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2} \left(y_i - \beta x_i\right)^2\right) \underbrace{(2\pi)^{-\frac{1}{2}} \left(\sigma^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}_{p(\beta)} \underbrace{\left(\sigma^2\right)^{-1}}_{p(\sigma^2)} \\[.5em] &= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em] &\propto\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{2\sigma^2} \underbrace{\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)}_{A}\right) \enspace , \end{aligned}\] which is proportional to an inverse Gamma distribution. 
Moving on, we exploit the linearity of the expectation and write: \[\begin{aligned} q^\star(\sigma^2) &\propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \\[.5em] &= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{2\sigma^2}A \right]\right) \\[.5em] &= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1}\right] - \mathbb{E}_{q(\beta)}\left[\frac{1}{2\sigma^2}A \right]\right) \\[.5em] &= \text{exp}\left(\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \\[.5em] &= \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \enspace . \end{aligned}\] This, too, looks like an inverse Gamma distribution! Plugging in the normalizing constant, we arrive at: \[q^\star(\sigma^2)= \frac{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \underbrace{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]}_{\nu}\right) \enspace .\] Note that this quantity depends on $\beta$. In the next section, we derive the variational density for $\beta$. Variational density for $\beta$ Our optimal variational density $q^\star(\beta)$ is given by: \[q^\star(\beta) \propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \enspace ,\] and so we again have to derive the conditional posterior distribution $p(\beta \mid \mathbf{y}, \sigma^2)$. 
We write: \[\begin{aligned} p(\beta \mid \mathbf{y}, \sigma^2) &\propto p(\mathbf{y} \mid \beta, \sigma^2) \, p(\beta) \, p(\sigma^2) \\[.5em] &= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em] &= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^ny_i^2- 2 \beta \sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em] &\propto \text{exp}\left(-\frac{1}{2\sigma^2} \left( \beta^2 \left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) - 2 \beta \sum_{i=1}^n y_i x_i\right)\right) \\[.5em] &=\text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta^2 - 2 \beta \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)\right) \\[.5em] &\propto \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace , \end{aligned}\] where we have “completed the square” (see also this blog post) and realized that the conditional posterior is Gaussian. 
We continue by taking expectations: \[\begin{aligned} q^\star(\beta) &\propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \\[.5em] &= \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right]\right) \\[.5em] &= \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace , \end{aligned}\] which is again proportional to a Gaussian distribution! Plugging in the normalizing constant yields: \[q^\star(\beta) = \left(2\pi\underbrace{\frac{\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}}_{\sigma^2_{\beta}}\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left(\beta - \underbrace{\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}}_{\mu_{\beta}}\right)^2\right) \enspace .\] Note that while the variance of this distribution, $\sigma^2_\beta$, depends on $q(\sigma^2)$, its mean $\mu_\beta$ does not. To recap, instead of assuming a parametric form for the variational densities, we have derived the optimal densities under the mean-field assumption, that is, under the assumption that the parameters are independent: $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$. Assigning $\beta$ a Gaussian prior and $\sigma^2$ a Jeffreys’ prior, we have found that the variational density for $\sigma^2$ is an inverse Gamma distribution and that the variational density for $\beta$ is a Gaussian distribution. We noted that these variational densities depend on each other. 
However, this is not the end of the manipulation of symbols; both distributions still feature an expectation we need to remove. In the next section, we expand the remaining expectations. Removing expectations Now that we know the parametric form of both variational densities, we can expand the terms that involve an expectation. In particular, to remove the expectation in the variational density for $\sigma^2$, we write: \[\begin{aligned} \mathbb{E}_{q(\beta)}\left[A \right] &= \mathbb{E}_{q(\beta)}\left[\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right] \\[.5em] &= \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mathbb{E}_{q(\beta)}\left[\beta\right] + \sum_{i=1}^n x_i^2 \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] + \frac{1}{\tau^2} \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] \enspace . \end{aligned}\] Noting that $\mathbb{E}_{q(\beta)}[\beta] = \mu_{\beta}$ and using the fact that: \[\mathbb{E}_{q(\beta)}[\beta^2] = \text{Var}_{q(\beta)}\left[\beta\right] + \mathbb{E}_{q(\beta)}[\beta]^2 = \sigma^2_{\beta} + \mu_{\beta}^2 \enspace ,\] the expectation becomes: \[\mathbb{E}_{q(\beta)}\left[A\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) \enspace .\] For the expectation which features in the variational distribution for $\beta$, things are slightly less elaborate, although the result also looks unwieldy. 
We write: \[\begin{aligned} \mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] &= \int \frac{1}{\sigma^2}\frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2\\[0.50em] &= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \int \left(\sigma^2\right)^{-\left(\frac{n + 1}{2} + 1\right) - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2 \\[0.50em] &= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \frac{\Gamma\left(\frac{n + 1}{2} + 1\right)}{\nu^{\frac{n + 1}{2} + 1}} \\[0.50em] &= \frac{n + 1}{2} \left(\frac{1}{2}\mathbb{E}_{q(\beta)}\left[A \right]\right)^{-1} \\[.5em] &= \frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \enspace . \end{aligned}\] Monitoring convergence The algorithm works by first specifying initial values for the parameters of the variational densities and then iteratively updating them until the ELBO does not change anymore. This requires us to compute the ELBO, which we still need to derive, on each update. 
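The update equations derived so far can be collected into a small coordinate ascent loop. The following is a minimal sketch rather than the implementation used in this post: the function name `cavi_sketch`, the simulated data, the initial value of $\nu$, and the stopping rule are all assumptions made for illustration. In particular, the sketch stops once the scale parameter $\nu$ of $q(\sigma^2)$ no longer changes, instead of monitoring the ELBO.

```r
# Minimal CAVI sketch for y ~ N(beta * x, sigma^2) with the priors used above.
# Assumption: convergence is declared when nu stops changing, not via the ELBO.
cavi_sketch <- function(y, x, tau2, tol = 1e-10, max_iter = 1000) {
  n <- length(y)
  sum_y2 <- sum(y^2)
  sum_x2 <- sum(x^2)
  sum_yx <- sum(x * y)

  # The mean of q(beta) does not depend on q(sigma^2), so it is computed once
  beta_mu <- sum_yx / (sum_x2 + 1 / tau2)

  # Arbitrary initial value for the scale parameter nu of q(sigma^2)
  nu <- 1
  for (iter in seq_len(max_iter)) {
    # E[1 / sigma^2] under q(sigma^2) = InverseGamma((n + 1) / 2, nu)
    E_inv_sigma2 <- ((n + 1) / 2) / nu

    # Variance of q(beta)
    beta_var <- 1 / (E_inv_sigma2 * (sum_x2 + 1 / tau2))

    # nu is half the expectation of A under q(beta)
    E_A <- sum_y2 - 2 * sum_yx * beta_mu +
      (beta_var + beta_mu^2) * (sum_x2 + 1 / tau2)
    nu_new <- E_A / 2

    converged <- abs(nu_new - nu) < tol
    nu <- nu_new
    if (converged) break
  }

  list(beta_mu = beta_mu, beta_var = beta_var, nu = nu, iterations = iter)
}

# Simulated example data (made up for this sketch)
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
fit <- cavi_sketch(y, x, tau2 = 0.25)
```

Because $\mu_\beta$ is fixed and $\sigma^2_\beta$ is proportional to $\nu$, each pass shrinks the distance to the fixed point by a factor of $1 / (n + 1)$, so under these assumptions the loop settles after a handful of iterations.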
We write: \[\begin{aligned} \text{ELBO}(q) &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y}, \beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2) \right] \\[.5em] &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2)\right] \\[.5em] &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \underbrace{\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right]}_{-\text{KL}\left(q(\beta, \sigma^2) \, \lvert\lvert \, p(\beta, \sigma^2)\right)}\enspace . \end{aligned}\] Let’s take a deep breath and tackle the second term first: \[\begin{aligned} \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{p(\beta \mid \sigma^2)}{q(\beta)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\left(2\pi\sigma^2\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}{\left(2\pi\sigma^2_\beta\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2_\beta} (\beta - \mu_\beta)^2\right)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[-\frac{1}{2}\text{log } \frac{\sigma^2\tau^2}{\sigma^2_\beta} - \frac{\beta^2}{2\sigma^2\tau^2} + \frac{(\beta - \mu_\beta)^2}{2\sigma^2_\beta}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= \mathbb{E}_{q(\sigma^2)}\left[-\frac{1}{2}\text{log}\frac{\sigma^2\tau^2}{\sigma^2_\beta} - \frac{\sigma^2_\beta + \mu_\beta^2}{2\sigma^2\tau^2} + \frac{1}{2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= -\frac{1}{2}\text{log}\frac{\tau^2}{\sigma^2_\beta} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em] &= -\frac{1}{2}\text{log}\frac{\tau^2}{\sigma^2_\beta} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right] - \mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]\enspace , \end{aligned}\] where we have used $\mathbb{E}_{q(\beta)}[\beta^2] = \sigma^2_\beta + \mu_\beta^2$ and $\mathbb{E}_{q(\beta)}[(\beta - \mu_\beta)^2] = \sigma^2_\beta$.
Note that, apart from $\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]$, which we have already derived, there are three expectations left. However, we really deserve a break, and so instead of analytically deriving the expectations we compute $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$ and $\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right]$ numerically using Gaussian quadrature. This fails for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$, which we compute using Monte Carlo integration: \[\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right] = \int q(\sigma^2) \, \text{log } q(\sigma^2) \, \mathrm{d}\sigma^2 \approx \frac{1}{N} \sum_{i=1}^N \underbrace{\text{log } q(\sigma^2)}_{\sigma^2 \, \sim \, q(\sigma^2)} \enspace .\] We are left with the expected log likelihood. Instead of filling this blog post with more equations, we again resort to numerical methods. 
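As a rough illustration of the Monte Carlo step, the sketch below estimates $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$ by drawing from an inverse Gamma distribution and averaging the log density over the draws. The shape and scale values and the helper name `log_dinvgamma` are made up for this example. Base R has no inverse Gamma sampler, so the draws are obtained as reciprocals of Gamma draws. The closed-form negative entropy of the inverse Gamma distribution is included only as a sanity check on the estimate.

```r
# Hypothetical parameters of q(sigma^2) = InverseGamma(shape, scale)
shape <- 5.5
scale <- 3

# Log density of the inverse Gamma distribution
log_dinvgamma <- function(s2, shape, scale) {
  shape * log(scale) - lgamma(shape) - (shape + 1) * log(s2) - scale / s2
}

set.seed(1)
N <- 1e5

# If 1 / sigma^2 ~ Gamma(shape, rate = scale),
# then sigma^2 ~ InverseGamma(shape, scale)
sigma2_draws <- 1 / rgamma(N, shape = shape, rate = scale)

# Monte Carlo estimate of E_q[log q(sigma^2)]
mc_estimate <- mean(log_dinvgamma(sigma2_draws, shape, scale))

# Closed form: the negative entropy of the inverse Gamma distribution
exact <- -(shape + log(scale) + lgamma(shape) - (1 + shape) * digamma(shape))

c(monte_carlo = mc_estimate, exact = exact)
```

The two numbers should agree up to Monte Carlo error, which shrinks at rate $1 / \sqrt{N}$.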
However, we refactor the expression so that the double integral splits into one-dimensional expectations, which makes numerical integration more efficient: \[\begin{aligned} \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] &= \int \int q(\beta) \, q(\sigma^2) \, \text{log } p(\mathbf{y} \mid \beta, \sigma^2) \, \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em] &=\int q(\beta) \int q(\sigma^2) \, \text{log} \left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i\beta)^2\right)\right) \, \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em] &= -\frac{n}{2} \text{log}\left(2\pi\right) - \frac{n}{2}\int q(\sigma^2) \, \text{log} \left(\sigma^2\right) \, \mathrm{d}\sigma^2 - \frac{1}{2}\int q(\beta) \left(\sum_{i=1}^n (y_i - x_i\beta)^2\right) \, \mathrm{d}\beta\int q(\sigma^2) \, \frac{1}{\sigma^2} \, \mathrm{d}\sigma^2 \enspace . \end{aligned}\] Since we have solved a similar problem already above, we evaluate the expectation with respect to $q(\beta)$ analytically: \[\mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right) \enspace .\] In the next section, we implement the algorithm for our linear regression problem in R. Implementation in R Now that we have derived the optimal densities, we know how they are parameterized. Therefore, the ELBO is a function of these variational parameters and the parameters of the priors, which in our case is just $\tau^2$. We write a function that computes the ELBO:
The first part of this blog post draws heavily on the excellent review article by Blei, Kucukelbir, and McAuliffe (2017), and so I use their (machine learning) notation.
Harry Potter and the Power of Bayesian Inference
2019-09-28T08:00:00+00:00
https://fabiandablander.com/r/Bayes-Potter
<p>If you are reading this, you are probably a Ravenclaw. Or a Hufflepuff. Certainly not a Slytherin … but maybe a Gryffindor?</p>
<p>In this blog post, we let three subjective Bayesians predict the outcome of ten coin flips. We will derive prior predictions, evaluate their accuracy, and see how fortune favours the bold. We will also discover a neat trick that allows one to easily compute Bayes factors for models with parameter restrictions compared to models without such restrictions, and use it to answer a question we truly care about: are Slytherins really the bad guys?</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>As in a <a href="https://fabiandablander.com/r/Regularization.html">previous blog post</a>, we start by studying coin flips. Let $\theta \in [0, 1]$ be the bias of the coin and let $y$ denote the number of heads out of $n$ coin flips. We use the Binomial likelihood</p>
\[p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,\]
<p>and a Beta prior for $\theta$:</p>
\[p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .\]
<p>This prior is <em>conjugate</em> for this likelihood which means that the posterior is again a Beta distribution. The Figure below shows two examples of this.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
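<p>Conjugacy can also be checked numerically. The following is a minimal sketch with made-up values (the prior parameters <code>a</code> and <code>b</code> and the data <code>y</code> and <code>n</code> are illustrative and not taken from the figure): it normalizes the product of likelihood and prior by numerical integration and compares the result with the $\text{Beta}(a + y, b + n - y)$ density.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Illustrative prior parameters and data (assumptions for this sketch)
a <- 2; b <- 3
y <- 7; n <- 10

# Unnormalized posterior: Binomial likelihood times Beta prior
unnormalized <- function(theta) dbinom(y, n, theta) * dbeta(theta, a, b)
normalizer <- integrate(unnormalized, 0, 1)$value

# Compare the normalized product with the Beta(a + y, b + n - y) density
theta_grid <- seq(0.05, 0.95, by = 0.05)
numerical <- unnormalized(theta_grid) / normalizer
conjugate <- dbeta(theta_grid, a + y, b + n - y)

# Largest absolute difference on the grid; should be numerically zero
max(abs(numerical - conjugate))</code></pre></figure>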
<p>In this blog post, we will use a <em>prior predictive</em> perspective on model comparison by means of Bayes factors. For an extensive contrast with a perspective based on <em>posterior prediction</em>, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this blog post</a>. The Bayes factor indicates how much better a model $\mathcal{M}_1$ predicts the data $y$ <em>relative to another model</em> $\mathcal{M}_0$:</p>
\[\text{BF}_{10} = \frac{p(y \mid \mathcal{M}_1)}{p(y \mid \mathcal{M}_0)} \enspace ,\]
<p>where we can write the <em>marginal likelihood</em> of a generic model $\mathcal{M}$ in expanded form to see its dependence on the model’s priors:</p>
\[p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .\]
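<p>To make the two formulas above concrete, here is a small sketch that computes a marginal likelihood by numerical integration and forms a Bayes factor from it. The data and the helper names <code>marginal_numeric</code> and <code>marginal_exact</code> are assumptions for illustration; the closed form holds for a Beta prior combined with a Binomial likelihood.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Illustrative data (an assumption for this sketch)
y <- 7; n <- 10

# Marginal likelihood under a Beta(a, b) prior, by numerical integration
marginal_numeric <- function(y, n, a, b) {
  integrand <- function(theta) dbinom(y, n, theta) * dbeta(theta, a, b)
  integrate(integrand, 0, 1)$value
}

# The same quantity in closed form, using the Beta function
marginal_exact <- function(y, n, a, b) {
  choose(n, y) * beta(a + y, b + n - y) / beta(a, b)
}

# M1: uniform prior on theta; M0: all prior mass on theta = 0.50
m1 <- marginal_numeric(y, n, a = 1, b = 1)
m0 <- dbinom(y, n, prob = 0.50)

# Bayes factor in favour of M1 over M0
bf10 <- m1 / m0</code></pre></figure>
<p>A value of <code>bf10</code> above one means the data were predicted better by the uniform prior than by the point prior, and a value below one means the reverse.</p>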
<p>After these preliminaries, in the next section, we visit Ron, Harry, and Hermione in Hogwarts.</p>
<h1 id="the-hogwarts-prediction-contest">The Hogwarts prediction contest</h1>
<p>Ron, Harry, and Hermione just came back from a straining adventure — Death Eaters and all. They deserve a break, and Hermione suggests a small prediction contest to relax. Ron is put off initially; relaxing by thinking? That’s not his style. Harry does not care either way; both are eventually convinced.</p>
<p>The goal of the contest is to accurately predict the outcome of $n = 10$ coin flips. Luckily, this is not a particularly complicated problem to model, and we can use the Binomial likelihood we have discussed above. In the next section, Ron, Harry, and Hermione, all subjective Bayesians, clearly state their prior beliefs, which are required to make predictions.</p>
<h2 id="prior-beliefs">Prior beliefs</h2>
<p>Ron is not big on thinking, and so trusts his intuition that coins are usually unbiased; he specifies a point mass on $\theta = 0.50$ as his prior. Harry spreads his bets evenly, and believes that all chances governing the coin flip’s outcome are equally likely; he puts a uniform prior on $\theta$. Hermione, on the other hand, believes that the coin <em>cannot</em> be biased towards tails; instead, she believes that all values $\theta \in [0.50, 1]$ are equally likely. She thinks this because Dobby, the house elf, is the one who throws the coin, and she has previously observed him passing time by flipping coins, which strangely almost always landed heads up. To sum up, their priors are:</p>
\[\begin{aligned}
\text{Ron} &: \theta = 0.50 \\[.5em]
\text{Harry} &: \theta \sim \text{Beta}(1, 1) \\[.5em]
\text{Hermione} &: \theta \sim \text{Beta}(1, 1)\mathbb{I}(0.50, 1) \enspace ,
\end{aligned}\]
<p>which are visualized in the Figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In the next section, the three use their beliefs to make probabilistic predictions.</p>
<h2 id="prior-predictions">Prior predictions</h2>
<p>Ron, Harry, and Hermione are subjective Bayesians and therefore evaluate their performance by their respective predictive accuracy. Each of the trio has a <em>prior predictive distribution</em>. For Ron, true to character, this is the easiest to derive. We associate model $\mathcal{M}_0$ with him and write:</p>
\[\begin{aligned}
p(y \mid \mathcal{M}_0) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_0) \, p(\theta \mid \mathcal{M}_0) \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} 0.50^y (1 - 0.50)^{n - y} \enspace ,
\end{aligned}\]
<p>where the integral — the sum! — is just over the value $\theta = 0.50$. Ron’s prior predictive distribution is simply a Binomial distribution. He is delighted by this fact, and enjoys a short rest while the others derive their predictions.</p>
<p>It is Harry’s turn, and he is a little put off by his integration problem. However, he realizes that the integrand is an unnormalized Beta distribution, and swiftly writes down its normalizing constant, the Beta function. Associating $\mathcal{M}_1$ with him, his steps are:</p>
\[\begin{aligned}
p(y \mid \mathcal{M}_1) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \frac{1}{\text{B}(1, 1)} \theta^{1 - 1} (1 - \theta)^{1 - 1} \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} \text{B}(y + 1, n - y + 1) \enspace ,
\end{aligned}\]
<p>which is a <a href="https://en.wikipedia.org/wiki/Beta-binomial_distribution">Beta-Binomial distribution</a> with $\alpha = \beta = 1$.</p>
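<p>A quick check shows what this means for Harry’s predictions. Writing the Beta function in terms of factorials gives ${n \choose y} \text{B}(y + 1, n - y + 1) = \frac{n!}{y! \, (n - y)!} \frac{y! \, (n - y)!}{(n + 1)!} = \frac{1}{n + 1}$, so under a uniform prior every outcome from $0$ to $n$ heads is predicted to be equally likely. The sketch below confirms this numerically; the variable name <code>harry_check</code> is made up for this example.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">n <- 10
y <- 0:n

# Prior predictive probabilities under the uniform prior, for y = 0, ..., n
harry_check <- choose(n, y) * beta(y + 1, n - y + 1)

# Every entry should equal 1 / (n + 1), and the entries should sum to one
range(harry_check)
sum(harry_check)</code></pre></figure>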
<p>Hermione’s integral is the most complicated of the three, but she is also the smartest of the bunch. She is a master of the wizardry that is computer programming, which allows her to solve the integral numerically.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote">1</a></sup> We associate $\mathcal{M}_r$, which stands for <em>restricted</em> model, with her and write:</p>
\[\begin{aligned}
p(y \mid \mathcal{M}_r) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta \\[.5em]
&= \int_{0.50}^1 {n \choose y} \theta^y (1 - \theta)^{n - y} \, 2 \, \mathrm{d}\theta \\[.5em]
&= 2{n \choose y}\int_{0.50}^1 \theta^y (1 - \theta)^{n - y} \mathrm{d}\theta \enspace .
\end{aligned}\]
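<p>As an aside, the remaining integral is an upper tail of a Beta distribution, so it can also be written with the Beta distribution function: $\int_{0.50}^1 \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta = \text{B}(y + 1, n - y + 1) \, P(\theta > 0.50)$ with $\theta \sim \text{Beta}(y + 1, n - y + 1)$. The sketch below compares this with direct numerical integration; the function names <code>hermione_closed_form</code> and <code>hermione_numeric</code> are hypothetical and chosen for this example only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Restricted prior predictive via the Beta distribution function (a sketch)
hermione_closed_form <- function(y, n = 10) {
  upper_tail <- pbeta(0.50, y + 1, n - y + 1, lower.tail = FALSE)
  2 * choose(n, y) * beta(y + 1, n - y + 1) * upper_tail
}

# The same quantity by numerical integration, for comparison
hermione_numeric <- function(y, n = 10) {
  integrand <- function(theta) theta^y * (1 - theta)^(n - y)
  2 * choose(n, y) * integrate(integrand, 0.50, 1)$value
}

y <- 0:10
closed <- hermione_closed_form(y)
by_integration <- sapply(y, hermione_numeric)

# Largest absolute difference between the two approaches
max(abs(closed - by_integration))</code></pre></figure>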
<p>We can draw from the prior predictive distributions by simulating from the prior and then making predictions through the likelihood. For Hermione, for example, this yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">nr_draws</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w">
</span><span class="n">theta_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theta_Hermione</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 6 4 10 4 5 8 8 8 10 9 4 8 4 4 4 9 6 9 6 8</code></pre></figure>
<p>Let’s visualize Ron’s, Harry’s, and Hermione’s prior predictive distributions to get a better feeling for what they believe are likely coin flip outcomes. First, we implement their prior predictions in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Ron</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="n">n</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Harry</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">int</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">theta</span><span class="o">^</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta</span><span class="p">)</span><span class="o">^</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">int</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Even though Ron believes that $\theta = 0.50$, this does not mean that his prior prediction puts all mass on $y = 5$; deviations from this value are plausible. Harry’s prior predictive distribution also makes sense: since he believes all values for $\theta$ to be equally likely, he should believe all outcomes are equally likely. Hermione, on the other hand, believes that $\theta \in [0.50, 1]$, so her prior probabilities for outcomes with few heads ($y < 5$) drastically decrease.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
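<p>To see the numbers behind such a figure, the following minimal sketch tabulates the three prior predictive distributions for all possible outcomes and checks that each one sums to one. It repeats the three functions so that it runs on its own; the object name <code>pred</code> is just a choice made here for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Prior predictive probability mass functions for n = 10 coin flips
Ron <- function(y, n = 10) choose(n, y) * 0.50^n
Harry <- function(y, n = 10) choose(n, y) * beta(y + 1, n - y + 1)
Hermione <- function(y, n = 10) {
  int <- integrate(function(theta) theta^y * (1 - theta)^(n - y), 0.50, 1)
  2 * choose(n, y) * int$value
}

y <- 0:10
pred <- data.frame(
  y = y,
  Ron = Ron(y),
  Harry = Harry(y),
  Hermione = sapply(y, Hermione)  # integrate() handles one value of y at a time
)
round(pred, 4)

# Each column is a proper distribution over y = 0, ..., 10
colSums(pred[, -1])</code></pre></figure>
<p>Harry’s column is constant at $\frac{1}{11}$, which is the uniform prior predictive distribution described above, while Hermione’s mass shifts towards outcomes with many heads.</p>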
<p>After the three have clearly stated their prior beliefs and derived their prior predictions, Dobby throws a coin ten times. The coin comes up heads nine times. In the next section, we discuss the relative predictive performance of Ron, Harry, and Hermione based on these data.</p>
<h2 id="evaluating-predictions">Evaluating predictions</h2>
<p>To assess the relative predictive performance of Ron, Harry, and Hermione, we need to compute the probability mass of $y = 9$ for their respective prior predictive distributions. Compared to Ron, Hermione did roughly 19 times better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 18.50909</code></pre></figure>
<p>Harry, on the other hand, did about 9 times better than Ron:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 9.309091</code></pre></figure>
<p>With these two comparisons, we also know by how much Hermione outperformed Harry, since by transitivity we have:</p>
\[\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_0)} \times \frac{p(y \mid \mathcal{M}_0)}{p(y \mid \mathcal{M}_1)} = \text{BF}_{r0} \times \frac{1}{\text{BF}_{10}} \approx 2 \enspace ,\]
<p>which is indeed correct:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>Note that this is also immediately apparent from the visualizations above, where Hermione’s allocated probability mass is about twice as large as Harry’s for the case where $y = 9$.</p>
<p>Hermione was bold in her prediction, and was rewarded with being favoured by a factor of two in predictive performance. Note that if her predictions had been even bolder, say restricting her prior to $\theta \in [0.80, 1]$, she would have reaped an even higher reward than a Bayes factor of two in her favour. Contrast this with Dobby throwing the coin ten times and only one heads showing. Then Harry’s marginal likelihood is still ${10 \choose 1} \, \text{B}(2, 10) = \frac{1}{11}$, since his prior predictive distribution is uniform over outcomes. However, Hermione’s is not twice as much; instead, it is a mere $0.001065$, which would result in a Bayes factor of about $85$ in favour of Harry!</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Hermione</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 85.33333</code></pre></figure>
<p>This means that with bold predictions, one can also lose a lot. However, this is tremendously insightful, since Hermione would immediately realize where she went wrong. For a discussion that also points out the flexibility of Bayesian model comparison, see Etz, Haaf, Rouder, & Vandekerckhove (2018).</p>
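<p>The trade-off between boldness and risk can be made concrete with a small sketch. The helper <code>restricted</code> below is hypothetical (it is not part of the analysis above) and generalizes Hermione’s prior to a uniform distribution on $[\text{lower}, 1]$, so that we can compare the restrictions $[0.50, 1]$ and $[0.80, 1]$ for nine heads and for one head.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper: prior predictive mass of y under a uniform prior on [lower, 1]
restricted <- function(y, lower, n = 10) {
  int <- integrate(function(theta) dbinom(y, n, theta), lower, 1)
  int$value / (1 - lower)  # uniform prior density on [lower, 1] is 1 / (1 - lower)
}
Harry <- function(y, n = 10) choose(n, y) * beta(y + 1, n - y + 1)

lower <- c(0.50, 0.80)
# Bayes factors in favour of the restricted model compared to Harry
bf_nine <- sapply(lower, function(l) restricted(9, l) / Harry(9))  # nine heads
bf_one <- sapply(lower, function(l) restricted(1, l) / Harry(1))   # one head
data.frame(lower, bf_nine, bf_one)</code></pre></figure>
<p>With nine heads, the bolder restriction is favoured by a factor of roughly $3.4$ instead of $2$; with one head, it is punished far more severely than Hermione’s original prior.</p>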
<p>In the next section, we will discover a nice trick which simplifies the computation of the Bayes factor; we do not need to derive marginal likelihoods, but can simply look at the prior and the posterior distribution of the parameter of interest in the unrestricted model.</p>
<h1 id="prior--posterior-trick">Prior / Posterior trick</h1>
<p>As it turns out, the relative predictive performance of Hermione compared to Harry is given by the ratio of the purple area to the blue area in the figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>In other words, the Bayes factor in favour of the <em>restricted</em> model (i.e., Hermione) compared to the <em>unrestricted</em> or <em>encompassing</em> model (i.e., Harry) is given by the posterior probability of $\theta$ being in line with the restriction compared to the prior probability of $\theta$ being in line with the restriction. We can check this numerically:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># (1 - pbeta(0.50, 10, 2)) / 0.50 would also work</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">dbeta</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">0.50</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>This is a very cool result which, to my knowledge, was first described in Klugkist, Kato, & Hoijtink (2005). In the next section, we prove it.</p>
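<p>When the prior and posterior are not available in closed form, the same ratio can be approximated from samples. The sketch below assumes we can draw from Harry’s prior and posterior (here directly via <code>rbeta</code>; in practice the draws might come from MCMC) and simply counts the proportion of draws that are in line with the restriction. The seed and the number of draws are arbitrary choices.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
nr_draws <- 1e5

prior_draws <- rbeta(nr_draws, 1, 1)       # Harry's prior
posterior_draws <- rbeta(nr_draws, 10, 2)  # Harry's posterior after y = 9, n = 10

# Proportion of draws in line with Hermione's restriction
prior_prop <- mean(prior_draws > 0.50)
posterior_prop <- mean(posterior_draws > 0.50)

# Monte Carlo estimate of the Bayes factor in favour of Hermione
posterior_prop / prior_prop</code></pre></figure>
<p>The estimate fluctuates around the exact value of $1.988$ and becomes more precise as the number of draws increases.</p>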
<h2 id="proof">Proof</h2>
<p>The proof uses two insights. First, note that we can write the priors in the restricted model, $\mathcal{M}_r$, as priors in the encompassing model, $\mathcal{M}_1$, subject to some constraints. In the Hogwarts prediction context, Hermione’s prior was a restricted version of Harry’s prior:</p>
\[\begin{aligned}
p(\theta \mid \mathcal{M}_r) &= p(\theta \mid \mathcal{M}_1)\mathbb{I}(0.50, 1) \\[1em]
&= \begin{cases} \frac{p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} & \text{if} \hspace{1em} \theta \in [0.50, 1] \\[1em] 0 & \text{otherwise}\end{cases}
\end{aligned}\]
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>We have to divide by the term</p>
\[K = \int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta = 0.50 \enspace ,\]
<p>so that the restricted prior integrates to 1, as all proper probability distributions must. As a direct consequence, note that the density of a value $\theta = \theta^{\star}$ is given by:</p>
\[p(\theta^{\star} \mid \mathcal{M}_r) = p(\theta^{\star} \mid \mathcal{M}_1) \cdot \frac{1}{K} \enspace ,\]
<p>where $K$ is the renormalization constant. This means that we can rewrite terms which include the restricted prior in terms of the unrestricted prior from the encompassing model. This also holds for the posterior!</p>
<p>To see that we can also write the restricted posterior in terms of the unrestricted posterior from the encompassing model, note that the likelihood is the same under both models and that:</p>
\[\begin{aligned}
p(\theta \mid y, \mathcal{M}_r) &= \frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K}}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \\[.5em]
&= \frac{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta}}{\int_{0.50}^1 \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(\theta \mid y, \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta} \enspace ,
\end{aligned}\]
<p>where we have to renormalize by</p>
\[Z = \int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta \enspace ,\]
<p>which is</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pbeta</span><span class="p">(</span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 0.9941406</code></pre></figure>
<p>The figure below visualizes Harry’s and Hermione’s posterior. Sensibly, since Hermione excluded all $\theta \in [0, 0.50)$ in her prior, such values receive zero credence in her posterior. However, the difference between Harry’s and Hermione’s posterior distributions is small compared to the difference between their prior distributions. This is reflected in $Z$ being close to 1.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>Similar to the prior, we can write the density of a value $\theta = \theta^\star$ in terms of the encompassing model:</p>
\[p(\theta^{\star} \mid y, \mathcal{M}_r) = p(\theta^{\star} \mid y, \mathcal{M}_1) \cdot \frac{1}{Z} \enspace .\]
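<p>As a sanity check, we can compute Hermione’s posterior directly from Bayes’ rule using numerical integration and compare it to Harry’s posterior rescaled by $1 / Z$. This is a minimal sketch; the values in <code>theta_star</code> are arbitrary points in line with the restriction.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">y <- 9
n <- 10

# Likelihood times Hermione's prior, which has density 2 on [0.50, 1]
unnorm <- function(theta) dbinom(y, n, theta) * 2 * (theta >= 0.50)
marg_r <- integrate(unnorm, 0.50, 1)$value  # Hermione's marginal likelihood
posterior_r <- function(theta) unnorm(theta) / marg_r

# Harry's posterior is Beta(10, 2); Z is its mass on [0.50, 1]
Z <- 1 - pbeta(0.50, 10, 2)

theta_star <- c(0.60, 0.75, 0.90)
cbind(
  direct = posterior_r(theta_star),
  via_encompassing = dbeta(theta_star, 10, 2) / Z
)</code></pre></figure>
<p>Both columns agree up to numerical error, which is what the equation above claims.</p>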
<p>Now that we have established that we can write both the prior and the posterior density of parameters in the restricted model in terms of the parameters in the unrestricted model, as a second step, note that we can swap the posterior and the marginal likelihood terms in Bayes’ rule such that:</p>
\[p(y \mid \mathcal{M}_1) = \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)} \enspace ,\]
<p>from which it follows that:</p>
\[\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_1)} = \frac{\frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{p(\theta \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)}} \enspace .\]
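<p>The rearranged version of Bayes’ rule can look suspicious at first, since the left-hand side does not depend on $\theta$ while the right-hand side seemingly does. The following sketch evaluates the right-hand side at a few arbitrary values of $\theta$ in $[0.50, 1]$ for both models, using the constants $K$ and $Z$ from above for the restricted model.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">y <- 9
n <- 10
theta <- c(0.55, 0.70, 0.95)  # arbitrary values in line with the restriction

K <- 0.50                     # prior mass on [0.50, 1] under Harry's model
Z <- 1 - pbeta(0.50, 10, 2)   # posterior mass on [0.50, 1] under Harry's model

# Encompassing model: likelihood * prior / posterior
marg_1 <- dbinom(y, n, theta) * dbeta(theta, 1, 1) / dbeta(theta, 10, 2)

# Restricted model: prior rescaled by 1 / K, posterior rescaled by 1 / Z
marg_r <- dbinom(y, n, theta) * (dbeta(theta, 1, 1) / K) / (dbeta(theta, 10, 2) / Z)

rbind(marg_1, marg_r)</code></pre></figure>
<p>Each row is constant across the values of $\theta$: the first equals Harry’s marginal likelihood of $\frac{1}{11}$, and the second equals Hermione’s marginal likelihood that we previously obtained by numerical integration.</p>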
<p>Now suppose that we have values that are in line with the restriction, i.e., $\theta = \theta^{\star}$. Then:</p>
\[\begin{aligned}
\text{BF}_{r1} = \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star\mid \mathcal{M}_r)}{p(\theta^\star \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star \mid \mathcal{M}_1) \, \frac{1}{K}}{p(\theta^\star \mid y, \mathcal{M}_1) \, \frac{1}{Z}}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, \frac{1}{K}}{\frac{1}{Z}}}{p(y \mid \theta^\star, \mathcal{M}_1)} = \frac{\frac{1}{K}}{\frac{1}{Z}} = \frac{Z}{K} \enspace ,
\end{aligned}\]
<p>where we have used the previous insights and the fact that the likelihood under $\mathcal{M}_r$ and $\mathcal{M}_1$ is the same. If we expand the constants for our previous problem, we have:</p>
\[\text{BF}_{r1} = \frac{Z}{K} = \frac{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace ,\]
<p>which is, as claimed above, the posterior probability of values for $\theta$ that are in line with the restriction divided by the prior probability of values for $\theta$ that are in line with the restriction. Note that this holds for arbitrary restrictions of an arbitrary number of parameters (see Klugkist, Kato, & Hoijtink, 2005). In the limit where we take the restriction to be infinitesimally small, that is, constrain the parameter to be a point value, this results in the Savage-Dickey density ratio (Wetzels, Grasman, & Wagenmakers, 2010).</p>
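<p>We can illustrate this limit numerically. The sketch below shrinks an interval restriction around $\theta_0 = 0.50$ (the half-widths are arbitrary choices) and compares the resulting ratio of posterior to prior mass with the ratio of posterior to prior density at $\theta_0$, as well as with the Bayes factor of Ron’s point-mass model against Harry’s model.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">theta_0 <- 0.50
width <- c(0.2, 0.1, 0.01, 0.001)  # half-widths of the restriction around theta_0
lower <- theta_0 - width
upper <- theta_0 + width

# Posterior and prior mass inside the interval under Harry's model
posterior_mass <- pbeta(upper, 10, 2) - pbeta(lower, 10, 2)
prior_mass <- pbeta(upper, 1, 1) - pbeta(lower, 1, 1)
BF_interval <- posterior_mass / prior_mass
data.frame(width, BF_interval)

# Savage-Dickey density ratio: posterior over prior density at theta_0
savage_dickey <- dbeta(theta_0, 10, 2) / dbeta(theta_0, 1, 1)
savage_dickey

# Ron's marginal likelihood divided by Harry's marginal likelihood
dbinom(9, 10, theta_0) / (choose(10, 9) * beta(10, 2))</code></pre></figure>
<p>As the interval narrows, the ratio of masses approaches the density ratio, which in turn coincides with the Bayes factor in favour of Ron compared to Harry, the inverse of the value of about $9.31$ computed earlier.</p>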
<p>In the next section, we apply this idea to a data set that relates Hogwarts Houses to personality traits.</p>
<h1 id="hogwarts-houses-and-personality">Hogwarts Houses and personality</h1>
<p>So, are you a Slytherin, Hufflepuff, Ravenclaw, or Gryffindor? And what does this say about your personality?</p>
<p>Inspired by Crysel et al. (2015), Lea Jakob, Eduardo Garcia-Garzon, Hannes Jarke, and I analyzed self-reported personality data from 847 people as well as their self-reported Hogwarts House affiliation.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote">2</a></sup> We wanted to answer questions such as: do people who report belonging to Slytherin tend to score highest on Narcissism, Machiavellianism, and Psychopathy? Are Hufflepuffs the most agreeable, and Gryffindors the most extraverted? The figure below visualizes the raw data.</p>
<div style="text-align:center;">
<img src="../assets/img/Potter-Personality.png" align="center" style="margin-top: -10px; padding-bottom: 0px;" width="680" height="540" />
</div>
<p>We used a between-subjects ANOVA as our model and, in the case of Agreeableness, for example, compared the following hypotheses:</p>
\[\begin{aligned}
\mathcal{H}_0&: \mu_H = \mu_G = \mu_R = \mu_S \\[.5em]
\mathcal{H}_r&: \mu_H > (\mu_G , \mu_R , \mu_S) \\[.5em]
\mathcal{H}_1&: \mu_H , \mu_G , \mu_R , \mu_S
\end{aligned}\]
<p>We used the BayesFactor R package to compute the Bayes factor in favour of $\mathcal{H}_1$ compared to $\mathcal{H}_0$. For the restricted hypotheses $\mathcal{H}_r$, we used the prior/posterior trick outlined above; and indeed, we found strong evidence in favour of the notion that, for example, Hufflepuffs score highest on Agreeableness. Curious about Slytherin and the other Houses? You can read the published paper with all the details <a href="https://www.collabra.org/article/10.1525/collabra.240/">here</a>.</p>
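<p>To give a flavour of how the trick works for an order restriction such as $\mathcal{H}_r$, here is a rough sketch. It does <em>not</em> reproduce the analysis in the paper: the posterior draws below are made up for illustration, and the prior probability of $\frac{1}{4}$ assumes that the prior treats the four house means exchangeably, so that each mean is equally likely to be the largest.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
nr_draws <- 1e5

# Made-up posterior draws of the four house means under the unrestricted model;
# these values are purely illustrative and are NOT the estimates from the paper
post <- cbind(
  H = rnorm(nr_draws, 0.40, 0.10),
  G = rnorm(nr_draws, 0.00, 0.10),
  R = rnorm(nr_draws, 0.05, 0.10),
  S = rnorm(nr_draws, -0.30, 0.10)
)

# Assumption: under an exchangeable prior, each mean is largest with probability 1 / 4
prior_prop <- 1 / 4

# Proportion of posterior draws in which the Hufflepuff mean is the largest
posterior_prop <- mean(post[, "H"] > pmax(post[, "G"], post[, "R"], post[, "S"]))

BF_r1 <- posterior_prop / prior_prop
BF_r1</code></pre></figure>
<p>Since the posterior proportion cannot exceed one, the Bayes factor in favour of this restriction compared to the unrestricted model is at most $4$. By transitivity, multiplying $\text{BF}_{r1}$ by $\text{BF}_{10}$ gives the evidence for the restricted hypothesis compared to $\mathcal{H}_0$.</p>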
<h1 id="conclusion">Conclusion</h1>
<p>Participating in a relaxing prediction contest, we saw how three subjective Bayesians named Ron, Harry, and Hermione formalized their beliefs and derived their predictions about the likely outcome of ten coin flips. By restricting her prior beliefs about the bias of the coin to exclude values smaller than $\theta = 0.50$, Hermione was the boldest in her predictions and was ultimately rewarded. However, if the outcome of the coin flips had turned out differently, say $y = 2$, then Hermione would have immediately realized how wrong her beliefs were. I think we as scientists need to be more like Hermione: we need to make more precise predictions, allowing us to construct more powerful tests and “fail” in insightful ways.</p>
<p>We also saw a neat trick by which one can compute the Bayes factor in favour of a restricted model compared to an unrestricted model by estimating the proportion of prior and posterior values of the parameter that are in line with the restriction — no painstaking computation of marginal likelihoods required! We used this trick to find evidence for what we all knew deep in our hearts already: Hufflepuffs are <em>so</em> agreeable.</p>
<hr />
<p><em>I would like to thank Sophia Crüwell and Lea Jakob for helpful comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Klugkist, I., Kato, B., & Hoijtink, H. (<a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9574.2005.00279.x">2005</a>). Bayesian model selection using encompassing priors. <em>Statistica Neerlandica, 59</em>(1), 57-69.</li>
<li>Wetzels, R., Grasman, R. P., & Wagenmakers, E. J. (<a href="https://www.sciencedirect.com/science/article/pii/S0167947310001180">2010</a>). An encompassing prior generalization of the Savage–Dickey density ratio. <em>Computational Statistics & Data Analysis, 54</em>(9), 2094-2102.</li>
<li>Etz, A., Haaf, J. M., Rouder, J. N., & Vandekerckhove, J. (<a href="https://journals.sagepub.com/doi/full/10.1177/2515245918773087">2018</a>). Bayesian inference and testing any hypothesis you can specify. <em>Advances in Methods and Practices in Psychological Science, 1</em>(2), 281-295.</li>
<li>Crysel, L. C., Cook, C. L., Schember, T. O., & Webster, G. D. (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0191886915002615">2015</a>). Harry Potter and the measures of personality: Extraverted Gryffindors, agreeable Hufflepuffs, clever Ravenclaws, and manipulative Slytherins. <em>Personality and Individual Differences, 83</em>, 174-179.</li>
<li>Jakob, L., Garcia-Garzon, E., Jarke, H., & Dablander, F. (<a href="https://www.collabra.org/article/10.1525/collabra.240/">2019</a>). The Science Behind the Magic? The Relation of the Harry Potter “Sorting Hat Quiz” to Personality and Human Values. <em>Collabra: Psychology, 5</em>(1).</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The analytical solution is <a href="https://www.wolframalpha.com/input/?i=Integral%5Btheta%5Ey+*+%281+-+theta%29%5E%28n+-+y%29%2C+theta%2C+0.50%2C+1%5D">unpleasant</a>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>You can discover your Hogwarts House affiliation at <a href="https://www.pottermore.com/">https://www.pottermore.com/</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>