Jekyll2020-06-19T08:40:20+00:00https://fabiandablander.com/feed.xmlFabian DablanderPhD Student Methods & StatisticsFabian DablanderVisualising the COVID-19 Pandemic2020-06-19T08:30:00+00:002020-06-19T08:30:00+00:00https://fabiandablander.com/r/Covid-Overview<p><em>This blog post first appeared on the <a href="https://scienceversuscorona.com/visualising-the-covid-19-pandemic/">Science versus Corona blog</a>. It introduces <a href="https://scienceversuscorona.shinyapps.io/covid-overview/">this Shiny app</a>.</em></p>
<p>The novel coronavirus has a firm grip on nearly all countries across the world, and there is large heterogeneity in how countries have responded to the threat.</p>
<p>Some countries, such as <a href="https://www.theguardian.com/world/2020/jun/05/brazil-coronavirus-covid-19-virus-doctor">Brazil</a> and the <a href="https://www.theguardian.com/us-news/2020/mar/28/trump-coronavirus-politics-us-health-disaster">United States</a>, have fared exceptionally poorly. Other countries, such as <a href="https://www.theatlantic.com/ideas/archive/2020/05/whats-south-koreas-secret/611215/">South Korea</a> and <a href="https://www.weforum.org/agenda/2020/05/how-germany-contained-the-coronavirus/">Germany</a>, have done exceptionally well. Many countries have faithfully executed lockdown measures, which have had an extraordinary preventive effect in saving lives (e.g., Flaxman et al., <a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>). While lockdowns have saved lives, they have had an extremely detrimental effect on rich countries such as the United Kingdom, whose <a href="https://www.bbc.com/news/business-53019360">GDP dropped by 20.4% in April</a> (see also Pichler et al., <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>), and the United States, where <a href="https://www.theguardian.com/business/2020/may/28/jobless-america-unemployment-coronavirus-in-figures">over 40 million people filed for unemployment</a>. Lockdowns have been even <a href="https://www.economist.com/international/2020/05/23/covid-19-is-undoing-years-of-progress-in-curbing-global-poverty">more devastating for developing countries</a>.</p>
<p>It is insightful to study the past course of how the virus swept across the world, and how countries have tried to fight it. But with about 8,100,000 confirmed cases, over 430,000 deaths, and many countries slowly reopening amid an accelerating pandemic, it is even more important to pay close attention now in order to learn from each other. Many excellent overviews comparing confirmed cases, deaths, and measures to curb the spread of the virus taken across countries have been produced by leading newspapers.</p>
<h2 id="visualising-the-pandemic">Visualising the Pandemic</h2>
<p>The Financial Times has been an <a href="https://www.ft.com/content/a26fbf7e-48f8-11ea-aeb3-955839e06441">excellent source</a> of information and visualisations since the start of the pandemic. Their visualisations show, for example, that while the epicenter of the pandemic was initially Europe, it has since shifted toward Latin America, which now accounts for most deaths. They have also <a href="https://ig.ft.com/coronavirus-lockdowns/">produced a visualisation</a> of how countries are lifting lockdown measures using the Oxford Stringency Index, produced by the Oxford COVID-19 Government Response Tracker.</p>
<p>The <a href="https://www.bsg.ox.ac.uk/research/research-projects/coronavirus-government-response-tracker">Oxford COVID-19 Government Response Tracker</a> collects information on different policy responses governments across the world have taken. Currently, they are tracking 17 measures taken in over 160 countries. The Oxford Stringency Index is a composite score ranging from 0 to 100 which summarizes a number of measures a country has taken (or not taken). In particular, these measures concern (1) school closures, (2) workplace closures, (3) the cancelling of public events, (4) restrictions on gatherings, (5) the closing of public transport, (6) stay at home requirements, (7) restrictions on internal movement, (8) international travel controls, and (9) public information campaigns. These measures differ in their strength, and whether they are applied generally or are targeted; for details, see Hale et al. (<a href="https://www.bsg.ox.ac.uk/research/publications/variation-government-responses-covid-19">2020a</a>). The Oxford Response Tracker is updated frequently, and now also has a Government Response Index and a Containment and Health Index (Hale et al., <a href="https://www.bsg.ox.ac.uk/research/publications/variation-government-responses-covid-19">2020a</a>).</p>
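<p>As a rough illustration of how such a composite score can be formed, the sketch below rescales nine ordinal indicators to 0-100 and averages them. The indicator names, the maximum levels, and the plain averaging rule are assumptions made for illustration; the official index additionally accounts for whether a measure is targeted or general (see Hale et al., 2020a).</p>

```python
# Illustrative only: indicator names and maximum levels are assumed,
# and the targeted/general adjustment of the official index is ignored.
MAX_LEVEL = {
    "school_closing": 3,
    "workplace_closing": 3,
    "cancel_public_events": 2,
    "restrictions_on_gatherings": 4,
    "close_public_transport": 2,
    "stay_at_home": 3,
    "internal_movement": 2,
    "international_travel": 4,
    "public_information": 2,
}


def composite_index(levels):
    """Average of the indicators after rescaling each one to 0-100."""
    scores = [100 * levels.get(name, 0) / top for name, top in MAX_LEVEL.items()]
    return sum(scores) / len(scores)


no_measures = {name: 0 for name in MAX_LEVEL}
strictest = dict(MAX_LEVEL)
schools_only = dict(no_measures, school_closing=3)

print(composite_index(no_measures), composite_index(strictest))
print(round(composite_index(schools_only), 1))
```

<p>Because every indicator carries equal weight in this sketch, closing schools completely moves the score by one ninth of the scale, which is why looking at individual measures can be more informative than the composite alone.</p>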
<p>The New York Times has also started producing <a href="https://www.nytimes.com/interactive/2020/world/coronavirus-maps.html">beautiful visualisations</a> that summarize how the virus is ravaging different parts of the world. I especially like their world map, which not only shows the daily confirmed cases but also the 14-day smoothed trend. Possibly inspired by <a href="https://www.endcoronavirus.org/countries">endcoronavirus.org</a>, the site also gives an overview of where cases are increasing, roughly staying the same, or decreasing. They also provide a more detailed picture of specific countries, showing for example each state and even county of the <a href="https://www.nytimes.com/interactive/2020/us/coronavirus-us-cases.html">United States</a>, or the <a href="https://www.nytimes.com/interactive/2020/world/asia/india-coronavirus-cases.html">states of India</a>.</p>
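<p>A smoothed trend of this kind can be approximated with a trailing moving average. The sketch below is a minimal version under the assumption that a plain trailing mean is used; how exactly the New York Times smooths its series is not specified here, and the function name is hypothetical.</p>

```python
def trailing_mean(daily_counts, window=14):
    """Mean of the last `window` days, using fewer days at the start of the series."""
    smoothed = []
    for day in range(len(daily_counts)):
        start = max(0, day - window + 1)
        recent = daily_counts[start:day + 1]
        smoothed.append(sum(recent) / len(recent))
    return smoothed


# Hypothetical daily case counts, smoothed with a short window for readability.
print(trailing_mean([10, 20, 30, 40], window=2))
```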
<p>Finally, ourworldindata.org has what I believe are the <a href="https://ourworldindata.org/coronavirus">most comprehensive COVID-19 visualisations</a>.</p>
<h2 id="another-visualisation">Another Visualisation</h2>
<p>Inspired by <a href="https://www.politico.eu/article/europes-coronavirus-lockdown-measures-compared/">this Politico piece</a>, <a href="https://nl.linkedin.com/in/ialmi">Alexandra Rusu</a>, <a href="https://www.sowi.uni-mannheim.de/en/meiser/team/research-staff/marcel-schreiner/">Marcel Schreiner</a>, <a href="https://www.atomasevic.com/">Aleksandar Tomašević</a>, and I — joining forces through <a href="https://scienceversuscorona.com/">Science versus Corona</a> — set out to work on our own visualisation before much of the excellent work by major newspapers was available. You can find it <a href="https://scienceversuscorona.shinyapps.io/covid-overview/">here</a>. We use the wonderful <a href="https://covid19datahub.io/">covid19 R package</a> as a data source.</p>
<p>Being written in R and Shiny, our app does not approach the beauty that comes with handcrafting JavaScript; yet it shows a few useful things that some of the above visualisations lack. First, it allows you to explore the evolution of individual measures — such as closing schools and international travel controls — countries have taken instead of reporting only a composite stringency index.</p>
<p>Second, our app visualises confirmed cases and confirmed deaths jointly with the stringency index in a single figure. This allows you to explore how they evolve together, and see whether deaths in countries that lift measures quickly rise soon thereafter or not. (You might find that imposing measures causes death, <a href="https://www.tylervigen.com/spurious-correlations">ha</a>!)</p>
<p>Third, our app includes a table that lists the individual measures countries are taking, and, if they have done so, when they have lifted them. Individual rows are coloured according to how close each country is to the WHO recommendations for rolling back lockdowns (see Hale et al., <a href="https://www.bsg.ox.ac.uk/research/publications/lockdown-rollback-checklist">2020b</a>). These WHO recommendations concern whether (1) virus transmission is controlled, (2) testing, tracing, and isolation are performed adequately, (3) outbreak risk in high-risk settings is minimized, (4) preventive measures are established in workplaces, (5) risk of exporting and importing cases from high-risk areas is managed, and (6) the public is engaged, understands that this is the ‘new normal’, and understands that they have a key role in preventing an increase in cases (see WHO, <a href="https://apps.who.int/iris/handle/10665/331773">2020</a>). Data concerning (4) and (5) are not in the Oxford database; we instead use the approach outlined in Hale et al. (<a href="https://www.bsg.ox.ac.uk/research/publications/lockdown-rollback-checklist">2020b</a>).</p>
<h2 id="caveats">Caveats</h2>
<p>Importantly, there are a number of caveats associated with interpreting the data we show in the app. First, the number of confirmed cases depends strongly on the number of tests a particular country conducts. Without knowing that, it is foolish to put much trust in comparisons of cases across countries. Hasell et al. (2020) provide a data set and a visualisation of <a href="https://ourworldindata.org/coronavirus-testing">coronavirus testing</a> per country, which is measured in number of tests per confirmed case or by one over that number (the so-called positivity rate). When the number of tests carried out per confirmed case is low, a country does too little testing to adequately monitor the outbreak — the true number of infections is likely much larger.</p>
<p>Another caveat concerns deaths. Confirmed deaths provide a clearer lens into how the pandemic unfolds, as every death in a country has to be reported. This is also why e.g. Flaxman et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>) model confirmed deaths rather than confirmed cases to assess the effect of interventions. However, using confirmed deaths to compare how successful countries are in dealing with the virus has limitations as well. Since deaths take at least a week or two to materialize, they are a window into the past, not the present; deaths are thus not a real-time indicator to decide whether to impose or lift measures.</p>
<p>There is also large variation in how deaths are reported, both across countries and over time. Some countries only count hospital deaths, for example, thus leading to an underestimate of deaths caused by COVID-19 at home. Or they include only deaths of patients that have tested positive for the virus. Authoritarian regimes might also downplay cases to look better. Moreover, due to delays in reporting, new deaths per day do not necessarily reflect the actual number of deaths that day.</p>
<p>Demographics also play an important role; some countries are much more densely populated, providing easier transmission routes for the virus. Others, such as countries in Africa, have a much younger population, making a severe disease progression less likely (e.g., Clark et al., <a href="https://bit.ly/3hS3vWy">2020</a>); with a healthcare system that is much less advanced compared to rich nations, however, Africa may well become the next epicenter of the pandemic (Loembé et al., <a href="https://www.nature.com/articles/s41591-020-0961-x">2020</a>). All these factors make <a href="https://www.bbc.com/news/52311014">international comparisons difficult</a>.</p>
<p>A different angle on COVID-19’s toll on human life is to calculate excess deaths by subtracting, say, the average number of deaths in the previous five years in a particular time period from the number of deaths during that time period now. Unlike for confirmed deaths, <a href="https://ourworldindata.org/excess-mortality-covid">numbers on excess deaths</a> are available only for a selected number of (mostly rich) countries, and there is no central data source. <a href="https://www.economist.com/graphic-detail/2020/04/16/tracking-covid-19-excess-deaths-across-countries">The Economist</a> was one of the first outlets to visualize excess deaths; the <a href="https://www.ft.com/content/a26fbf7e-48f8-11ea-aeb3-955839e06441">Financial Times</a> and the <a href="https://www.nytimes.com/interactive/2020/04/21/world/coronavirus-missing-deaths.html">New York Times</a> provide visualisations of excess deaths, too.</p>
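<p>The excess death calculation described above amounts to a single subtraction. The sketch below uses invented weekly death counts and a hypothetical function name; it takes the five-year average as the baseline, which is only one of several possible choices.</p>

```python
from statistics import mean


def excess_deaths(deaths_now, deaths_previous_years):
    """Deaths in the current period minus the average of the same period in earlier years."""
    baseline = mean(deaths_previous_years)
    return deaths_now - baseline


# Hypothetical deaths in the same calendar week of the previous five years.
previous_five_years = [1000, 1040, 980, 1010, 970]
print(excess_deaths(1350, previous_five_years))
```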
<h2 id="conclusion">Conclusion</h2>
<p>In this blog post, I have outlined a number of excellent visualisations of the COVID-19 pandemic and introduced <a href="https://scienceversuscorona.shinyapps.io/covid-overview/">our own</a>. <a href="https://nl.linkedin.com/in/ialmi">Alexandra Rusu</a>, <a href="https://www.sowi.uni-mannheim.de/en/meiser/team/research-staff/marcel-schreiner/">Marcel Schreiner</a>, <a href="https://www.atomasevic.com/">Aleksandar Tomašević</a> (with whom it was an absolute pleasure to work on this), and I are planning to develop the visualisation further, adding things such as the number of tests, excess deaths, and the new Oxford indices, and we encourage anybody who is interested to contribute! All the code is available on <a href="https://github.com/fdabl/Covid-Overview">GitHub</a>.</p>
<hr />
<p>I want to thank Alexandra Rusu, Marcel Schreiner, and Aleksandar Tomašević for a very enjoyable collaboration.</p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Clark, Jit, Warren-Gash et al. (<a href="https://bit.ly/3hS3vWy">2020</a>). Global, regional, and national estimates of the population at increased risk of severe COVID-19 due to underlying health conditions in 2020: A modelling study. <em>The Lancet</em>.</li>
<li>Flaxman, Mishra, Gandy et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>). Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. <em>Nature</em>, 3164.</li>
<li>Loembé, M. M., Tshangela, A., Salyer, S. J., Varma, J. K., Ouma, A. E. O., & Nkengasong, J. N. (<a href="https://www.nature.com/articles/s41591-020-0961-x">2020</a>). COVID-19 in Africa: the spread and response. <em>Nature Medicine</em>, 1-4.</li>
<li>Pichler, A., Pangallo, M., del Rio-Chanona, R. M., Lafond, F., & Farmer, J. D. (<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>). Production networks and epidemic spreading: How to restart the UK economy?</li>
</ul>
Interactive exploration of COVID-19 exit strategies2020-06-11T10:30:00+00:002020-06-11T10:30:00+00:00https://fabiandablander.com/r/Covid-Exit<p>The COVID-19 pandemic will end only when a sufficient number of people have become immune, thus preventing future outbreaks. Principally, so-called <em>exit strategies</em> differ on whether immunity is achieved through natural infections, or whether it is achieved through a vaccine. Countries across the world are scrambling to find an adequate exit strategy, with <a href="https://www.endcoronavirus.org/countries">differential success</a>.</p>
<p>To model different exit strategies from an epidemiological standpoint, de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>) developed a stochastic individual-based SEIR model which allows for inter-individual differences in how effectively individuals spread the virus and how well individuals adhere to measures designed to curb virus transmission. The model also allows for preferential mixing of individuals with similar contact rates. A key innovation of the model is that it stratifies the population into communities and regions within which transmission mainly occurs. Their paper is excellent and insightful, and I encourage you to read it.</p>
<p>To make the underlying model more easily accessible, <a href="https://twitter.com/luc_coffeng">Luc Coffeng</a> and I have developed a <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> that allows you to explore these exit strategies interactively. In this blog post, I provide a brief overview of the Shiny app and ideas about possible model extensions. Note that I am not an epidemiologist, and my aim here is not to endorse different exit strategies nor to make policy recommendations.</p>
<p>This work was carried out under the umbrella of <a href="https://scienceversuscorona.com/">Science versus Corona</a>, an initiative I founded together with <a href="https://dennyborsboom.com/">Denny Borsboom</a>, <a href="https://twitter.com/tfblanken">Tessa Blanken</a>, and <a href="https://twitter.com/CharlotteCTanis">Charlotte Tanis</a>.</p>
<h1 id="modeling-exit-strategies">Modeling exit strategies</h1>
<p>The two figures below illustrate particular parameterizations of five different exit strategies: Radical Opening, Phased Lift of Control, Intermittent Lockdown, Flattening the Curve, and Contact Tracing. The first four strategies aim for a (controlled) build-up of herd immunity through natural infection, while Contact Tracing aims to minimize cases until a vaccine is available.</p>
<p>Since the model is stochastic, that is, events in the simulations occur randomly according to pre-defined probabilities, the black solid lines in the figures below show a number of possible trajectories. Note that the dashed vertical lines below indicate interventions, with lines before day 0 indicating interventions specific to the Netherlands during the initial lockdown, and lines from day 0 onwards interventions that are specific to the exit strategies.</p>
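<p>To see why repeated runs of a stochastic model trace out different trajectories, consider the minimal sketch below. It is a simple chain-binomial SEIR model for a single well-mixed population, not the individual-based model of de Vlas & Coffeng: it has no communities, no heterogeneity between individuals, and no interventions, and all parameter values are assumptions chosen for illustration.</p>

```python
import math
import random


def simulate_seir(days, population=1000, initial_infectious=5, beta=0.5,
                  incubation_days=4.0, infectious_days=5.0, seed=None):
    """Simulate one random trajectory; returns a list of daily (S, E, I, R) counts."""
    rng = random.Random(seed)
    s, e, i, r = population - initial_infectious, 0, initial_infectious, 0
    p_onset = 1 - math.exp(-1 / incubation_days)    # daily chance that E becomes I
    p_recover = 1 - math.exp(-1 / infectious_days)  # daily chance that I becomes R
    history = [(s, e, i, r)]
    for _ in range(days):
        # Daily infection risk grows with the share of infectious people.
        p_infect = 1 - math.exp(-beta * i / population)
        # Each transition count is a binomial draw, made one individual at a time.
        new_exposed = sum(rng.random() < p_infect for _ in range(s))
        new_infectious = sum(rng.random() < p_onset for _ in range(e))
        new_recovered = sum(rng.random() < p_recover for _ in range(i))
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered
        r += new_recovered
        history.append((s, e, i, r))
    return history


# Five runs with identical parameters but different random seeds.
runs = [simulate_seir(days=100, seed=k) for k in range(5)]
for k, run in enumerate(runs):
    peak = max(i for _, _, i, _ in run)
    print(f"run {k}: peak infectious = {peak}")
```

<p>Every run uses the same parameters, yet the timing and height of the peak will generally differ from run to run, which is what the bundles of black lines in the figures convey.</p>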
<!-- Each figure comprises four panels, showing --- per million --- (a) the number of infectious cases, (b) the number of new cases in intensive care per day, (c) the number of cases present in intensive care (IC), and (d) the proportion of people having recovered. The horizontal dashed line in (a) and (c) indicate the intensive care capacity for the Netherlands; the horizontal line in (d) indicates the herd immunity threshold. The red dots show intensive care data for the Netherlands. Since the model is stochastic, repeated simulations follow slightly different trajectories, as indicated by the black solid lines. -->
<div style="text-align:center;">
<img src="../assets/img/exit-strategies-I.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="850" height="950" />
</div>
<p>Radical Opening lifts all measures at once on day 0, resulting in a huge increase in the number of infections per million, as the top panel shows. The dashed horizontal line indicates the number of infections at which the intensive care capacity is reached in the Netherlands, which is 6000 infections per million inhabitants. The second panel shows the simulated number of new cases in intensive care per day, with the red dots showing the actual number of cases in intensive care in the Netherlands. The third panel shows the number of cases that are present in intensive care per million; the dashed horizontal line indicates the number of beds per million (115) that are available for COVID-19 cases in the Netherlands. Radical Opening massively overshoots this capacity, which would result in a large number of excess deaths. The bottom panel shows that herd immunity is reached quickly, yet <a href="https://www.nytimes.com/2020/05/01/opinion/sunday/coronavirus-herd-immunity.html">overshoots</a>.</p>
<p>Phased Lift of Control, as proposed by de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>), splits a country into geographical units and lifts the measures in one unit at a time; the time points at which measures are lifted are indicated by the vertical dotted lines. In contrast to Radical Opening, Phased Lift of Control as presented here does not overburden the healthcare system and thus causes no excess deaths (note the $y$-axis difference). However, the strategy still aims at achieving herd immunity naturally, and so depending on who exactly gets infected, there will be deaths proportional to the case fatality ratio of that subpopulation. Phased Lift of Control allows a natural epidemic within the region where measures are being lifted, and so it overshoots herd immunity regionally and therefore nationally as well, as seen in the bottom panel. As a side note, overshoot does not occur when 25% of the participants “remain in hiding” when control measures are lifted (Luc Coffeng, personal communication), which strikes me as a realistic scenario; overall, Phased Lift of Control is robust to this non-participation (see <a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2.supplementary-material">Supplementary 3</a> in de Vlas & Coffeng, 2020).</p>
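<p>For intuition about the herd immunity threshold, the sketch below uses the textbook formula for a homogeneously mixing population, 1 - 1/R0. This is an assumption made for illustration: the model discussed here includes heterogeneity between individuals, so its threshold need not match this formula, and the R0 values are hypothetical.</p>

```python
def herd_immunity_threshold(r0):
    """Textbook threshold 1 - 1/R0; zero when the epidemic cannot grow (R0 at most 1)."""
    if r0 <= 1:
        return 0.0
    return 1 - 1 / r0


# Hypothetical basic reproduction numbers.
for r0 in (1.5, 2.5, 4.0):
    print(f"R0 = {r0}: threshold = {herd_immunity_threshold(r0):.0%}")
```

<p>Overshoot arises because infections do not stop the moment the threshold is crossed: the many people who are infectious at that point still pass the virus on, only to fewer than one other person on average.</p>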
<p>The intention of Intermittent Lockdown is to reinstate lockdown measures just before intensive care units are at full capacity. Compared to Phased Lift of Control, the Intermittent Lockdown exit strategy does not use the intensive care capacity efficiently, as some intensive care beds remain unused during periods of lockdown (see days 200 - 600). Moreover, the strategy comes with a high risk of overshooting intensive care capacity (see days 0 - 200 and days 600 - 750).</p>
<div style="text-align:center;">
<img src="../assets/img/exit-strategies-II.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="750" height="550" />
</div>
<p>Flattening the Curve aims to balance the number of infections so that the healthcare system does not become overburdened by relaxing interventions after an initial lockdown. If not enough interventions are lifted (as in this example), herd immunity hardly develops (e.g., see day 400). Conversely, if too many interventions are lifted (or people adhere poorly to interventions), case numbers may increase beyond health care capacity (e.g., see day 500). As the bottom panel shows, this version of Flattening the Curve does not reach herd immunity even after 1200 days.</p>
<p>In contrast to all strategies so far, the Contact Tracing exit strategy does not aim for natural herd immunity. Instead, it aims to keep the number of infections low until a vaccine is developed, with vaccine development being a <a href="https://www.nytimes.com/interactive/2020/06/09/magazine/covid-vaccine.html">highly complex undertaking</a> that may take years. Until that point, due to the low proportion of people who have acquired immunity, large outbreaks are possible at all times, and this is indeed what the figure above shows. There is some debate on how well the testing, tracing, and isolating of infectious and exposed cases will <a href="https://www.sciencemag.org/news/2020/05/countries-around-world-are-rolling-out-contact-tracing-apps-contain-coronavirus-how">work in practice</a>, and you can play around with these parameters in the Shiny app. Heterogeneity might work in our favour, however. Recent estimates suggest that the spread of the novel coronavirus is <a href="https://www.sciencemag.org/news/2020/05/why-do-some-covid-19-patients-infect-many-others-whereas-most-don-t-spread-virus-all">largely driven by superspreading events</a> (see also Althouse et al. <a href="https://arxiv.org/abs/2005.13689">2020</a>), which has <a href="https://www.nytimes.com/2020/06/02/opinion/coronavirus-superspreaders.html">ramifications for control</a>. Heterogeneity in networks that connect individuals can also increase the efficiency of contact tracing (Kojaku et al., <a href="https://arxiv.org/abs/2005.02362">2020</a>).</p>
<p>The <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> describes these exit strategies and their different parameterizations in more detail, and allows you to interactively compare variations of them. Except Radical Opening, all exit strategies that aim at herd immunity presented above take an extraordinary amount of time to reach it. Indeed, <a href="https://mrc-ide.github.io/covid19estimates/#/total-infected">modeling suggests</a>, and recent seroprevalence studies confirm, that <a href="https://www.nytimes.com/interactive/2020/05/28/upshot/coronavirus-herd-immunity.html">we are far from herd immunity</a>. I am not espousing these types of exit strategies here, and they make me feel a little uneasy (compare <a href="https://medium.com/@tomaspueyo/coronavirus-should-we-aim-for-herd-immunity-like-sweden-b1de3348e88b">the case of Sweden</a>). An assessment of these and other exit strategies that do not aim at herd immunity through natural infection requires input from multiple disciplines, and goes far beyond this blog post and the Shiny app. The goal of the <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> is instead to allow you to see how robust various exit strategies are to changes in their parameters, and how they compare to each other from a purely epidemiological standpoint.</p>
<h1 id="model-extensions">Model extensions</h1>
<p>The modeling work by de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>) is impressive, and I again encourage you to read up on it; see especially their <a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2.supplementary-material">Supplementary 1</a>. Here, I want to briefly mention a number of interesting dimensions along which the model could be extended, with some being more realistic than others.</p>
<p>First, the model currently assumes life-long immunity (or at least immunity for the duration of the simulation), which is unrealistic. Depending on the exact duration of immunity, the dynamics of the exit strategy simulations presented above will change. For an investigation of how seasonality and immunity might influence the course of the pandemic, see Kissler et al. (<a href="https://science.sciencemag.org/content/368/6493/860">2020</a>).</p>
<p>Second, the model currently does not stratify the population according to age, the most important risk factor for mortality. Extending the model in this way would allow one to model interventions targeted at a particular age group, as well as assess mortality in a more detailed manner. The model currently also does not simulate mortality, so deaths have to be computed from the number of infections and an estimate of the case fatality ratio. Needless to say, if the prevalence of people who require intensive care exceeds the intensive care capacity, mortality will be much higher.</p>
<p>Third, the model assumes that individuals live in clusters (e.g., villages), which are part of superclusters (e.g., provinces), which together make up a country. It allows for heterogeneity among contact rates and preferential mixing of individuals with similar contact behaviour, but currently does not incorporate an explicit network structure. Instead, it assumes that, barring very strong preferential mixing, every individual is connected to every other individual. Adding a network structure would result in a more realistic assessment of interventions such as contact tracing, with potentially large ramifications (e.g., Kojaku et al., <a href="https://arxiv.org/abs/2005.02362">2020</a>).</p>
<p>Fourth, the exit strategies presented above are somewhat monolithic. Except for Radical Opening and Contact Tracing, they work by reducing transmission over the period of time during which measures are in place. Contact Tracing is slightly more involved, and you can read more details in the <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a>. This coarse-grained approach ignores the finer-grained choices governments have to make: should schools be re-opened? What about hairdressers and church services? International travel? A more detailed exploration of the effect of exit strategies would associate each such intervention with a reduction in transmission, and simulate what would happen when it is lifted or enforced. Needless to say, this requires a good understanding of how such interventions reduce virus spread (see e.g., Chu et al., <a href="https://www.sciencedirect.com/science/article/pii/S0140673620311429">2020</a>), an understanding we are currently lacking. <a href="https://theconversation.com/lockdown-we-need-to-experiment-with-reopenings-now-to-prevent-a-second-wave-138741">Systematic experimentation</a> might help.</p>
<h1 id="multidisciplinary-assessment">Multidisciplinary assessment</h1>
<p>Lastly, the pandemic affects not only the physical health of citizens, but has also inflicted severe economic and psychological damage. While models that focus on a single aspect of the pandemic can yield valuable insights, they should ideally combine different disciplinary perspectives to provide a holistic assessment of exit strategies. Recently, various works have combined economic and epidemiological modeling. For example, using the UK as a case study, Pichler et al. (<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>) compare strategies that differ in which sectors they would reopen; even radical opening would reduce the GDP by 16 percentage points compared to pre-lockdown levels, all the while keeping the effective reproductive number $R_t$ above 1.</p>
<p>But there are other disciplines that could chip in besides epidemiology and economics, such as psychology, law, and history. Some would provide a quantitative assessment, for example by formalizing the effect of different interventions such as opening schools or closing churches. What are the epidemiological effects of opening schools? How do school closures adversely affect the educational development of children? In what ways do they increase existing economic inequalities? Others would provide a more qualitative assessment. For example, what are the legal ramifications of “protecting the elderly”, which sounds sensible but has a discriminatory undertone? From a historical perspective, what lessons can we learn from citizens’ behaviour, such as <a href="https://www.theguardian.com/world/2020/apr/29/coronavirus-pandemic-1918-protests-california">anti-mask protests</a>, in past pandemics? All these interventions and effects interact in complex ways, severely complicating analysis; but who said it would be easy?</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, I have described a <a href="https://scienceversuscorona.shinyapps.io/covid-exit/">Shiny app</a> which allows you to interactively explore different exit strategies using the epidemiological model described in de Vlas & Coffeng (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>). I have discussed potential model extensions and the need for a multidisciplinary assessment of exit strategies. Overall, the modeling suggests that exit strategies aimed at the controlled build-up of immunity will take a long time; but so might <a href="https://www.nytimes.com/interactive/2020/science/coronavirus-vaccine-tracker.html">waiting for a vaccine</a>. Best to brace for the long haul.</p>
<hr />
<p>I want to thank Luc Coffeng for an insightful collaboration and valuable comments on this blog post. Thanks also to Denny Borsboom, Tessa Blanken, and Charlotte Tanis for helpful comments on this blog post and for being a great team.</p>
<hr />
<p><em>This blog post has also been posted to the <a href="https://scienceversuscorona.com/interactive-exploration-of-covid-19-exit-strategies">Science versus Corona blog</a>.</em></p>
<h2 id="references">References</h2>
<ul>
<li>Althouse, B. M., Wenger, E. A., Miller, J. C., Scarpino, S. V., Allard, A., Hébert-Dufresne, L., & Hu, H. (<a href="https://arxiv.org/abs/2005.13689">2020</a>). Stochasticity and heterogeneity in the transmission dynamics of SARS-CoV-2. <em>arXiv preprint arXiv:2005.13689</em>.</li>
<li>Chu, D. K., Akl, E. A., Duda, S., Solo, K., Yaacoub, S., Schünemann, H. J., … & Hajizadeh, A. (2020). Physical distancing, face masks, and eye protection to prevent person-to-person transmission of SARS-CoV-2 and COVID-19: A systematic review and meta-analysis. <em>The Lancet</em>.</li>
<li>de Vlas, S. J., & Coffeng, L. E. (<a href="https://www.medrxiv.org/content/10.1101/2020.03.29.20046011v2">2020</a>). A phased lift of control: a practical strategy to achieve herd immunity against Covid-19 at the country level. <em>medRxiv</em>.</li>
<li>Flaxman, Mishra, Gandy et al. (<a href="https://www.nature.com/articles/s41586-020-2405-7">2020</a>) Estimating the effects of non-pharmaceutical interventions on COVID-19 in Europe. <em>Nature</em>, 3164.</li>
<li>Kojaku, S., Hébert-Dufresne, L., & Ahn, Y. Y. (<a href="https://arxiv.org/abs/2005.02362">2020</a>). The effectiveness of contact tracing in heterogeneous networks. <em>arXiv preprint arXiv:2005.02362</em>.</li>
<li>Pichler, A., Pangallo, M., del Rio-Chanona, R. M., Lafond, F., & Farmer, J. D. (<a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3606984">2020</a>). Production networks and epidemic spreading: How to restart the UK economy?</li>
</ul>Fabian Dablander
Infectious diseases and nonlinear differential equations2020-03-22T12:30:00+00:002020-03-22T12:30:00+00:00https://fabiandablander.com/r/Nonlinear-Infection<p>Last summer, I wrote about <a href="https://fabiandablander.com/r/Linear-Love.html">love affairs and linear differential equations</a>. While the topic is cheerful, linear differential equations are severely limited in the types of behaviour they can model. In this blog post, which I wrote in self-quarantine to prevent further spread of SARS-CoV-2 (take that, cheerfulness), I introduce nonlinear differential equations as a means to model infectious diseases. In particular, we will discuss the simple SIR and SIRS models, the building blocks of many of the more complicated models used in epidemiology.</p>
<p>Before doing so, however, I discuss some of the basic tools of nonlinear dynamics applied to the logistic equation as a model for population growth. If you are already familiar with this, you can skip ahead. If you have had no prior experience with differential equations, I suggest you first check out my <a href="https://fabiandablander.com/r/Linear-Love.html">earlier post</a> on the topic.</p>
<p>I should preface this by saying that I am not an epidemiologist, and that no analysis I present here is specifically related to the current SARS-CoV-2 pandemic, nor should anything I say be interpreted as giving advice or making predictions. I am merely interested in differential equations, and as with love affairs, infectious diseases make a good illustrating case. So without further ado, let’s dive in!</p>
<h1 id="modeling-population-growth">Modeling Population Growth</h1>
<p>Before we start modeling infectious diseases, it pays to introduce the concepts needed to analyze nonlinear differential equations using a simple example: modeling population growth. Let $N > 0$ denote the size of a population and assume that its growth depends on itself:</p>
<script type="math/tex; mode=display">\frac{dN}{dt} = \dot{N} = r N \enspace .</script>
<p>As shown in a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>, this leads to exponential growth for $r > 0$:</p>
<script type="math/tex; mode=display">N(t) = N_0 e^{r t} \enspace ,</script>
<p>where $N_0 = N(0)$ is the initial population size at time $t = 0$. The figure below visualizes the differential equation (left panel) and its solution (right panel) for $r = 1$ and an initial population of $N_0 = 2$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
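<p>As a quick sanity check, the closed-form solution can be compared with a direct numerical integration of $\dot{N} = rN$. The following is a minimal Python sketch, not part of the original analysis; the function name <code>euler_exponential</code>, the step size, and the end time $t = 3$ are illustrative choices.</p>

```python
import math


def euler_exponential(n0, r, t_end, dt=1e-4):
    """Integrate dN/dt = r * N from t = 0 to t_end with forward Euler steps."""
    n = n0
    for _ in range(round(t_end / dt)):
        n += dt * r * n  # N(t + dt) is approximated by N(t) + dt * dN/dt
    return n


# Same setting as in the figure: r = 1 and N_0 = 2 (the end time is arbitrary)
n0, r, t_end = 2.0, 1.0, 3.0
numeric = euler_exponential(n0, r, t_end)
exact = n0 * math.exp(r * t_end)  # closed-form solution N(t) = N_0 * exp(r * t)

print(f"Euler: {numeric:.4f}, exact: {exact:.4f}")
```

<p>With a small step size the two numbers agree closely; forward Euler slightly underestimates exponential growth because each step uses the growth rate at the start of the step.</p>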
<p>This is clearly not a realistic model since the growth of a population depends on resources, which are finite. To model finite resources, we write:</p>
<script type="math/tex; mode=display">\dot{N} = rN \left(1 - \frac{N}{K}\right) \enspace ,</script>
<p>where $r > 0$ and $K$ is the so-called <em>carrying capacity</em>, that is, the maximum population size that can be sustained by the available resources. Observe that as long as $N < K$, the factor $(1 - N / K)$ shrinks as $N$ grows, slowing down the per-capita growth rate $\dot{N} / N$. If on the other hand $N > K$, then the population needs more resources than are available, and the growth rate becomes negative, resulting in population decrease.</p>
<p>For simplicity, let $K = 1$ and interpret $N \in [0, 1]$ as the proportion with respect to the carrying capacity; that is, $N = 1$ implies that we are at carrying capacity. The figure below visualizes the differential equation and its solution for $r = 1$ and an initial condition $N_0 = 0.10$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In contrast to exponential growth, the logistic equation leads to sigmoidal growth which approaches the carrying capacity. This is much more interesting behaviour than the linear differential equation above allows. In particular, the logistic equation has two <em>fixed points</em> — points at which the population neither increases nor decreases but stays fixed, that is, where $\dot{N} = 0$. These occur at $N = 0$ and at $N = 1$, as can be inferred from the left panel in the figure above.</p>
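<p>The behaviour described above can be reproduced with a few lines of code. The sketch below integrates the logistic equation with forward Euler steps; the helper names <code>logistic_rhs</code> and <code>simulate</code>, the step size, and the time horizon are assumptions made for illustration only.</p>

```python
def logistic_rhs(n, r=1.0, k=1.0):
    """Right-hand side of the logistic equation, dN/dt = r * N * (1 - N / K)."""
    return r * n * (1.0 - n / k)


def simulate(n0, r=1.0, k=1.0, t_end=20.0, dt=0.01):
    """Return the forward Euler trajectory of the logistic equation."""
    trajectory = [n0]
    n = n0
    for _ in range(round(t_end / dt)):
        n += dt * logistic_rhs(n, r, k)
        trajectory.append(n)
    return trajectory


# Starting below, at, and above the carrying capacity K = 1
for start in (0.0, 0.10, 1.0, 1.5):
    end = simulate(start)[-1]
    print(f"N_0 = {start:.2f} -> N(20) is approximately {end:.6f}")
```

<p>Trajectories that start exactly at $N = 0$ or $N = 1$ do not move, while the others approach $N = 1$ from below or above.</p>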
<h2 id="analyzing-the-stability-of-fixed-points">Analyzing the Stability of Fixed Points</h2>
<p>What is the stability of these fixed points? Intuitively, $N = 0$ should be unstable; if there are individuals, then they procreate and the population increases. Similarly, $N = 1$ should be stable: if $N < 1$, then $\dot{N} > 0$ and the population grows towards $N = 1$, and if $N > 1$, then $\dot{N} < 0$ and individuals die until $N = 1$.</p>
<p>To make this argument more rigorous, and to get a more quantitative assessment of how quickly perturbations move away from or towards a fixed point, we derive a differential equation for these small perturbations close to the fixed point (see also Strogatz, 2015, p. 24). Let $N^{\star}$ denote a fixed point and define $\eta(t) = N(t) - N^{\star}$ to be a small perturbation close to the fixed point. We derive a differential equation for $\eta$ by writing:</p>
<script type="math/tex; mode=display">\frac{d\eta}{dt} = \frac{d}{dt}\left(N(t) - N^{\star}\right) = \frac{dN}{dt} \enspace ,</script>
<p>since $N^{\star}$ is a constant. This implies that the dynamics of the perturbation equal the dynamics of the population. Let $f(N)$ denote the right-hand side of the differential equation for $N$, and observe that $N = N^{\star} + \eta$ such that $\dot{N} = \dot{\eta} = f(N) = f(N^{\star} + \eta)$. Recall that $f$ is a nonlinear function, and nonlinear functions are messy to deal with. Thus, we simply pretend that the function is linear close to the fixed point. More precisely, we approximate $f$ around the fixed point using a Taylor series (see <a href="https://www.youtube.com/watch?v=3d6DsjIBzJ4">this excellent video</a> for details) by writing:</p>
<script type="math/tex; mode=display">f(N^{\star} + \eta) = f(N^{\star}) + \eta f'(N^{\star}) + \mathcal{O}(\eta^2) \enspace ,</script>
<p>where $\mathcal{O}(\eta^2)$ collects the higher-order terms, which we ignore. Note that, by definition, there is no change at the fixed point, that is, $f(N^{\star}) = 0$. Assuming that $f'(N^{\star}) \neq 0$ (otherwise the higher-order terms would matter, as there would be nothing else), we have that close to a fixed point</p>
<script type="math/tex; mode=display">\dot{\eta} \approx \eta f'(N^{\star}) \enspace ,</script>
<p>which is a linear differential equation with solution:</p>
<script type="math/tex; mode=display">\eta(t) = \eta_0 e^{f'(N^{\star})t} \enspace .</script>
<p>Using this trick, we can assess the stability of $N^{\star}$ as follows. If $f'(N^{\star}) < 0$, the small perturbation $\eta(t)$ around the fixed point decays towards zero, and so the system returns to the fixed point: the fixed point is stable. On the other hand, if $f'(N^{\star}) > 0$, then the small perturbation $\eta(t)$ close to the fixed point grows, and so the system does not return to the fixed point: the fixed point is unstable. Applying this to our logistic equation, we see that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
f'(N) &= \frac{d}{dN} \left(rN(1 - N)\right) \\[0.50em]
&= \frac{d}{dN} \left(rN - rN^2\right) \\[0.50em]
& = r - 2rN \\[0.50em]
&= r(1 - 2N) \enspace .
\end{aligned} %]]></script>
<p>Plugging in our two fixed points $N^{\star} = 0$ and $N^{\star} = 1$, we find that $f'(0) = r$ and $f'(1) = -r$. Since $r > 0$, this confirms our suspicion that $N^{\star} = 0$ is unstable and $N^{\star} = 1$ is stable. In addition, this analysis tells us how quickly the perturbations grow or decay; for the logistic equation, this is given by $r$.</p>
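<p>The same classification can be done numerically when the derivative is tedious to compute by hand. The following Python sketch approximates $f'(N^{\star})$ with a central finite difference and reads off the stability from its sign; the helper names <code>derivative</code> and <code>classify</code> and the step size <code>h</code> are hypothetical and chosen for illustration.</p>

```python
def f(n, r=1.0):
    """Right-hand side of the logistic equation with K = 1."""
    return r * n * (1.0 - n)


def derivative(func, x, h=1e-6):
    """Central finite-difference approximation of func'(x)."""
    return (func(x + h) - func(x - h)) / (2.0 * h)


def classify(func, fixed_point):
    """Classify a fixed point by the sign of the linearized growth rate."""
    slope = derivative(func, fixed_point)
    if slope < 0:
        return "stable"
    if slope > 0:
        return "unstable"
    return "inconclusive"  # f'(N*) = 0: higher-order terms decide


for n_star in (0.0, 1.0):
    rate = derivative(f, n_star)
    print(f"N* = {n_star}: f'(N*) is approximately {rate:.3f} -> {classify(f, n_star)}")
```

<p>For $r = 1$ the estimated slopes are close to $+1$ and $-1$, matching $f'(0) = r$ and $f'(1) = -r$.</p>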
<p>In sum, we have linearized a nonlinear system close to fixed points in order to assess the stability of these fixed points, and how quickly perturbations close to these fixed points grow or decay. This technique is called <em>linear stability analysis</em>. In the next two sections, we discuss two ways to solve differential equations using the logistic equation as an example.</p>
<h2 id="analytic-solution">Analytic Solution</h2>
<p>In contrast to linear differential equations, which were the topic of a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>, nonlinear differential equations can usually not be solved analytically; that is, we generally cannot get an expression that, given an initial condition, tells us the state of the system at any time point $t$. The logistic equation can, however, be solved analytically and it might be instructive to see how. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dN}{dt} &= rN (1 - N) \\
\frac{dN}{N(1 - N)} &= r dt \\
\int \frac{1}{N(1 - N)} dN &= r t \enspace .
\end{aligned} %]]></script>
<p>Staring at this for a bit, we realize that we can use partial fractions to split the integral. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\int \frac{1}{N(1 - N)} dN &= r t \\[0.50em]
\int \frac{1}{N} dN + \int \frac{1}{1 - N}dN &= rt \\[0.50em]
\text{log}N - \text{log}(1 - N) + Z &= rt \\[0.50em]
e^{\text{log}N - \text{log}(1 - N) + Z} &= e^{rt} \enspace .
\end{aligned} %]]></script>
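<p>The split in the second line rests on the partial fraction decomposition of the integrand, which can be verified by bringing both terms onto a common denominator:</p>
<script type="math/tex; mode=display">\frac{1}{N} + \frac{1}{1 - N} = \frac{(1 - N) + N}{N(1 - N)} = \frac{1}{N(1 - N)} \enspace .</script>
<p>The minus sign in the third line appears because $\int \frac{1}{1 - N} dN = -\text{log}(1 - N)$.</p>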
<p>The exponents and the logs cancel each other nicely. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{e^{\text{log}N}}{e^{\text{log}(1 - N)}}e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} e^Z &= e^{rt} \\[0.50em]
\frac{N}{1 - N} &= e^{rt - Z} \\[0.50em]
N &= e^{rt - Z} - N e^{rt - Z} \\[0.50em]
N\left(1 + e^{rt - Z}\right) &= e^{rt - Z} \\[0.50em]
N &= \frac{e^{rt - Z}}{1 + e^{rt - Z}} \enspace .
\end{aligned} %]]></script>
<p>One last trick is to multiply both the numerator and the denominator by $e^{-rt + Z}$, which yields:</p>
<script type="math/tex; mode=display">N = \frac{\left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)}{\left(e^{-rt + Z}\right) + {\left(e^{-rt + Z}\right)\left(e^{rt - Z}\right)}} = \frac{1}{1 + e^{-rt + Z}} \enspace ,</script>
<p>where $Z$ is the constant of integration. To solve for it, we need the initial condition. Suppose that $N(0) = N_0$, which, using the third line in the derivation above and the fact that $t = 0$, leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{log}N_0 - \text{log}(1 - N_0) + Z &= 0 \\[0.50em]
\text{log}N_0 - \text{log}(1 - N_0) &= -Z \\[0.50em]
\frac{N_0}{1 - N_0} &= e^{-Z} \\[0.50em]
\frac{1 - N_0}{N_0} &= e^{Z} \enspace .
\end{aligned} %]]></script>
<p>Plugging this into our solution from above yields:</p>
<script type="math/tex; mode=display">N(t) = \frac{1}{1 + e^{-rt + Z}} = \frac{1}{1 + \frac{1 - N_0}{N_0} e^{-rt}} \enspace .</script>
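<p>To make the closed-form solution concrete, here is a minimal sketch that evaluates it in R and checks three properties: it recovers the initial condition, it approaches the stable fixed point, and it satisfies the differential equation. The function name <code>logistic_solution</code>, the initial condition $N_0 = 0.10$, and the check at $t = 2$ are assumptions chosen for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Closed-form solution of dN/dt = r N (1 - N) with N(0) = N0
logistic_solution <- function(time, N0, r = 1) {
  1 / (1 + (1 - N0) / N0 * exp(-r * time))
}

N0 <- 0.10  # assumed initial condition, for illustration only

# The solution recovers the initial condition and approaches N = 1
at_start <- logistic_solution(0, N0)
at_late_time <- logistic_solution(50, N0)

# It also satisfies the differential equation (checked by a central difference)
t_check <- 2
h <- 1e-6
lhs <- (logistic_solution(t_check + h, N0) - logistic_solution(t_check - h, N0)) / (2 * h)
rhs <- logistic_solution(t_check, N0) * (1 - logistic_solution(t_check, N0))
c(at_start = at_start, at_late_time = at_late_time, lhs = lhs, rhs = rhs)
</code></pre></figure>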
<p>While this was quite a hassle, other nonlinear differential equations are much, much harder to solve, and most do not admit a closed-form solution — or at least if they do, the resulting expression is generally not very intuitive. Luckily, we can compute the time-evolution of the system using numerical methods, as illustrated in the next section.</p>
<h2 id="numerical-solution">Numerical Solution</h2>
<p>A differential equation implicitly encodes how the system we model changes over time. Specifically, given a particular (potentially high-dimensional) state of the system at time point $t$, $\mathbf{x}_t$, we know in which direction and how quickly the system will change because this is exactly what is encoded in the differential equation $\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = f(\mathbf{x})$. This suggests the following numerical approximation: Assume we know the state of the system at a (discrete) time point $n$, denoted $\mathbf{x}_n$, and that the change in the system is constant over a small interval $\Delta t$. Then, the position of the system at time point $n + 1$ is given by:</p>
<script type="math/tex; mode=display">\mathbf{x}_{n + 1} = \mathbf{x}_n + \Delta t \cdot f(\mathbf{x}_n) \enspace .</script>
<p>$\Delta t$ is an important parameter, encoding over what time period we assume the change $f$ to be constant. We can code this up in R for the logistic equation:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_logistic</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N0</span><span class="p">,</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">N</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="n">N0</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">)</span><span class="w">
</span><span class="n">dN</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">N</span><span class="p">)</span><span class="w"> </span><span class="n">r</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">N</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">N</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># Euler</span><span class="w">
</span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dN</span><span class="p">(</span><span class="n">N</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">])</span><span class="w">
</span><span class="c1"># Improved Euler</span><span class="w">
</span><span class="c1"># k <- N[i-1] + delta_t * dN(N[i-1])</span><span class="w">
</span><span class="c1"># N[i] <- N[i-1] + 1 /2 * delta_t * (dN(N[i-1]) + dN(k))</span><span class="w">
</span><span class="c1"># Runge-Kutta 4th order</span><span class="w">
</span><span class="c1"># k1 <- dN(N[i-1]) * delta_t</span><span class="w">
</span><span class="c1"># k2 <- dN(N[i-1] + k1/2) * delta_t</span><span class="w">
</span><span class="c1"># k3 <- dN(N[i-1] + k2/2) * delta_t</span><span class="w">
</span><span class="c1"># k4 <- dN(N[i-1] + k3) * delta_t</span><span class="w">
</span><span class="c1">#</span><span class="w">
</span><span class="c1"># N[i] <- N[i-1] + 1/6 * (k1 + 2*k2 + 2*k3 + k4)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">N</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Clearly, the accuracy of this approximation is a function of $\Delta t$. To see how, the left panel of the figure below shows the approximation for various values of $\Delta t$, while the right panel shows the (log) absolute error at $t = 10$ as a function of (log) $\Delta t$. The error is defined as:</p>
<script type="math/tex; mode=display">E = |N(10) - \hat{N}(10)| \enspace ,</script>
<p>where $\hat{N}$ is the Euler approximation.</p>
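<p>To make this error definition concrete, the following sketch computes $E$ for a few step sizes and estimates the slope of $\text{log } E$ against $\text{log } \Delta t$; a slope close to one would indicate that $E$ is proportional to $\Delta t$. The helper <code>euler_at</code>, the initial condition, and the step sizes are assumptions made for illustration, and unlike <code>solve_logistic</code> the helper keeps only the final value and integrates exactly up to $t = 10$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Exact solution, used as the reference N(10)
exact <- function(time, N0, r = 1) 1 / (1 + (1 - N0) / N0 * exp(-r * time))

# Euler approximation of N(t_end); only the final value is kept
euler_at <- function(t_end, N0, delta_t, r = 1) {
  N <- N0
  for (i in seq_len(round(t_end / delta_t))) {
    N <- N + delta_t * r * N * (1 - N)
  }
  N
}

N0 <- 0.10                                # assumed initial condition
delta_ts <- c(0.02, 0.01, 0.005, 0.0025)  # assumed step sizes
errors <- sapply(delta_ts, function(dt) abs(exact(10, N0) - euler_at(10, N0, dt)))

# Slope of log E against log delta_t
slope <- unname(coef(lm(log(errors) ~ log(delta_ts)))[2])
rbind(delta_t = delta_ts, E = errors)
slope
</code></pre></figure>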
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>The points in the right panel fall approximately on a straight line with slope one, that is, for some constant $c$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{log } E &\approx \text{log } \Delta t + c \\[0.50em]
E &\propto \Delta t \enspace .
\end{aligned} %]]></script>
<p>Therefore, the error goes down linearly with $\Delta t$. Other methods, such as the improved Euler method or <a href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods">Runge-Kutta solvers</a> (see commented out code above) do better: the error of the improved Euler method shrinks with $\Delta t^2$ and that of the fourth-order Runge-Kutta method with $\Delta t^4$. However, it is ill-advised to choose $\Delta t$ extremely small, because this increases computation time and, since every step incurs a small floating-point round-off error, can lead to numerical errors that accumulate over time.</p>
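<p>The difference between the methods can be illustrated with a small, self-contained sketch that applies the Euler step and the fourth-order Runge-Kutta step from the commented-out code to the logistic equation and compares their errors at $t = 10$. The helper names, the initial condition, and the deliberately coarse step size are assumptions chosen for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">f <- function(N, r = 1) r * N * (1 - N)
exact <- function(time, N0, r = 1) 1 / (1 + (1 - N0) / N0 * exp(-r * time))

# One step of each method
step_euler <- function(N, delta_t) N + delta_t * f(N)
step_rk4 <- function(N, delta_t) {
  k1 <- f(N) * delta_t
  k2 <- f(N + k1 / 2) * delta_t
  k3 <- f(N + k2 / 2) * delta_t
  k4 <- f(N + k3) * delta_t
  N + (k1 + 2 * k2 + 2 * k3 + k4) / 6
}

# Apply a step function repeatedly until t_end
integrate_to <- function(step, t_end, N0, delta_t) {
  N <- N0
  for (i in seq_len(round(t_end / delta_t))) N <- step(N, delta_t)
  N
}

N0 <- 0.10       # assumed initial condition
delta_t <- 0.1   # assumed, deliberately coarse step size
errs <- c(
  euler = abs(exact(10, N0) - integrate_to(step_euler, 10, N0, delta_t)),
  rk4 = abs(exact(10, N0) - integrate_to(step_rk4, 10, N0, delta_t))
)
errs
</code></pre></figure>
<p>Even with this coarse step size, the Runge-Kutta error should be several orders of magnitude smaller than the Euler error, at the cost of four evaluations of $f$ per step instead of one.</p>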
<p>In summary, we have seen that nonlinear differential equations can model interesting behaviour such as multiple fixed points; how to classify the stability of these fixed points using linear stability analysis; and how to numerically solve nonlinear differential equations. In the remainder of this post, we study coupled nonlinear differential equations — the SIR and SIRS models — as a way to model the spread of infectious diseases.</p>
<h1 id="modeling-infectious-diseases">Modeling Infectious Diseases</h1>
<p>Many models have been proposed as tools to understand epidemics. In the following sections, I focus on the two simplest ones: the SIR and the SIRS model (see also Hirsch, Smale, and Devaney, 2013, ch. 11).</p>
<h2 id="the-sir-model">The SIR Model</h2>
<p>We use the SIR model to understand the spread of infectious diseases. The SIR model is the most basic <em>compartmental</em> model, meaning that it groups the overall population into distinct sub-populations: a susceptible population $S$, an infected population $I$, and a recovered population $R$. We make a number of further simplifying assumptions. First, we assume that the overall population is $1 = S + I + R$ so that $S$, $I$, and $R$ are proportions. We further assume that the overall population does not change, that is,</p>
<script type="math/tex; mode=display">\frac{d}{dt} \left(S + I + R\right) = 0 \enspace .</script>
<p>Second, the SIR model assumes that once a person has been infected and has recovered, the person cannot become infected again — we will relax this assumption later on. Third, the model assumes that the rate of transmission of the disease is proportional to the number of encounters between susceptible and infected persons. We model this by setting</p>
<script type="math/tex; mode=display">\frac{dS}{dt} = - \beta IS \enspace ,</script>
<p>where $\beta > 0$ is the rate of infection. Fourth, the model assumes that the growth of the recovered population is proportional to the proportion of people that are infected, that is,</p>
<script type="math/tex; mode=display">\frac{dR}{dt} = \gamma I \enspace ,</script>
<p>where $\gamma > 0$ is the recovery rate. Since the overall population is constant, these two equations naturally lead to the following equation for the infected:</p>
<script type="math/tex; mode=display">\begin{aligned}
\frac{d}{dt} \left(S + I + R\right) = 0 \\[0.50em]
\frac{dI}{dt} = - \frac{dS}{dt} - \frac{dR}{dt} \\[0.50em]
\frac{dI}{dt} = \beta IS - \gamma I \enspace ,
\end{aligned}</script>
<p>where $\beta I S$ gives the proportion of newly infected individuals per unit of time and $\gamma I$ gives the proportion of newly recovered individuals per unit of time. Observe that since we assumed that the overall population does not change, we only need to focus on two of these subgroups, since $R(t) = 1 - S(t) - I(t)$. The system is therefore fully characterized by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dS}{dt} &= - \beta IS \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace .
\end{aligned} %]]></script>
<p>Before we analyze this model mathematically, let’s implement Euler’s method and visualize some trajectories.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="w">
</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">dimnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Time'</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># Row 1 holds the initial condition at time 0</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dS</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
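</span><span class="c1"># brewer.pal() is provided by the RColorBrewer package</span><span class="w">
</span><span class="n">library</span><span class="p">(</span><span class="s1">'RColorBrewer'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">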
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s1">'Set1'</span><span class="p">)</span><span class="w">
</span><span class="n">time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="w">
</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Subpopulations'</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Days'</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)),</span><span class="w">
</span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.75</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w">
</span><span class="n">xaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">yaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="w">
</span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">0.65</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">),</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$ and an increasing rate of infection $\beta$ for the initial condition $S_0 = 0.95$, $I_0 = 0.05$, and $R_0 = 0$. We take a time step $\Delta t = 1$ to denote one day. (Unfortunately, epidemics take much longer in real life.)</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>For $\beta = 1/8$, no outbreak occurs (left panel). Instead, the proportions of susceptible and infected people decrease monotonically while the proportion of recovered people increases monotonically. The middle panel, on the other hand, shows a small outbreak: the proportion of infected people rises, but then falls again. The right panel shows an outbreak as well, but a more severe one, as the proportion of infected people rises more sharply before it eventually decreases again.</p>
<p>How do things change when we change the recovery rate $\gamma$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller recovery rate $\gamma = 1/12$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>We again observe no outbreak in the left panel, and outbreaks of increasing severity in the middle and the right panel. Compared to the results for $\gamma = 1/8$, the outbreaks are more severe, as we would expect since the recovery rate $\gamma = 1/12$ is now lower. In fact, whether an outbreak occurs and how severe it will be depends not on $\beta$ and $\gamma$ alone, but on their ratio. This ratio is known as $R_0 = \beta / \gamma$, pronounced “R-naught”. (Note the unfortunate clash with well-established notation in this context, as $R_0$ also denotes the initial proportion of recovered people; it should be clear from the context which one is meant, however.) We can think of $R_0$ as the average number of people an infected person will infect before she gets better (assuming a population that is fully susceptible). If $R_0 > 1$, an outbreak occurs in such a population; more generally, since $\frac{dI}{dt} = I(\beta S - \gamma)$, the proportion of infected people initially grows whenever $R_0 S_0 > 1$, where $S_0$ is the initial proportion of susceptible people. In the next section, we look for the fixed points of this system and assess their stability.</p>
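<p>The threshold behaviour described above can be checked numerically. The following self-contained sketch integrates the SIR model with Euler's method for a fixed $\gamma = 1/8$ and records the peak proportion of infected people. The helper <code>peak_infected</code>, the time horizon, and the two larger values of $\beta$ are assumptions chosen for illustration; they are not necessarily the values behind the figures.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Compact Euler integration of the SIR model; returns the peak proportion infected
peak_infected <- function(beta, gamma, S0 = 0.95, I0 = 0.05,
                          delta_t = 0.01, t_end = 200) {
  S <- S0
  I <- I0
  peak <- I0
  for (i in seq_len(round(t_end / delta_t))) {
    new_infections <- beta * I * S
    recoveries <- gamma * I
    S <- S - delta_t * new_infections
    I <- I + delta_t * (new_infections - recoveries)
    peak <- max(peak, I)
  }
  peak
}

S0 <- 0.95
gamma <- 1 / 8
betas <- c(1 / 8, 2 / 8, 3 / 8)  # assumed values, giving R0 = 1, 2, 3

outbreaks <- data.frame(
  R0 = betas / gamma,
  grows_initially = betas * S0 > gamma,  # sign of dI/dt at t = 0
  peak_I = sapply(betas, peak_infected, gamma = gamma)
)
outbreaks
</code></pre></figure>
<p>When the proportion of infected people does not grow initially, the peak is simply the initial value $I_0 = 0.05$; otherwise the peak should increase with $R_0$.</p>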
<h2 id="analyzing-fixed-points">Analyzing Fixed Points</h2>
<p>A glance at the above figures suggests that the SIR model allows for multiple stable states. The left panels, for example, show that if there is no outbreak, the proportion of susceptible people stays above the proportion of recovered people. If there is an outbreak, however, then it always fades and the proportion of recovered people will be higher than the proportion of susceptible people; how much higher depends on the severity of the outbreak.</p>
<p>While we could play around some more with visualisations, it pays to do a formal analysis. Note that in contrast to the logistic equation, which only modelled a single variable — population size — an analysis of the SIR model requires us to handle two variables, $S$ and $I$; the third one, $R$, follows from the assumption of a constant population size. At the fixed points, nothing changes, that is, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= - \beta IS \\[0.50em]
0 &= \beta IS - \gamma I \enspace .
\end{aligned} %]]></script>
<p>This can only happen when $I = 0$, irrespective of the value of $S$: the first equation requires $I = 0$ or $S = 0$, and if $S = 0$, the second equation reduces to $-\gamma I = 0$, which again forces $I = 0$ because $\gamma > 0$. In other words, all $(S^{\star}, I^{\star}) = (S, 0)$ are fixed points; if nobody is infected, the disease cannot spread, and so everybody stays either susceptible or recovered. To assess the stability of these fixed points, we again derive a differential equation for the perturbations close to the fixed point. However, note that in contrast to the one-dimensional case studied above, perturbations can now be with respect to $I$ or to $S$. Let $u = S - S^{\star}$ and $v = I - I^{\star}$ be the respective perturbations, and let $\dot{S} = f(S, I)$ and $\dot{I} = g(S, I)$. We first derive a differential equation for $u$, writing:</p>
<script type="math/tex; mode=display">\dot{u} = \frac{d}{dt}\left(S - S^{\star}\right) = \dot{S} \enspace ,</script>
<p>since $S^{\star}$ is a constant. This implies that $u$ behaves as $S$. In contrast to the one-dimensional case above, we have two <em>coupled</em> differential equations, and so we have to take into account how $u$ changes as a function of both $S$ and $I$. We Taylor expand at the fixed point $(S^{\star}, I^{\star})$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\dot{u} &= f(u + S^{\star}, v + I^{\star}) \\[0.50em]
&= f(S^{\star}, I^{\star}) + u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} + \mathcal{O}(u^2, v^2, uv) \\[0.50em]
&\approx u \frac{\partial f}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial f}{\partial I}_{(S^{\star}, I^{\star})} \enspace ,
\end{aligned} %]]></script>
<p>since $f(S^{\star}, I^{\star}) = 0$ and we drop higher-order terms. Note that taking the partial derivative of $f$ with respect to $S$ (or $I$) yields a function, and the subscripts $(S^{\star}, I^{\star})$ mean that we evaluate this function at the fixed point $(S^{\star}, I^{\star})$. We can similarly derive a differential equation for $v$:</p>
<script type="math/tex; mode=display">\dot{v} \approx u \frac{\partial g}{\partial S}_{(S^{\star}, I^{\star})} + v \frac{\partial g}{\partial I}_{(S^{\star}, I^{\star})} \enspace .</script>
<p>We can write all of this concisely using matrix algebra:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix}
\dot{u} \\
\dot{v}
\end{pmatrix} =
\begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})}
\begin{pmatrix}
u \\
v
\end{pmatrix} \enspace , %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
J = \begin{pmatrix}
\frac{\partial f}{\partial S} & \frac{\partial f}{\partial I} \\
\frac{\partial g}{\partial S} & \frac{\partial g}{\partial I}
\end{pmatrix}_{(S^{\star}, I^{\star})} %]]></script>
<p>is called the <em>Jacobian matrix</em> at the fixed point $(S^{\star}, I^{\star})$. The Jacobian gives the linearized dynamics close to a fixed point, and therefore tells us how perturbations will evolve close to a fixed point.</p>
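<p>To make the idea of linearization concrete, the following is a minimal sketch that approximates the Jacobian by central differences and compares the linear prediction with the true rates of change for a small perturbation. The parameter values, the helper <code>numeric_jacobian</code>, and the chosen fixed point are illustrative assumptions, not part of the analysis above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># SIR right-hand sides; the parameter values are illustrative assumptions
beta <- 3/8
gamma <- 1/8
f <- function(S, I) -beta * I * S
g <- function(S, I) beta * I * S - gamma * I

# Central-difference approximation of the Jacobian at (S, I) (hypothetical helper)
numeric_jacobian <- function(S, I, h = 1e-6) {
  rbind(
    c(f(S + h, I) - f(S - h, I), f(S, I + h) - f(S, I - h)),
    c(g(S + h, I) - g(S - h, I), g(S, I + h) - g(S, I - h))
  ) / (2 * h)
}

# Linearize around the fixed point (S*, I*) = (0.5, 0)
J_num <- numeric_jacobian(S = 0.5, I = 0)

# Compare the linear prediction with the true rates for a small perturbation
uv <- c(u = 0.01, v = 0.01)
linear <- as.vector(J_num %*% uv)
exact <- c(f(0.5 + uv[['u']], uv[['v']]), g(0.5 + uv[['u']], uv[['v']]))
gap <- max(abs(linear - exact))</code></pre></figure>
<p>The gap between the two is of order $uv$, which is exactly the kind of higher-order term that the Taylor expansion drops.</p>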
<p>In contrast to unidimensional systems, where we simply check whether the slope is positive or negative, that is, whether $f'(x^\star) < 0$ or $f'(x^\star) > 0$, the test for whether a fixed point is stable is slightly more complicated in multidimensional settings. In fact, and not surprisingly, since we have <em>linearized</em> this nonlinear differential equation, the check is the same as in <a href="https://fabiandablander.com/r/Linear-Love.html">linear systems</a>: we compute the eigenvalues $\lambda_1$ and $\lambda_2$ of $J$, observing that negative eigenvalues mean exponential decay and positive eigenvalues mean exponential growth along the directions of the respective eigenvectors. (Note that this does not work for all types of fixed points, see Strogatz (2015, p. 152).)</p>
<p>What does this mean for our SIR model? First, let’s derive the Jacobian:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
J &= \begin{pmatrix}
-\frac{\partial}{\partial S} \beta I S & -\frac{\partial }{\partial I} \beta I S \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
& =
\begin{pmatrix}
-\beta I & -\beta S \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>Evaluating this at the fixed point $(S^{\star}, I^{\star}) = (S, 0)$ results in:</p>
<script type="math/tex; mode=display">% <![CDATA[
J_{(S, 0)} = \begin{pmatrix} 0 & -\beta S \\ 0 & \beta S - \gamma \end{pmatrix} \enspace . %]]></script>
<p>Since this matrix is upper triangular — all entries below the diagonal are zero — the eigenvalues are given by the diagonal, that is, $\lambda_1 = 0$ and $\lambda_2 = \beta S - \gamma$. $\lambda_1 = 0$ implies a constant solution, while $\lambda_2 > 0$ implies exponential growth and $\lambda_2 < 0$ exponential decay of the perturbations close to the fixed point. Observe that $\lambda_2$ is not only a function of the parameters $\beta$ and $\gamma$, but also of the proportion of susceptible individuals $S$. We find that $\lambda_2 > 0$ for $S > \gamma / \beta$, which results in an unstable fixed point. On the other hand, we have that $\lambda_2 < 0$ for $S < \gamma / \beta$, which results in a stable fixed point. In the next section, we will use vector fields in order to get more intuition for the dynamics of the system.</p>
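<p>The eigenvalue check can also be reproduced numerically. The sketch below builds the Jacobian at a fixed point $(S, 0)$ and passes it to base R's <code>eigen</code>; the helper name <code>jacobian_SIR</code> and the values $\beta = 3/8$, $\gamma = 1/8$, $S = 0.5$, and $S = 0.2$ are assumptions chosen for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Jacobian of the SIR model evaluated at a fixed point (S, 0) (hypothetical helper)
jacobian_SIR <- function(S, beta, gamma) {
  matrix(
    c(0, -beta * S,
      0, beta * S - gamma),
    nrow = 2, byrow = TRUE
  )
}

beta <- 3/8
gamma <- 1/8

# S above the threshold gamma / beta = 1/3: the non-zero eigenvalue is positive
ev_above <- eigen(jacobian_SIR(0.5, beta, gamma))$values

# S below the threshold: the non-zero eigenvalue is negative
ev_below <- eigen(jacobian_SIR(0.2, beta, gamma))$values</code></pre></figure>
<p>In both cases one eigenvalue is zero, reflecting the fact that moving along the line $I = 0$ simply leads to another fixed point.</p>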
<h2 id="vector-field-and-nullclines">Vector Field and Nullclines</h2>
<p>A vector field shows for any position $(S, I)$ in which direction the system moves, which we indicate by the head of an arrow, and how quickly, which we indicate by the length of an arrow. We use the R code below to visualize such a vector field and selected trajectories on it.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'fields'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_vectorfield_SIR</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">))</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">SI</span><span class="p">[</span><span class="n">apply</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="c1"># S + I <= 1 must hold</span><span class="w">
</span><span class="n">dSI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">dS</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w">
</span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.2</span><span class="p">,</span><span class="w"> </span><span class="m">1.2</span><span class="p">),</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-0.1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">arrow.plot</span><span class="p">(</span><span class="w">
</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w">
</span><span class="n">arrow.ex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.075</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.05</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray82'</span><span class="p">,</span><span class="w"> </span><span class="n">xpd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">cx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="n">cn</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">1.05</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="m">-.075</span><span class="p">,</span><span class="w"> </span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cn</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.05</span><span class="p">,</span><span class="w"> </span><span class="m">0.5</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cn</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.03</span><span class="p">,</span><span class="w"> </span><span class="m">-.04</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.03</span><span class="p">,</span><span class="w"> </span><span class="m">.975</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">0.995</span><span class="p">,</span><span class="w"> </span><span class="m">-0.04</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cx</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>For $\beta = 1/8$ and $\gamma = 1/8$, we know from above that no outbreak occurs. The vector field shown in the left panel below further illustrates that, since $S \leq \gamma / \beta = 1$, all fixed points $(S^{\star}, I^{\star}) = (S, 0)$ are stable. In contrast, we know that $\beta = 3/8$ and $\gamma = 1/8$ result in an outbreak. The vector field shown in the right panel below indicates that fixed points with $S > \gamma / \beta = 1/3$ are unstable, while fixed points with $S < 1/3$ are stable; the dashed line is $S = 1/3$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
<p>Can we find some structure in such vector fields? One way to “organize” them is by drawing so-called <em>nullclines</em>. In our case, the $I$-nullcline gives the set of points for which $\dot{I} = 0$, and the $S$-nullcline gives the set of points for which $\dot{S} = 0$. We find these points in a similar manner to finding fixed points, but instead of setting both $\dot{S}$ and $\dot{I}$ to zero, we tackle them one at a time.</p>
<p>The $S$-nullclines are given by the $S$- and the $I$-axes, because $\dot{S} = 0$ when $S = 0$ or when $I = 0$. Along the $I$-axis we have $\dot{I} = - \gamma I$ since $S = 0$, resulting in exponential decay of the infected population; this is indicated by the grey arrows along the $I$-axis, which become progressively shorter the closer they get to the origin.</p>
<p>The $I$-nullclines are given by $I = 0$ and by $S = \gamma / \beta$. For $I = 0$, we also have $\dot{S} = 0$, and so these points are fixed points. For $S = \gamma / \beta$ we have $\dot{S} = - \gamma I$, so the proportion of susceptible people decreases, but since $\dot{I} = 0$, the proportion of infected people does not change; this is indicated in the right vector field above, where we have horizontal arrows at the dashed line given by $S = \gamma / \beta$. However, this only holds for the briefest of moments, since $S$ decreases and for $S < \gamma / \beta$ we again have $\dot{I} < 0$, and so the proportion of infected people goes down to the left of the line. Similarly, to the right of the line we have $S > \gamma / \beta$, which results in $\dot{I} > 0$, and so the proportion of infected people grows.</p>
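<p>The sign change of $\dot{I}$ across the vertical nullcline can be checked directly. The following sketch evaluates both rates of change slightly to the left of, on, and slightly to the right of $S = \gamma / \beta$; the offset of $0.1$ and the value $I = 0.1$ are arbitrary choices made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">beta <- 3/8
gamma <- 1/8
dS <- function(S, I) -beta * I * S
dI <- function(S, I) beta * I * S - gamma * I

S_null <- gamma / beta  # vertical I-nullcline at S = 1/3
I <- 0.1                # an arbitrary positive proportion of infected people

# Rates of change to the left of, on, and to the right of the nullcline
S_vals <- c(left = S_null - 0.1, on = S_null, right = S_null + 0.1)
flow <- cbind(dS = dS(S_vals, I), dI = dI(S_vals, I))
rownames(flow) <- names(S_vals)</code></pre></figure>
<p>In the resulting matrix, the <code>dS</code> column is negative in every row, while the <code>dI</code> column is negative to the left of the nullcline, zero (up to rounding) on it, and positive to the right.</p>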
<p>In summary, we have seen how the SIR model allows for outbreaks whenever the rate of infection is higher than the rate of recovery, that is, $R_0 = \beta / \gamma > 1$. If this occurs, then we have a growing proportion of infected people while $S > \gamma / \beta$. As illustrated by the vector field, the proportion of susceptible people $S$ decreases over time. At some point, therefore, we have that $S < \gamma / \beta$, resulting in a decrease in the proportion of infected people until finally $I = 0$. Observe that, in the SIR model, infections always die out. In the next section, we extend the SIR model to allow for diseases to become established in the population.</p>
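<p>This summary can be checked with a short simulation. The sketch below is a self-contained forward Euler loop for the two SIR equations; the initial conditions, step size, and number of steps are assumptions chosen for illustration. It records the value of $S$ at the moment the proportion of infected people peaks.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">beta <- 3/8
gamma <- 1/8
delta_t <- 0.01
steps <- 20000

S <- numeric(steps)
I <- numeric(steps)
S[1] <- 0.95  # hypothetical initial conditions
I[1] <- 0.05

# Forward Euler updates for the two SIR equations
for (i in seq(2, steps)) {
  new_infections <- beta * I[i - 1] * S[i - 1]
  S[i] <- S[i - 1] - delta_t * new_infections
  I[i] <- I[i - 1] + delta_t * (new_infections - gamma * I[i - 1])
}

peak <- which.max(I)
S_at_peak <- S[peak]  # should lie close to gamma / beta = 1/3
I_final <- I[steps]   # should be close to zero</code></pre></figure>
<p>With these settings, the infection peaks when $S$ crosses $\gamma / \beta$ and then dies out, in line with the vector field.</p>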
<h2 id="the-sirs-model">The SIRS Model</h2>
<p>The SIR model assumes that people who have recovered from the disease are immune to it forever, and so any disease occurs only once and then never comes back. More interesting dynamics occur when we allow for the reinfection of recovered people; we can then ask, for example, under what circumstances the disease becomes established in the population. The SIRS model extends the SIR model, allowing the recovered population to become susceptible again (hence the extra ‘S’). It assumes that the susceptible population increases in proportion to the recovered population such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \\[0.50em]
\frac{dR}{dt} &= \gamma I - \mu R\enspace ,
\end{aligned} %]]></script>
<p>where, since we added $\mu R$ to the change in the proportion of susceptible people, we had to subtract $\mu R$ from the change in the proportion of recovered people. We again make the simplifying assumption that the overall population does not change, and so it suffices to study the following system:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{dS}{dt} &= - \beta IS + \mu R \\[0.50em]
\frac{dI}{dt} &= \beta IS - \gamma I \enspace ,
\end{aligned} %]]></script>
<p>since $R(t) = 1 - S(t) - I(t)$. We adjust our implementation of Euler’s method:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="w">
</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.01</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="w">
</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="n">dimnames</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="kc">NULL</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">,</span><span class="w"> </span><span class="s1">'Time'</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">S0</span><span class="p">,</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S0</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I0</span><span class="p">,</span><span class="w"> </span><span class="n">delta_t</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">R</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">R</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dS</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">delta_t</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">i</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">delta_t</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">plot_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">res</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cols</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">brewer.pal</span><span class="p">(</span><span class="m">3</span><span class="p">,</span><span class="w"> </span><span class="s1">'Set1'</span><span class="p">)</span><span class="w">
</span><span class="n">time</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">4</span><span class="p">]</span><span class="w">
</span><span class="n">matplot</span><span class="p">(</span><span class="w">
</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="n">res</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="o">:</span><span class="m">3</span><span class="p">],</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'l'</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w">
</span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Subpopulations'</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Days'</span><span class="p">,</span><span class="w"> </span><span class="n">ylim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w">
</span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.75</span><span class="p">,</span><span class="w"> </span><span class="n">cex.lab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w">
</span><span class="n">font.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">xaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">yaxs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'i'</span><span class="p">,</span><span class="w"> </span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tail</span><span class="p">(</span><span class="n">time</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">axis</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">las</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">cex.axis</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="n">legend</span><span class="p">(</span><span class="w">
</span><span class="m">30</span><span class="p">,</span><span class="w"> </span><span class="m">0.95</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">cols</span><span class="p">,</span><span class="w"> </span><span class="n">legend</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'S'</span><span class="p">,</span><span class="w"> </span><span class="s1">'I'</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">),</span><span class="w">
</span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">bty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below shows trajectories for a fixed recovery rate of $\gamma = 1/8$, a fixed reinfection rate of $\mu = 1/8$, and an increasing rate of infection $\beta$ for the initial condition $S_0 = 0.95$, $I_0 = 0.05$, and $R_0 = 0$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>As for the SIR model, we again find that no outbreak occurs for $R_0 = \beta / \gamma < 1$, which is the case for the left panel. Most interestingly, however, we find that the proportion of infected people <em>does not</em>, in contrast to the SIR model, decrease to zero for the other panels. Instead, the disease becomes established in the population when $R_0 > 1$, and the middle and the right panel show different fixed points.</p>
<p>How do things change when we vary the reinfection rate $\mu$? The figure below shows again three cases of trajectories for the same initial condition, but for a smaller reinfection rate $\mu$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>We again find no outbreak in the left panel, and outbreaks of increasing severity in the middle and right panels. Both of these outbreaks are less severe than those in the previous figure, as we would expect given a decrease in the reinfection rate. As before, the system seems to stabilize at different fixed points. In the next section, we provide a more formal analysis of the fixed points and their stability.</p>
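<p>To make this comparison concrete, the following is a minimal, self-contained sketch of Euler's method for the SIRS model. It is not the code behind the figures: the function name <code>simulate_sirs</code>, the step size, the time horizon, and the smaller reinfection rate $\mu = 1/50$ are assumptions chosen purely for illustration.</p>

```r
# Euler integration of the SIRS model (illustrative sketch, not the post's solver).
# beta: infection rate, gamma: recovery rate, mu: reinfection rate.
simulate_sirs <- function(beta, gamma, mu, S0 = 0.95, I0 = 0.05,
                          delta_t = 0.01, days = 1000) {
  n_steps <- round(days / delta_t)
  S <- S0
  I <- I0
  for (step in seq_len(n_steps)) {
    R <- 1 - S - I                      # the population size is constant
    dS <- -beta * I * S + mu * R
    dI <- beta * I * S - gamma * I
    S <- S + delta_t * dS
    I <- I + delta_t * dI
  }
  c(S = S, I = I, R = 1 - S - I)
}

# Same infection and recovery rates, two reinfection rates.
# mu = 1/50 is a hypothetical "smaller" value, not taken from the post.
end_large_mu <- simulate_sirs(beta = 3/8, gamma = 1/8, mu = 1/8)
end_small_mu <- simulate_sirs(beta = 3/8, gamma = 1/8, mu = 1/50)

print(round(end_large_mu, 3))
print(round(end_small_mu, 3))
```

<p>Under these assumptions, the proportion of infected people settles at a positive value in both runs, and at a lower value for the smaller reinfection rate.</p>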
<h2 id="analyzing-fixed-points-1">Analyzing Fixed Points</h2>
<p>To find the fixed points of the SIRS model, we again seek solutions for which:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= - \beta IS + \mu (1 - S - I) \\[0.50em]
0 &= \beta IS - \gamma I \enspace ,
\end{aligned} %]]></script>
<p>where we have substituted $R = 1 - S - I$ and from which it follows that also $\dot{R} = 0$ since we assume that the overall population does not change. We immediately see that, in contrast to the SIR model, $I = 0$ cannot be a fixed point for <em>any</em> $S$ because of the added term which depends on $\mu$. Instead, it is a fixed point only for $S = 1$. To get the other fixed point, note that the last equation gives $S = \gamma / \beta$, which plugged into the first equation yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= -I\gamma + \mu\left(1 - \frac{\gamma}{\beta} - I\right) \\[0.50em]
I\gamma &= \mu\left(1 - \frac{\gamma}{\beta}\right) - \mu I \\[0.50em]
I(\gamma + \mu) &= \mu\left(1 - \frac{\gamma}{\beta}\right) \\[0.50em]
I &= \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} \enspace .
\end{aligned} %]]></script>
<p>Therefore, the fixed points are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(S^{\star}, I^{\star}) &= (1, 0) \\[0.50em]
(S^{\star}, I^{\star}) &= \left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \enspace .
\end{aligned} %]]></script>
<p>Note that the second fixed point does not exist when $\gamma / \beta > 1$, since the proportion of infected people cannot be negative. Another, more intuitive perspective on this is to write $\gamma / \beta > 1$ as $R_0 = \beta / \gamma < 1$. This allows us to see that the second fixed point, which would have a non-zero proportion of infected people in the population, does not exist when $R_0 < 1$, as then no outbreak occurs. We will come back to this in a moment.</p>
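<p>As a quick sanity check, the sketch below evaluates the closed-form fixed points and plugs them back into the right-hand side of the system. The function names and the labels <code>disease_free</code> and <code>endemic</code> are my own shorthand for the two fixed points, and the parameter values are just one example.</p>

```r
# Right-hand side of the reduced SIRS system (R = 1 - S - I substituted).
sirs_rhs <- function(S, I, beta, gamma, mu) {
  c(dS = -beta * I * S + mu * (1 - S - I),
    dI = beta * I * S - gamma * I)
}

# Fixed points from the closed-form expressions above.
# The second ("endemic") one is returned only if gamma / beta < 1.
sirs_fixed_points <- function(beta, gamma, mu) {
  points <- list(disease_free = c(S = 1, I = 0))
  if (gamma / beta < 1) {
    points$endemic <- c(
      S = gamma / beta,
      I = mu * (1 - gamma / beta) / (gamma + mu)
    )
  }
  points
}

beta <- 3/8; gamma <- 1/8; mu <- 1/8
fps <- sirs_fixed_points(beta, gamma, mu)

# Both derivatives should be (numerically) zero at every fixed point.
for (name in names(fps)) {
  fp <- fps[[name]]
  rhs <- sirs_rhs(fp[["S"]], fp[["I"]], beta, gamma, mu)
  cat(name, ": S* =", fp[["S"]], ", I* =", fp[["I"]],
      ", max |derivative| =", max(abs(rhs)), "\n")
}
```

<p>Calling <code>sirs_fixed_points</code> with $\beta < \gamma$ returns only the fixed point $(1, 0)$, in line with the existence condition discussed above.</p>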
<p>To assess the stability of the fixed points, we derive the Jacobian matrix for the SIRS model:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
J &= \begin{pmatrix}
\frac{\partial}{\partial S} \left(-\beta I S + \mu(1 - S - I)\right) & \frac{\partial }{\partial I} \left(-\beta I S + \mu(1 - S - I)\right) \\
\frac{\partial}{\partial S} \left(\beta I S - \gamma I\right) & \frac{\partial}{\partial I} \left(\beta I S - \gamma I\right) \\[0.5em]
\end{pmatrix} \\[1em]
&=
\begin{pmatrix}
-\beta I - \mu & -\beta S - \mu \\
\beta I & \beta S - \gamma
\end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>For the fixed point $(S^{\star}, I^{\star}) = (1, 0)$ we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
J_{(1, 0)} = \begin{pmatrix}
- \mu & -\beta - \mu \\
0 & \beta - \gamma
\end{pmatrix} \enspace , %]]></script>
<p>which is again upper-triangular and therefore has eigenvalues $\lambda_1 = -\mu$ and $\lambda_2 = \beta - \gamma$. This means it is unstable whenever $\beta > \gamma$ since then $\lambda_2 > 0$, and any infected individual spreads the disease. The Jacobian at the second fixed point is:</p>
<script type="math/tex; mode=display">% <![CDATA[
J_{\left(\frac{\gamma}{\beta}, \frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right)} = \begin{pmatrix}
-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu & -\gamma - \mu \\
\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} & 0
\end{pmatrix} \enspace , %]]></script>
<p>which is more daunting; note that the lower-right entry vanishes because $\beta S^{\star} - \gamma = \gamma - \gamma = 0$. However, we know from the previous blog post that to classify the stability of the fixed point, it suffices to look at the trace $\tau$ and determinant $\Delta$ of the Jacobian, which are given by</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\tau &= -\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu \\[0.50em]
\Delta &= \left(-\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu} - \mu\right) \cdot 0 - \left(- \gamma - \mu\right)\left(\beta\frac{\mu\left(1 - \frac{\gamma}{\beta}\right)}{\gamma + \mu}\right) \\[0.50em]
&= \beta\mu\left(1 - \frac{\gamma}{\beta}\right) = \mu\left(\beta - \gamma\right) \enspace .
\end{aligned} %]]></script>
<p>The trace can be written as $\tau = \lambda_1 + \lambda_2$ and the determinant can be written as $\Delta = \lambda_1 \lambda_2$, as shown in a <a href="https://fabiandablander.com/r/Linear-Love.html">previous blog post</a>. For $\beta > \gamma$, where the second fixed point is distinct from the first, we have that $\tau < 0$ because both terms above are negative, and $\Delta > 0$ because $\mu(\beta - \gamma) > 0$. This constrains $\lambda_1$ and $\lambda_2$ to have negative real parts, and thus the fixed point is stable.</p>
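<p>The stability classification can also be checked numerically. The sketch below builds the Jacobian from its general form and asks base R's <code>eigen</code> for the eigenvalues at both fixed points. The helper names <code>sirs_jacobian</code> and <code>is_stable</code> are hypothetical, and the parameter values are one example with $\beta > \gamma$.</p>

```r
# Jacobian of the reduced SIRS system at a point (S, I).
sirs_jacobian <- function(S, I, beta, gamma, mu) {
  matrix(c(-beta * I - mu, -beta * S - mu,
            beta * I,       beta * S - gamma),
         nrow = 2, byrow = TRUE)
}

beta <- 3/8; gamma <- 1/8; mu <- 1/8
S_star <- gamma / beta
I_star <- mu * (1 - gamma / beta) / (gamma + mu)

J_free    <- sirs_jacobian(1, 0, beta, gamma, mu)
J_endemic <- sirs_jacobian(S_star, I_star, beta, gamma, mu)

# A fixed point is stable if all eigenvalues have negative real part.
# Re() is needed because the eigenvalues may be complex.
is_stable <- function(J) all(Re(eigen(J)$values) < 0)

cat("(1, 0) stable:            ", is_stable(J_free), "\n")
cat("second fixed point stable:", is_stable(J_endemic), "\n")
cat("trace:", sum(diag(J_endemic)), " determinant:", det(J_endemic), "\n")
```

<p>Working with the real parts of the eigenvalues keeps the check valid when the eigenvalues form a complex conjugate pair, in which case trajectories spiral into the fixed point.</p>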
<h2 id="vector-fields-and-nullclines">Vector Fields and Nullclines</h2>
<p>As previously done for the SIR model, we can again visualize the directions in which the system changes at any point using a vector field.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_vectorfield_SIRS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">gamma</span><span class="p">,</span><span class="w"> </span><span class="n">mu</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">I</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0.05</span><span class="p">)</span><span class="w">
</span><span class="n">dS</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="o">-</span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">mu</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w">
</span><span class="n">dI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">)</span><span class="w"> </span><span class="n">beta</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">gamma</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">I</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">S</span><span class="p">,</span><span class="w"> </span><span class="n">I</span><span class="p">))</span><span class="w">
</span><span class="n">SI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">SI</span><span class="p">[</span><span class="n">apply</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w"> </span><span class="o"><=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="c1"># S + I <= 1 must hold</span><span class="w">
</span><span class="n">dSI</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="n">dS</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]),</span><span class="w"> </span><span class="n">dI</span><span class="p">(</span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">SI</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">]))</span><span class="w">
</span><span class="n">draw_vectorfield</span><span class="p">(</span><span class="n">SI</span><span class="p">,</span><span class="w"> </span><span class="n">dSI</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below visualizes the vector field for the SIRS model, several trajectories, and the nullclines for $\gamma = 1/8$ and $\mu = 1/8$ for $\beta = 1/8$ (left panel) and $\beta = 3/8$ (right panel). The left panel shows that there exists only one stable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$ to which all trajectories converge.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>The right panel, on the other hand, shows <em>two</em> fixed points: one unstable fixed point at $(S^{\star}, I^{\star}) = (1, 0)$, which we only reach when $I_0 = 0$, and a stable one at</p>
<script type="math/tex; mode=display">(S^{\star}, I^{\star}) = \left(\frac{1/8}{3/8}, \frac{1/8\left(1 - \frac{1/8}{3/8}\right)}{1/8 + 1/8}\right) = (1/3, 1/3) \enspace .</script>
<p>In contrast to the SIR model, therefore, there exists a stable fixed point constituting a population which includes infected people, and so the disease is not eradicated but stays in the population.</p>
<p>The dashed lines give the nullclines. The $I$-nullcline gives the set of points where $\dot{I} = 0$, which are — as in the SIR model above — given by $I = 0$ and $S = \gamma / \beta$. The $S$-nullcline is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
0 &= - \beta I S + \mu(1 - S - I) \\[0.50em]
\beta I S &= \mu(1 - S) - \mu I \\[0.50em]
I &= \frac{\mu(1 - S)}{\beta S + \mu} \enspace ,
\end{aligned} %]]></script>
<p>which is a nonlinear function in $S$. The nullclines help us again in “organizing” the vector field. This can be seen best in the right panel above. In particular, and similar to the SIR model, we will again have a decrease in the proportion of infected people to the left of the line given by $S = \gamma / \beta$, that is, when $S < \gamma / \beta$, and an increase to the right of the line, that is, when $S > \gamma / \beta$. Similarly, the proportion of susceptible people increases when the system is “below” the $S$-nullcline, while it decreases when the system is “above” the $S$-nullcline.</p>
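<p>The role of the nullclines can be probed directly by evaluating the derivatives on either side of them. The following sketch does this for one example parameter setting; the function name <code>s_nullcline</code>, the test point $S = 0.5$, and the offsets are arbitrary choices made for illustration.</p>

```r
beta <- 3/8; gamma <- 1/8; mu <- 1/8

# S-nullcline: the value of I at which dS/dt = 0 for a given S.
s_nullcline <- function(S) mu * (1 - S) / (beta * S + mu)

dS <- function(S, I) -beta * I * S + mu * (1 - S - I)
dI <- function(S, I) beta * I * S - gamma * I

S <- 0.5
I_on_nullcline <- s_nullcline(S)

# Sign of dS/dt on, below, and above the S-nullcline.
cat("dS on the nullcline:", dS(S, I_on_nullcline), "\n")
cat("dS below (I - 0.05):", dS(S, I_on_nullcline - 0.05), "\n")
cat("dS above (I + 0.05):", dS(S, I_on_nullcline + 0.05), "\n")

# Sign of dI/dt on either side of the I-nullcline S = gamma / beta.
cat("dI left of S = gamma / beta: ", dI(gamma / beta - 0.1, 0.2), "\n")
cat("dI right of S = gamma / beta:", dI(gamma / beta + 0.1, 0.2), "\n")
```

<p>The signs of the printed values show in which direction $S$ and $I$ move in each region of the vector field.</p>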
<h2 id="bifurcations">Bifurcations</h2>
<p>In the vector fields above we have seen that the system can go from having only one fixed point to having two fixed points. Whenever a fixed point is destroyed or created or changes its stability as an internal parameter is varied — here the ratio of $\gamma / \beta$ — we speak of a <em>bifurcation</em>.</p>
<p>As pointed out above, the second equilibrium point only exists for $\gamma / \beta \leq 1$. As long as $\gamma / \beta < 1$, we have two distinct fixed points. At $\gamma / \beta = 1$, the second fixed point becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(S^{\star}, I^{\star}) &= \left(1, \frac{\mu\left(1 - 1\right)}{\gamma + \mu}\right) = (1, 0) \enspace ,
\end{aligned} %]]></script>
<p>which equals the first fixed point. Thus, at $\gamma / \beta = 1$, the two fixed points merge into one; this is the bifurcation point. This makes sense: if $\gamma / \beta < 1$, we have that $\beta / \gamma > 1$, and so an outbreak occurs, which establishes the disease in the population since we allow for reinfections.</p>
<p>We can visualize this change in fixed points in a so-called <em>bifurcation diagram</em>. A bifurcation diagram shows how the fixed points and their stability change as we vary an internal parameter. Since we deal with two-dimensional fixed points, we split the bifurcation diagram into two: the left panel shows how the $I^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$, and the right panel shows how the $S^{\star}$ part of the fixed point changes as we vary $\gamma / \beta$.</p>
<p><img src="/assets/img/2020-03-22-Nonlinear-Infection.Rmd/unnamed-chunk-15-1.png" title="plot of chunk unnamed-chunk-15" alt="plot of chunk unnamed-chunk-15" style="display: block; margin: auto;" /></p>
<p>The left panel shows that as long as $\gamma / \beta < 1$, which implies that $\beta / \gamma > 1$, we have two fixed points, where the stable fixed point is the one with a non-zero proportion of infected people: the disease becomes established. These fixed points are on the diagonal line, indicated as black dots. Interestingly, since $I^{\star} < \mu / (\gamma + \mu)$, the proportion of infected people can never be stable at a value larger than $1/2$ when $\mu = \gamma$, as in the examples above. There also exist unstable fixed points for which $I^{\star} = 0$. These fixed points are unstable because even a single infected person will spread the disease, resulting in more infected people. At the point where $\beta = \gamma$, the two fixed points merge: the disease can no longer become established in the population, and the proportion of infected people always goes to zero.</p>
<p>Similarly, the right panel shows how the fixed points $S^{\star}$ change as a function of $\gamma / \beta$. Since the infection spreads for $\beta > \gamma$, the fixed point $S^{\star} = 1$ is unstable, as the proportion of susceptible people must decrease since they become infected. For outbreaks that become increasingly mild as $\gamma / \beta \rightarrow 1$, the stable proportion of susceptible people increases, reaching $S^{\star} = 1$ when at last $\gamma = \beta$.</p>
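<p>The numbers behind such a diagram can be tabulated with a few lines of R. This is a rough sketch rather than the code that produced the figure: the grid of ratios, the choice $\gamma = \mu = 1/8$, and the column names are assumptions.</p>

```r
# Fixed points as a function of the ratio gamma / beta, for fixed gamma and mu.
gamma <- 1/8; mu <- 1/8
ratio <- (1:15) / 10            # gamma / beta from 0.1 to 1.5
beta  <- gamma / ratio

# Second fixed point; it exists only where gamma / beta <= 1.
I_second <- ifelse(ratio <= 1, mu * (1 - ratio) / (gamma + mu), NA)
S_second <- ifelse(ratio <= 1, ratio, NA)

# The fixed point (1, 0) exists for every ratio; it is stable only when
# its eigenvalue beta - gamma is negative.
free_stable <- (beta - gamma) < 0

bifurcation <- data.frame(ratio, S_second, I_second, free_stable)
print(bifurcation)
```

<p>Plotting <code>I_second</code> and <code>S_second</code> against <code>ratio</code> gives the two branches of stable fixed points, which meet the fixed point $(1, 0)$ at $\gamma / \beta = 1$.</p>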
<p>In summary, we have seen how the SIRS model extends the SIR model by allowing reinfections. This resulted in the possibility of more interesting fixed points, which included a non-zero proportion of infected people. In the SIRS model, then, a disease can become established in the population. In contrast to the SIR model, we have also seen that the SIRS model allows for bifurcations, going from two fixed points in times of outbreaks ($\beta > \gamma$) to one fixed point in times of no outbreaks ($\beta < \gamma$).</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen that nonlinear differential equations are a powerful tool to model real-world phenomena. They allow us to model vastly more complicated behaviour than is possible with linear differential equations, yet they rarely admit closed-form solutions. Luckily, the time-evolution of a system can be straightforwardly computed with basic numerical techniques such as Euler’s method. Using the simple logistic equation, we have seen how to analyze the stability of fixed points: simply pretend that the system is linear close to a fixed point.</p>
<p>The logistic equation has only one state variable: the size of the population. More interesting dynamics occur when variables interact, and we have seen how the simple SIR model can help us understand the spread of infectious diseases. Although the model has only two parameters, it shows that an outbreak occurs only when $R_0 = \beta / \gamma > 1$. Moreover, the stable fixed points always included $I = 0$, implying that the disease always gets eradicated. This is not true for all diseases because recovered people might become reinfected. The SIRS model amends this by introducing a parameter $\mu$ that quantifies how quickly recovered people can become susceptible again. As expected, this led to stable states in which the disease becomes established in the population.</p>
<p>On our journey to understand these systems, we have seen how to quantify the stability of a fixed point using linear stability analysis, how to visualize the dynamics of a system using vector fields, how nullclines give structure to such vector fields, and how bifurcations can drastically change the dynamics of a system.</p>
<p>The SIR and the SIRS models discussed here are without a doubt crude approximations of the real dynamics of the spread of infectious diseases. There exist <a href="https://en.wikipedia.org/wiki/Compartmental_models_in_epidemiology#Elaborations_on_the_basic_SIR_model">several ways to extend them</a>. One way to do so, for example, is to add an <em>exposed</em> population of people who are infected but not yet infectious; see <a href="https://gabgoh.github.io/COVID/index.html">here</a> for a visualization of an elaborated version of this model in the context of SARS-CoV-2. These basic compartment models assume spatial homogeneity, which is a substantial simplification. There are various ways to include spatial structure (e.g., Watts, 2005; Riley, 2007), but that is for another blog post.</p>
<hr />
<p>I would like to thank <a href="https://twitter.com/theBonferroni">Adam Finnemann</a>, <a href="https://twitter.com/AnToniPichler">Anton Pichler</a>, and <a href="https://twitter.com/Oisin_Ryan_">Oísin Ryan</a> for very helpful comments on this blog post.</p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Strogatz, S. H. (<a href="http://www.stevenstrogatz.com/books/nonlinear-dynamics-and-chaos-with-applications-to-physics-biology-chemistry-and-engineering">2015</a>). Nonlinear Dynamics and Chaos: With applications to Physics, Biology, Chemistry, and Engineering. Colorado, US: Westview Press.</li>
<li>Hirsch, M. W., Smale, S., & Devaney, R. L. (<a href="https://books.google.nl/books?hl=en&lr=&id=rly1AAmAXh8C&oi=fnd&pg=PP1&dq=differential+equations+hirsch+smale&ots=pbe8hf2vQS&sig=XAweKN9n_n00ph33V7heYNjtjbI#v=onepage&q=differential%20equations%20hirsch%20smale&f=false">2013</a>). Differential equations, dynamical systems, and an introduction to chaos. Boston, US: Academic Press.</li>
<li>Riley, S. (<a href="https://science.sciencemag.org/content/316/5829/1298?casa_token=6o-2ffWgMtoAAAAA:N5r-4nxfob2OhYutIaFKh4n5kxTeTMNkiAxLdipRtmFrlIhkLL69NOYUBXdYcUPG_pT8LCiGXFLpY4DI">2007</a>). Large-scale spatial-transmission models of infectious disease. <em>Science, 316</em>(5829), 1298-1301.</li>
<li>Watts, D. J., Muhamad, R., Medina, D. C., & Dodds, P. S. (<a href="https://www.pnas.org/content/102/32/11157">2005</a>). Multiscale, resurgent epidemics in a hierarchical metapopulation model. <em>Proceedings of the National Academy of Sciences, 102</em>(32), 11157-11162.</li>
</ul>
Reviewing one year of blogging2019-12-27T12:00:00+00:002019-12-27T12:00:00+00:00https://fabiandablander.com/r/Reviewing-2019<p>Writing blog posts has been one of the most rewarding experiences for me over the last year. Some posts turned out quite long, others I could keep more concise. Irrespective of length, however, I have managed to publish one post every month, and you can infer the occasional frenzy that ensued from the distribution of the dates the posts appeared on; nine of them saw the light within the last three days of a month.</p>
<p>Some births were easier than others, yet every post evokes distinct memories: of perusing history books in the library and the Saturday sun; of writing down Gaussian integrals in overcrowded trains; of solving differential equations while singing; of hunting down typos before hurrying to parties. So to end this very productive year of blogging, below I provide a teaser of each previous post, summarizing one or two key takeaways. Let’s go!</p>
<h1 id="an-introduction-to-causal-inference">An introduction to Causal inference</h1>
<p>Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. It dethrones randomized controlled trials as the only tool to license causal statements, describing the conditions under which this feat is possible even in observational data.</p>
<p>One key takeaway is to think about causal inference in a hierarchy. Association is at the most basic level, merely allowing us to say that two variables are somehow related. Moving upwards, the <em>do</em>-operator allows us to model interventions, answering questions such as “what would happen if we force every patient to take the drug?” Directed Acyclic Graphs (DAGs), as visualized in the figure below, allow us to visualize associations and causal relations.</p>
<center>
<img src="../assets/img/Seeing-vs-Doing-II.png" align="center" style="padding: 00px 00px 00px 00px;" width="750" height="500" />
</center>
<p>On the third and final level we find counterfactual statements. These follow from so-called <em>Structural Causal Models</em>, the building block of this approach to causal inference. Counterfactuals allow us to answer questions such as “would the patient have recovered had she been given the drug, even though she has not received the drug and did not recover?” Needless to say, this requires strong assumptions; yet if we want to endow machines with human-level reasoning or formalize concepts such as fairness, we need to make such strong assumptions.</p>
<p>One key practical takeaway from this blog post is the definition of confounding: an effect is confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$. This means that blindly entering all variables into a regression to “control” for them is misguided; instead, one should carefully think about the underlying causal relations between variables so as not to induce spurious associations. You can read the full blog post <a href="https://fabiandablander.com/r/Causal-Inference.html">here</a>.</p>
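<p>To make the definition of confounding concrete, here is a minimal sketch. It assumes a hypothetical binary confounder $Z$ that influences both $X$ and $Y$; all probabilities and names below are made up for illustration and do not come from the original post. The interventional distribution is obtained by adjusting for $Z$, which is valid under this assumed graph.</p>

```python
# Hypothetical numbers: Z (say, disease severity) affects both X (drug) and Y (recovery).
P_Z = {0: 0.5, 1: 0.5}                      # p(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}             # p(X = 1 | Z = z)
P_Y1_GIVEN_XZ = {(0, 0): 0.8, (1, 0): 0.9,  # p(Y = 1 | X = x, Z = z), keyed by (x, z)
                 (0, 1): 0.3, (1, 1): 0.4}


def p_x_given_z(x, z):
    """p(X = x | Z = z) for binary x."""
    return P_X1_GIVEN_Z[z] if x == 1 else 1 - P_X1_GIVEN_Z[z]


def p_y1_given_x(x):
    """Observational p(Y = 1 | X = x): averages over z using p(z | x)."""
    joint = {z: P_Z[z] * p_x_given_z(x, z) for z in P_Z}
    norm = sum(joint.values())
    return sum(P_Y1_GIVEN_XZ[(x, z)] * joint[z] / norm for z in P_Z)


def p_y1_do_x(x):
    """Interventional p(Y = 1 | do(X = x)): averages over z using p(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


for x in (0, 1):
    print(x, round(p_y1_given_x(x), 3), round(p_y1_do_x(x), 3))
```

<p>With these made-up numbers, conditioning suggests the drug hurts (recovery probability 0.70 without versus 0.50 with the drug), whereas intervening shows that it helps (0.55 versus 0.65). The two quantities differ, so the effect is confounded.</p>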
<h1 id="a-brief-primer-on-variational-inference">A brief primer on Variational Inference</h1>
<p>Bayesian inference using Markov chain Monte Carlo can be notoriously slow. The key idea behind variational inference is to recast Bayesian inference as an optimization problem. In particular, we try to find a distribution $q^\star(\mathbf{z})$ that best approximates the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$ in terms of the Kullback-Leibler divergence:</p>
<script type="math/tex; mode=display">q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace .</script>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In this blog post, I explain how a particular form of variational inference — <em>coordinate ascent mean-field variational inference</em> — leads to fast computations. Specifically, I walk you through deriving the variational inference scheme for a simple linear regression example. One key takeaway from this post is that Bayesians can use optimization to speed up computation. However, variational inference requires problem-specific, often tedious calculations. Black-box variational inference schemes can alleviate this issue, but Stan’s implementation — <em>automatic differentiation variational inference</em> — seems to work poorly, as detailed in the post (see also Ben Goodrich’s comment). You can read the full blog post <a href="https://fabiandablander.com/r/Variational-Inference.html">here</a>.</p>
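<p>The optimization view can be illustrated with a toy sketch. This is not the coordinate ascent scheme derived in the post: it assumes a hypothetical discrete target distribution <code>p</code>, a one-parameter binomial family for $q$, and a brute-force grid search in place of a proper optimizer.</p>

```python
import math


def kl(q, p):
    """KL(q || p) for discrete distributions given as lists of probabilities."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def binomial_pmf(n, theta):
    """Probabilities of 0..n successes under Binomial(n, theta)."""
    return [math.comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]


# Hypothetical target over z in {0, ..., 4}; it stands in for the posterior.
p = [0.05, 0.15, 0.30, 0.35, 0.15]

# Variational family Q: Binomial(4, theta) with theta on a grid.
thetas = [i / 100 for i in range(1, 100)]
best_theta = min(thetas, key=lambda t: kl(binomial_pmf(4, t), p))

print(best_theta, kl(binomial_pmf(4, best_theta), p))
```

<p>In realistic problems the posterior is only known up to its normalizing constant, which is why variational inference maximizes a lower bound on the marginal likelihood instead of evaluating the divergence directly.</p>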
<h1 id="harry-potter-and-the-power-of-bayesian-constrained-inference">Harry Potter and the Power of Bayesian Constrained Inference</h1>
<p>Are you a Gryffindor, Slytherin, Hufflepuff, or Ravenclaw? In this blog post, I explain a <em>prior predictive</em> perspective on model selection by having Harry, Ron, and Hermione (three subjective Bayesians) engage in a small prediction contest. There are two key takeaways. First, the prior does not completely constrain a model’s predictions, as these are made by combining the prior with the likelihood. For example, even though Ron has a point prior on $\theta = 0.50$ in the figure below, his prediction is not that $y = 5$ always; instead, he predicts a distribution that is centered around $y = 5$. Similarly, while Hermione believes that $\theta > 0.50$, she puts probability mass on values $y < 5$.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>The second takeaway is computational. In particular, one can compute the Bayes factor in favour of a <em>constrained</em> model ($\mathcal{M}_r$), in which $\theta$ is order-constrained (e.g., $\theta > 0.50$), over the <em>unconstrained</em> model ($\mathcal{M}_1$), in which the parameter $\theta$ is free to vary, as:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace .</script>
<p>In words, this Bayes factor is given by the ratio of the posterior probability of $\theta$ being in line with the restriction to the prior probability of $\theta$ being in line with the restriction. You can read the full blog post <a href="https://fabiandablander.com/r/Bayes-Potter.html">here</a>.</p>
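<p>As a rough sketch of this computation, assume a binomial likelihood with a Beta(1, 1) prior under $\mathcal{M}_1$ and hypothetical data of $y = 9$ successes in $n = 10$ trials (these choices are illustrative, not taken from the post). For integer parameters, the Beta distribution function can be written as a binomial tail sum, so the standard library suffices.</p>

```python
import math


def beta_cdf_int(x, a, b):
    """CDF of Beta(a, b) at x for positive integers a and b.

    Uses the identity P(Beta(a, b) <= x) = P(Binomial(a + b - 1, x) >= a).
    """
    n = a + b - 1
    return sum(math.comb(n, j) * x**j * (1 - x) ** (n - j) for j in range(a, n + 1))


def bayes_factor_r1(y, n, a=1, b=1, cut=0.5):
    """Posterior over prior probability that theta exceeds `cut` under M_1."""
    prior_prob = 1 - beta_cdf_int(cut, a, b)
    post_prob = 1 - beta_cdf_int(cut, a + y, b + n - y)
    return post_prob / prior_prob


bf_r1 = bayes_factor_r1(y=9, n=10)  # hypothetical data
print(bf_r1)
```

<p>Because the posterior probability cannot exceed one, this Bayes factor is bounded above by the reciprocal of the prior probability of the restriction, here $1 / 0.50 = 2$.</p>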
<h1 id="love-affairs-and-linear-differential-equations">Love affairs and linear differential equations</h1>
<blockquote>
When you can fall for chains of silver, you can fall for chains of gold <br />
You can fall for pretty strangers and the promises they hold <br />
You promised me everything, you promised me thick and thin, yeah <br />
Now you just say "Oh, Romeo, yeah, you know I used to have a scene with him"
</blockquote>
<p>Differential equations are the sine qua non of modeling how systems change. This blog post provides an introduction to <em>linear</em> differential equations, which admit closed-form solutions, and analyzes the stability of fixed points.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>The key takeaways are that the natural basis of analysis is the basis spanned by the eigenvectors, and that the stability of fixed points depends directly on the eigenvalues. A system with imaginary eigenvalues can exhibit oscillating behaviour, as shown in the figure above.</p>
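<p>The link between eigenvalues and stability can be sketched for a two-dimensional linear system $\dot{\mathbf{x}} = A \mathbf{x}$. The function names and the example matrices below are hypothetical; the classification only looks at the signs of the real parts and at whether the eigenvalues have an imaginary part.</p>

```python
import cmath


def eigenvalues_2x2(a, b, c, d):
    """Eigenvalues of [[a, b], [c, d]] computed from trace and determinant."""
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace**2 - 4 * det)
    return (trace + root) / 2, (trace - root) / 2


def classify(a, b, c, d):
    """Return (stability label, whether solutions oscillate) for the origin."""
    l1, l2 = eigenvalues_2x2(a, b, c, d)
    largest_real = max(l1.real, l2.real)
    if largest_real < 0:
        label = "stable"      # all perturbations decay
    elif largest_real > 0:
        label = "unstable"    # some perturbation grows
    else:
        label = "marginal"    # neither growth nor decay along some direction
    return label, l1.imag != 0


print(classify(0, 1, -1, 0))   # purely imaginary eigenvalues: endless oscillation
print(classify(-1, 0, 0, -2))  # two negative real eigenvalues: decay without oscillation
```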
<p>I think I rarely had more fun writing than when writing this blog post. Inspired by Strogatz (1988), it playfully introduces linear differential equations by classifying the types of relationships Romeo and Juliet might find themselves in. While writing it, I also listened to a lot of Dire Straits, Bob Dylan, Daft Punk, and others, whose lyrics decorate the post’s sections. You can read the full blog post <a href="https://fabiandablander.com/r/Linear-Love.html">here</a>.</p>
<h1 id="the-fibonacci-sequence-and-linear-algebra">The Fibonacci sequence and linear algebra</h1>
<p>1, 1, 2, 3, 5, 8, 13, 21, … The Fibonacci sequence might well be the most widely known mathematical sequence. In this blog post, I discuss how Leonardo Bonacci derived it as a solution to a puzzle about procreating rabbits, and how linear algebra can help us find a closed-form expression of the $n^{\text{th}}$ Fibonacci number.</p>
<div style="text-align:center;">
<img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" />
</div>
<p>The key insight is to realize that the $n^{\text{th}}$ Fibonacci number can be computed by repeatedly performing matrix multiplications. If one <em>diagonalizes</em> this matrix, changing basis to — again! — the eigenbasis, then the repeated application of this matrix can be expressed as a scalar power, yielding a closed-form expression of the $n^{\text{th}}$ Fibonacci number. That’s a mouthful; you can read the blog post which explains things much better <a href="https://fabiandablander.com/r/Fibonacci.html">here</a>.</p>
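<p>Both routes can be sketched in a few lines. The sketch assumes the indexing $F_1 = F_2 = 1$ and uses the well-known closed form that results from diagonalization (Binet's formula); the function names are hypothetical.</p>

```python
import math


def fib_matrix(n):
    """n-th Fibonacci number via repeated multiplication by [[1, 1], [1, 0]]."""
    a, b, c, d = 1, 0, 0, 1  # start from the identity matrix [[a, b], [c, d]]
    for _ in range(n):
        # [[a, b], [c, d]] times [[1, 1], [1, 0]]
        a, b, c, d = a + b, a, c + d, c
    return b  # the n-th power equals [[F_{n+1}, F_n], [F_n, F_{n-1}]]


def fib_closed_form(n):
    """n-th Fibonacci number from the eigenvalues of the same matrix."""
    sqrt5 = math.sqrt(5)
    phi, psi = (1 + sqrt5) / 2, (1 - sqrt5) / 2  # the two eigenvalues
    return round((phi**n - psi**n) / sqrt5)


print([fib_matrix(n) for n in range(1, 9)])
print(fib_closed_form(10))
```

<p>The closed form relies on floating point arithmetic, so the rounding step only recovers the exact integer for moderate $n$.</p>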
<h1 id="spurious-correlations-and-random-walks">Spurious correlations and random walks</h1>
<p>I was at the Santa Fe Complex Systems Summer School — the experience of a lifetime — when Anton Pichler and Andrea Bacilieri, two economists, told me that two independent random walks can be correlated substantially. I was quite shocked, to be honest. This blog post investigates this issue, concluding that regressing one random walk onto another is <em>nonsensical</em>, that is, leads to an inconsistent parameter estimate.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>As the figure above shows, such spurious correlation also occurs for independent AR(1) processes with increasing autocorrelation $\phi$, even though the resulting estimate is consistent. The key takeaway is therefore to be careful when correlating time series. You can read the full blog post <a href="https://fabiandablander.com/r/Spurious-Correlation.html">here</a>.</p>
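<p>A small simulation conveys the effect. It is only a sketch: the series length, number of replications, and seed are arbitrary choices, and the average absolute correlation is used as a crude summary rather than the analysis from the post.</p>

```python
import random


def pearson(x, y):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def ar1(n, phi, rng):
    """AR(1) series x_t = phi * x_{t-1} + noise; phi = 1 gives a random walk."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0, 1)
        out.append(x)
    return out


def mean_abs_corr(phi, n=200, reps=200, seed=1):
    """Average |correlation| between pairs of independent AR(1) series."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += abs(pearson(ar1(n, phi, rng), ar1(n, phi, rng)))
    return total / reps


for phi in (0.0, 0.5, 0.9, 1.0):
    print(phi, round(mean_abs_corr(phi), 3))
```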
<h1 id="bayesian-modeling-using-stan-a-case-study">Bayesian modeling using Stan: A case study</h1>
<p>Model selection is a difficult problem. In Bayesian inference, we may distinguish between two approaches to model selection: a <em>(prior) predictive</em> perspective based on marginal likelihoods, and a <em>(posterior) predictive</em> perspective based on leave-one-out cross-validation.</p>
<p><img src="../assets/img/prediction-perspectives.png" align="center" style="padding: 10px 10px 10px 10px;" /></p>
<p>A prior predictive perspective, illustrated in the left panel of the figure above, evaluates models based on their predictions about the data actually observed. These predictions are made by combining likelihood and prior. In contrast, a posterior predictive perspective, illustrated in the right panel of the figure above, evaluates models based on their predictions about data that we have not observed. These predictions cannot be directly computed, but can be approximated by combining likelihood and posterior in a leave-one-out cross-validation scheme. The key takeaway of this blog post is to appreciate this distinction, noting that not all Bayesians agree on how to select among models.</p>
<p>The post illustrates these two perspectives with a case study: does the relation between practice and reaction time follow a power law or an exponential function? You can read the full blog post <a href="https://fabiandablander.com/r/Law-of-Practice.html">here</a>.</p>
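<p>The two perspectives can be contrasted on a much simpler problem than the one in the post. The sketch below swaps the practice data for hypothetical coin flips (8 heads, 2 tails) and compares a model with $\theta$ fixed at 0.50 to a model with a Beta(1, 1) prior; for this conjugate setup, both the marginal likelihood and the leave-one-out score are available exactly.</p>

```python
import math


def log_beta(a, b):
    """Logarithm of the Beta function."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_marginal_beta(heads, tails, a=1.0, b=1.0):
    """Log marginal likelihood of one specific flip sequence, theta ~ Beta(a, b)."""
    return log_beta(a + heads, b + tails) - log_beta(a, b)


def loo_log_score_beta(heads, tails, a=1.0, b=1.0):
    """Exact leave-one-out log predictive score, theta ~ Beta(a, b)."""
    denom = a + b + heads + tails - 1
    score = 0.0
    if heads > 0:  # each held-out head is predicted from the remaining flips
        score += heads * math.log((a + heads - 1) / denom)
    if tails > 0:  # each held-out tail is predicted from the remaining flips
        score += tails * math.log((b + tails - 1) / denom)
    return score


heads, tails = 8, 2                       # hypothetical data
log_m0 = (heads + tails) * math.log(0.5)  # theta fixed at 0.50
bf_10 = math.exp(log_marginal_beta(heads, tails) - log_m0)
loo_m0 = log_m0                           # a fixed theta predicts the same without any data
loo_m1 = loo_log_score_beta(heads, tails)

print(bf_10, loo_m1 - loo_m0)
```

<p>Here both criteria prefer the model with a free parameter, but they answer different questions and need not agree in general.</p>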
<h1 id="two-perspectives-on-regularization">Two perspectives on regularization</h1>
<p>Regularization is the process of adding information to an estimation problem so as to avoid extreme estimates. This blog post explores regularization both from a Bayesian and from a classical perspective, using the simplest example possible: estimating the bias of a coin.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>The key takeaway is the observation that Bayesians have a natural tool for regularization at their disposal: the prior. In contrast to the left panel in the figure above, which shows a flat prior, the right panel illustrates that using a weakly informative prior that peaks at $\theta = 0.50$ shifts the resulting posterior distribution towards that value. In classical statistics, one usually uses penalized maximum likelihood approaches — think lasso and ridge regression — to achieve regularization. You can read the full blog post <a href="https://fabiandablander.com/r/Regularization.html">here</a>.</p>
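<p>For the coin example, the correspondence between a prior and a penalty can be sketched directly. The data (three heads in three flips) and the Beta(2, 2) prior are hypothetical choices; the grid search is only there to show that maximizing the penalized log-likelihood recovers the posterior mode.</p>

```python
import math


def map_estimate(heads, n, a=1.0, b=1.0):
    """Posterior mode under a Beta(a, b) prior; a = b = 1 gives the MLE."""
    return (heads + a - 1) / (n + a + b - 2)


def penalized_ml(heads, n, a, b, grid=1000):
    """Maximize log-likelihood plus log-prior (up to a constant) on a grid."""
    tails = n - heads

    def objective(theta):
        log_lik = heads * math.log(theta) + tails * math.log(1 - theta)
        penalty = (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)
        return log_lik + penalty

    return max((i / grid for i in range(1, grid)), key=objective)


print(map_estimate(3, 3))          # flat prior: the extreme estimate 1.0
print(map_estimate(3, 3, 2, 2))    # prior peaked at 0.50 pulls the estimate to 0.8
print(penalized_ml(3, 3, 2, 2))    # same answer from the penalized likelihood
```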
<h1 id="variable-selection-using-gibbs-sampling">Variable selection using Gibbs sampling</h1>
<p>“Which variables are important?” is a key question in science and statistics. In this blog post, I focus on linear models and discuss a Bayesian solution to this problem using spike-and-slab priors and the Gibbs sampler, a computational method to sample from a joint distribution using only conditional distributions.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>Parameter estimation is almost always conditional on a specific model. One key takeaway from this blog post is that there is uncertainty associated with the model itself. The approach outlined in the post accounts for this uncertainty by using spike-and-slab priors, yielding posterior distributions not only for parameters but also for models. To incorporate this model uncertainty into parameter estimation, one can average across models; the figure above shows the <em>model-averaged</em> posterior distribution for six variables discussed in the post. You can read the full blog post <a href="https://fabiandablander.com/r/Spike-and-Slab.html">here</a>.</p>
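<p>The spike-and-slab sampler from the post is too long to reproduce here, but the core idea of Gibbs sampling, drawing from a joint distribution using only its conditionals, can be sketched on a simpler target. The example below assumes a standard bivariate Gaussian with correlation $\rho = 0.80$; the function names, chain length, and seed are arbitrary.</p>

```python
import random


def gibbs_bivariate_normal(rho, n_samples=20000, burn_in=1000, seed=1):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho."""
    rng = random.Random(seed)
    cond_sd = (1 - rho**2) ** 0.5
    x1, x2, samples = 0.0, 0.0, []
    for i in range(n_samples + burn_in):
        x1 = rng.gauss(rho * x2, cond_sd)  # draw from p(x1 | x2)
        x2 = rng.gauss(rho * x1, cond_sd)  # draw from p(x2 | x1)
        if i >= burn_in:
            samples.append((x1, x2))
    return samples


def sample_correlation(samples):
    """Pearson correlation of the two coordinates of the samples."""
    n = len(samples)
    m1 = sum(s[0] for s in samples) / n
    m2 = sum(s[1] for s in samples) / n
    s12 = sum((s[0] - m1) * (s[1] - m2) for s in samples)
    s11 = sum((s[0] - m1) ** 2 for s in samples)
    s22 = sum((s[1] - m2) ** 2 for s in samples)
    return s12 / (s11 * s22) ** 0.5


draws = gibbs_bivariate_normal(rho=0.8)
print(sample_correlation(draws))  # should be close to 0.8
```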
<h1 id="two-properties-of-the-gaussian-distribution">Two properties of the Gaussian distribution</h1>
<p>The Gaussian distribution is special for a number of reasons. In this blog post, I focus on two such reasons, namely the fact that it is closed under marginalization and conditioning. This means that if you start out with a <em>p</em>-dimensional Gaussian distribution, and you either <em>marginalize over</em> or <em>condition on</em> one of its components, the resulting distribution will again be Gaussian.</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>The figure above illustrates the difference between marginalization and conditioning in the two-dimensional case. The left panel shows a bivariate Gaussian distribution with a high correlation $\rho = 0.80$ (blue contour lines). Conditioning means incorporating information, and observing that $X_2 = 2$ shifts the distribution of $X_1$ towards this value (purple line). If we do not observe $X_2$, we can incorporate our uncertainty about its likely values by marginalizing it out. This results in a Gaussian distribution that is centered on zero (black line). The right panel shows that conditioning on $X_2 = 2$ does not change the distribution of $X_1$ in the case of no correlation $\rho = 0$. You can read the full blog post <a href="https://fabiandablander.com/statistics/Two-Properties.html">here</a>.</p>
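<p>For the bivariate case, conditioning has a simple closed form: given $X_2 = x_2$, the variable $X_1$ is Gaussian with mean $\mu_1 + \rho \frac{\sigma_1}{\sigma_2}(x_2 - \mu_2)$ and variance $\sigma_1^2 (1 - \rho^2)$, while marginalizing simply leaves $X_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$. A minimal sketch, assuming zero means and unit variances as in the figure (the function name is hypothetical):</p>

```python
def condition_bivariate(mu1, mu2, sd1, sd2, rho, x2):
    """Mean and variance of X1 given X2 = x2 for a bivariate Gaussian."""
    mean = mu1 + rho * sd1 / sd2 * (x2 - mu2)
    variance = sd1**2 * (1 - rho**2)
    return mean, variance


print(condition_bivariate(0, 0, 1, 1, rho=0.8, x2=2))  # mean shifts towards x2, variance shrinks
print(condition_bivariate(0, 0, 1, 1, rho=0.0, x2=2))  # no correlation: nothing changes
```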
<h1 id="curve-fitting-and-the-gaussian-distribution">Curve fitting and the Gaussian distribution</h1>
<p>In this blog post, we take a look at the mother of all curve fitting problems — fitting a straight line to a number of points. The figure below shows that one point in the Euclidean plane is insufficient to define a line (left), two points constrain it perfectly (middle), and three is too much (right). In science we usually deal with more than two data points which are corrupted by noise. How do we fit a line to such noisy observations?</p>
<p><img src="/assets/img/2019-12-27-Reviewing-2019.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto 0 auto auto;" /></p>
<p>The method of least squares provides an answer. In addition to an explanation of least squares, a key takeaway of this post is an understanding of the historical context in which least squares arose. Statistics is fascinating in part because of its rich history. On our journey through time we meet Legendre, Gauss, Laplace, and Galton. The latter describes the central limit theorem (one of the most stunning theorems in statistics) in beautifully poetic words:</p>
<blockquote>
<p>“I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66)</p>
</blockquote>
<p>You can read the full blog post <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">here</a>.</p>
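<p>The least squares fit of a straight line mentioned above has a closed-form solution: the slope is the ratio of the sample covariance to the variance of the predictor, and the line passes through the point of means. A minimal sketch with made-up data points:</p>

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for the line y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


print(fit_line([0, 1], [1, 3]))        # two points determine the line exactly
print(fit_line([0, 1, 2], [0, 1, 1]))  # three points: the best compromise
```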
<p>I hope that you enjoyed reading some of these posts at least a quarter as much as I enjoyed writing them. I am committed to making 2020 a successful year of blogging, too. However, I will most likely decrease the output frequency by half, aiming to publish one post every two months. It is a truth universally acknowledged that a person in want of a PhD must be in possession of publications, and so I will have to shift my focus accordingly (at least a little bit). At the same time, I also want to further increase my involvement in the “data for the social good” scene. Life certainly is one complicated optimization problem. I wish you all the best for the new year!</p>
<hr />
<p><em>I would like to thank Don van den Bergh, Sophia Crüwell, Jonas Haslbeck, Oisín Ryan, Lea Jakob, Quentin Gronau, Nathan Evans, Andrea Bacilieri, and Anton Pichler for helpful comments on (some of) these blog posts.</em></p>Fabian DablanderWriting blog posts has been one of the most rewarding experiences for me over the last year. Some posts turned out quite long, others I could keep more concise. Irrespective of length, however, I have managed to publish one post every month, and you can infer the occassional frenzy that ensued from the distribution of the dates the posts appeared on — nine of them saw the light within the last three days of a month. Some births were easier than others, yet every post evokes distinct memories: of perusing history books in the library and the Saturday sun; of writing down Gaussian integrals in overcrowded trains; of solving differential equations while singing; of hunting down typos before hurrying to parties. So to end this very productive year of blogging, below I provide a teaser of each previous post, summarizing one or two key takeaways. Let’s go! An introduction to Causal inference Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. It dethrones randomized control trials as the only tool to license causal statements, describing the conditions under which this feat is possible even in observational data. One key takeaway is to think about causal inference in a hierarchy. Association is at the most basic level, merely allowing us to say that two variables are somehow related. Moving upwards, the do-operator allows us to model interventions, answering questions such as “what would happen if we force every patient to take the drug”? Directed Acyclic Graphs (DAGs), as visualized in the figure below, allow us to visualize associations and causal relations. On the third and final level we find counterfactual statements. 
These follow from so-called Structural Causal Models — the building block of this approach to causal inference. Counterfactuals allow us to answer questions such as “would the patient have recovered had she been given the drug, even though she has not received the drug and did not recover”? Needless to say, this requires strong assumptions; yet if we want to endow machines with human-level reasoning or formalize concepts such as fairness, we need to make such strong assumptions. One key practical take a way from this blog post is the definition of confounding: an effect is confounded if $p(Y \mid X) \neq p(Y \mid do(X = x))$. This means that blindly entering all variables into a regression to “control” for them is misguided; instead, one should carefuly think about the underlying causal relations between variables so as to not induce spurious associations. You can read the full blog post here. A brief primer on Variational Inference Bayesian inference using Markov chain Monte Carlo can be notoriously slow. The key idea behind variational inference is to recast Bayesian inference as an optimization problem. In particular, we try to find a distribution $q^\star(\mathbf{z})$ that best approximates the posterior distribution $p(\mathbf{z} \mid \mathbf{x})$ in terms of the Kullback-Leibler divergence: In this blog post, I explain how a particular form of variational inference — coordinate ascent mean-field variational inference — leads to fast computations. Specifically, I walk you through deriving the variational inference scheme for a simple linear regression example. One key takeaway from this post is that Bayesians can use optimization to speed up computation. However, variational inference requires problem-specific, often tedious calculations. Black-box variational inference schemes can alleviate this issue, but Stan’s implementation — automatic differentiation variational inference — seems to work poorly, as detailed in the post (see also Ben Goodrich’s comment). 
You can read the full blog post here. Harry Potter and the Power of Bayesian Constrained Inference Are you a Gryffindor, Slytherin, Hufflepuff, or Ravenclaw? In this blog post, I explain a prior predictive perspective on model selection by having Harry, Ron, and Hermione — three subjective Bayesians — engage in a small prediction contest. There are two key takeaways. First, the prior does not completely constrain a model’s prediction, as these are being made by combining the prior with the likelihood. For example, even though Ron has a point prior on $\theta = 0.50$ in the figure below, his prediction is not that $y = 5$ always; instead, he predicts a distribution that is centered around $y = 5$. Similarly, while Hermione believes that $\theta > 0.50$, she puts probability mass on values $y < 5$. The second takeaway is computational. In particular, one can compute the Bayes factor of the unconstrained model ($\mathcal{M}_1$) — in which the parameter $\theta$ is free to vary — against a constrained model ($\mathcal{M}_r$) — in which $\theta$ is order-constrained (e.g., $\theta > 0.50$) — as: In words, this Bayes factor is given by the ratio of the posterior probability of $\theta$ being in line with the restriction compared to the prior probability of $\theta$ being in line with the restriction. You can read the full blog post here. Love affairs and linear differential equations When you can fall for chains of silver, you can fall for chains of gold You can fall for pretty strangers and the promises they hold You promised me everything, you promised me thick and thin, yeah Now you just say "Oh, Romeo, yeah, you know I used to have a scene with him" Differential equations are the sine qua non of modeling how systems change. This blog post provides an introduction to linear differential equations, which admit closed-form solutions, and analyzes the stability of fixed points. 
The key takeaways are that the natural basis of analysis is the basis spanned by the eigenvectors, and that the stability of fixed points depends directly on the eigenvalues. A system with imaginary eigenvalues can exhibit oscillating behaviour, as shown in the figure above. I think I rarely had more fun writing than when writing this blog post. Inspired by Strogatz (1988), it playfully introduces linear differential equations by classifying the types of relationships Romeo and Juliet might find themselves in. While writing it, I also listened to a lot of Dire Straits, Bob Dylan, Daft Punk, and others, whose lyrics decorate the post’s section. You can read the full blog post here. The Fibonacci sequence and linear algebra 1, 1, 2, 3, 5, 8, 13, 21, … The Fibonacci sequence might well be the most widely known mathematical sequence. In this blog post, I discuss how Leonardo Bonacci derived it as a solution to a puzzle about procreating rabbits, and how linear algebra can help us find a closed-form expression of the $n^{\text{th}}$ Fibonacci number. The key insight is to realize that the $n^{\text{th}}$ Fibonacci number can be computed by repeatedly performing matrix multiplications. If one diagonalizes this matrix, changing basis to — again! — the eigenbasis, then the repeated application of this matrix can be expressed as a scalar power, yielding a closed-form expression of the $n^{\text{th}}$ Fibonacci number. That’s a mouthful; you can read the blog post which explains things much better here. Spurious correlations and random walks I was at the Santa Fe Complex Systems Summer School — the experience of a lifetime — when Anton Pichler and Andrea Bacilieri, two economists, told me that two independent random walks can be correlated substantially. I was quite shocked, to be honest. This blog post investigates this issue, concluding that regressing one random walk onto another is nonsensical, that is, leads to an inconsistent parameter estimate. 
As the figure above shows, such spurious correlation also occurs for independent AR(1) processes with increasing autocorrelation $\phi$, even though the resulting estimate is consistent. The key takeaway is therefore to be careful when correlating time-series. You can read the full blog post here. Bayesian modeling using Stan: A case study Model selection is a difficult problem. In Bayesian inference, we may distinguish between two approaches to model selection: a (prior) predictive perspective based on marginal likelihoods, and a (posterior) predictive perspective based on leave-one-out cross-validation. A prior predictive perspective — illustrated in the left part of the figure above — evaluates models based on their predictions about the data actually observed. These predictions are made by combining likelihood and prior. In contrast, a posterior predictive perspective — illustrated in the right panel of the figure above — evaluates models based on their predictions about data that we have not observed. These predictions cannot be directly computed, but can be approximated by combining likelihood and posterior in a leave-one-out cross-validation scheme. The key takeaway of this blog post is to appreciate this distinction, noting that not all Bayesians agree on how to select among models. The post illustrates these two perspectives with a case study: does the relation between practice and reaction time follow a power law or an exponential function? You can read the full blog post here. Two perspectives on regularization Regularization is the process of adding information to an estimation problem so as to avoid extreme estimates. This blog post explores regularization both from a Bayesian and from a classical perspective, using the simplest example possible: estimating the bias of a coin. The key takeaway is the observation that Bayesians have a natural tool for regularization at their disposal: the prior. 
In contrast to the left panel in the figure above, which shows a flat prior, the right panel illustrates that using a weakly informative prior that peaks at $\theta = 0.50$ shifts the resulting posterior distribution towards that value. In classical statistics, one usually uses penalized maximum likelihood approaches — think lasso and ridge regression — to achieve regularization. You can read the full blog post here. Variable selection using Gibbs sampling “Which variables are important?” is a key question in science and statistics. In this blog post, I focus on linear models and discuss a Bayesian solution to this problem using spike-and-slab priors and the Gibbs sampler, a computational method to sample from a joint distribution using only conditional distributions. Parameter estimation is almost always conditional on a specific model. One key takeaway from this blog post is that there is uncertainty associated with the model itself. The approach outlined in the post accounts for this uncertainty by using spike-and-slab priors, yielding posterior distributions not only for parameters but also for models. To incorporate this model uncertainty into parameter estimation, one can average across models; the figure above shows the model-averaged posterior distribution for six variables discussed in the post. You can read the full blog post here. Two properties of the Gaussian distribution The Gaussian distribution is special for a number of reasons. In this blog post, I focus on two such reasons, namely the fact that it is closed under marginalization and conditioning. This means that if you start out with a p-dimensional Gaussian distribution, and you either marginalize over or condition on one of its components, the resulting distribution will again be Gaussian. The figure above illustrates the difference between marginalization and conditioning in the two-dimensional case. 
The left panel shows a bivariate Gaussian distribution with a high correlation $\rho = 0.80$ (blue contour lines). Conditioning means incorporating information, and observing that $X_2 = 2$ shifts the distribution of $X_1$ towards this value (purple line). If we do not observe $X_2$, we can incorporate our uncertainty about its likely values by marginalizing it out. This results in a Gaussian distribution that is centered on zero (black line). The right panel shows that conditioning on $X_2 = 2$ does not change the distribution of $X_1$ in the case of no correlation $\rho = 0$. You can read the full blog post here. Curve fitting and the Gaussian distribution In this blog post, we take a look at the mother of all curve fitting problems — fitting a straight line to a number of points. The figure below shows that one point in the Euclidean plane is insufficient to define a line (left), two points constrain it perfectly (middle), and three is too much (right). In science we usually deal with more than two data points which are corrupted by noise. How do we fit a line to such noisy observations? The method of least squares provides an answer. In addition to an explanation of least squares, a key takeaway of this post is an understanding of the historical context in which least squares arose. Statistics is fascinating in part because of its rich history. On our journey through time we meet Legendre, Gauss, Laplace, and Galton. The latter describes the central limit theorem — one of the most stunning theorems in statistics — in beautifully poetic words: “I know of scarcely anything so apt to impress the imagination as the wonderful form of cosmic order expressed by the “Law of Frequency of Error”. The law would have been personified by the Greeks and deified, if they had known of it. It reigns with serenity and in complete self-effacement, amidst the wildest confusion. The huger the mob, and the greater the apparent anarchy, the more perfect is its sway. 
It is the supreme law of Unreason. Whenever a large sample of chaotic elements are taken in hand and marshalled in the order of their magnitude, an unsuspected and most beautiful form of regularity proves to have been latent all along.” (Galton, 1889, p. 66) You can read the full blog post here. I hope that you enjoyed reading some of these posts at least a quarter as much as I enjoyed writing them. I am committed to making 2020 a successful year of blogging, too. However, I will most likely decrease the output frequency by half, aiming to publish one post every two months. It is a truth universally acknowledged that a person in want of a PhD must be in possession of publications, and so I will have to shift my focus accordingly (at least a little bit). At the same time, I also want to further increase my involvement in the “data for the social good” scene. Life certainly is one complicated optimization problem. I wish you all the best for the new year! I would like to thank Don van den Bergh, Sophia Crüwell, Jonas Haslbeck, Oisín Ryan, Lea Jakob, Quentin Gronau, Nathan Evans, Andrea Bacilieri, and Anton Pichler for helpful comments on (some of) these blog posts.
An introduction to Causal inference
2019-11-30T12:00:00+00:00
https://fabiandablander.com/r/Causal-Inference
<p><em>An extended version of this blog post is available from <a href="https://psyarxiv.com/b3fkw">here</a>.</em></p>
<p>Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewall Wright, Judea Pearl, and others.</p>
<p>We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the “ladder of causal inference”, from association (<em>seeing</em>) to intervention (<em>doing</em>) to counterfactuals (<em>imagining</em>). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the <em>do</em>-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let’s dive in!<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<h1 id="correlation-and-causation">Correlation and Causation</h1>
<p>Messerli (2012) published a paper entitled “Chocolate Consumption, Cognitive Function, and Nobel Laureates” in <em>The New England Journal of Medicine</em> showing a strong positive relationship between chocolate consumption and the number of Nobel Laureates. I have found an even stronger relationship using updated data<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>, as visualized in the figure below.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>Now, except for <a href="https://www.confectionerynews.com/Article/2012/10/11/Chocolate-creates-Nobel-prize-winners-says-study">people in the chocolate business</a>, it would be quite a stretch to suggest that increasing chocolate consumption would increase the number of Nobel Laureates. Correlation does not imply causation because it does not constrain the possible causal relations enough. Hans Reichenbach (1956) formulated the <em>common cause principle</em> which speaks to this fact:</p>
<blockquote>
<p>If two random variables $X$ and $Y$ are statistically dependent ($X \not \perp Y$), then either (a) $X$ causes $Y$, (b) $Y$ causes $X$, or (c) there exists a third variable $Z$ that causes both $X$ and $Y$. Further, $X$ and $Y$ become independent given $Z$, i.e., $X \perp Y \mid Z$.</p>
</blockquote>
<p>In principle, a straightforward way to resolve this uncertainty is to conduct an experiment: we could, for example, force the citizens of Austria to consume more chocolate, and study whether this increases the number of Nobel laureates in the following years. Such experiments are clearly infeasible, but even in less extreme settings it is frequently unethical, impractical, or impossible — think of smoking and lung cancer — to study an association experimentally.</p>
<p>Causal inference provides us with tools that license causal statements even in the absence of a true experiment. This comes with strong assumptions. In the next section, we discuss the “causal hierarchy”.</p>
<h1 id="the-causal-hierarchy">The Causal Hierarchy</h1>
<p>Pearl (2019a) introduces a causal hierarchy with three levels — association, intervention, and counterfactuals — as well as three prototypical actions corresponding to each level — <em>seeing</em>, <em>doing</em>, and <em>imagining</em>. In the remainder of this blog post, we will tackle each level in turn.</p>
<h1 id="seeing">Seeing</h1>
<p>Association is on the most basic level; it makes us see that two or more things are somehow related. Importantly, we need to distinguish between <em>marginal</em> associations and <em>conditional</em> associations. The latter are the key building block of causal inference. The figure below illustrates these two concepts.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>If we look at the whole, aggregated data on the left we see that the continuous variables $X$ and $Y$ are positively correlated: an increase in values for $X$ co-occurs with an increase in values for $Y$. This relation describes the <em>marginal</em> association of $X$ and $Y$ because we do not care whether $Z = 0$ or $Z = 1$. On the other hand, if we condition on the binary variable $Z$, we find that there is no relation: $X \perp Y \mid Z$. (For more on marginal and conditional associations in the case of Gaussian distributions, see <a href="https://fabiandablander.com/statistics/Two-Properties.html">this</a> blog post). In the next section, we discuss a powerful tool that allows us to visualize such dependencies.</p>
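<p>To make the difference between marginal and conditional association concrete, here is a minimal simulation sketch. The data-generating process (a binary $Z$ that shifts both $X$ and $Y$ by the same amount) and the names <code>simulate_fork</code> and <code>pearson</code> are illustrative assumptions, not the code that produced the figure above.</p>

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_fork(n=5000, shift=3.0, seed=1):
    """Hypothetical generator: Z shifts both X and Y, which are otherwise unrelated."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.randint(0, 1)
        x = shift * z + rng.gauss(0, 1)
        y = shift * z + rng.gauss(0, 1)
        rows.append((x, y, z))
    return rows


rows = simulate_fork()

# Marginal association: ignore Z and correlate X with Y in the aggregated data.
marginal = pearson([r[0] for r in rows], [r[1] for r in rows])

# Conditional association: correlate X with Y separately for Z = 0 and Z = 1.
conditional = {
    z: pearson([r[0] for r in rows if r[2] == z],
               [r[1] for r in rows if r[2] == z])
    for z in (0, 1)
}

print("marginal:", round(marginal, 2))
print("conditional:", {z: round(c, 2) for z, c in conditional.items()})
```

<p>Under these assumptions the marginal correlation is clearly positive, while the correlations within each level of $Z$ hover around zero.</p>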
<h2 id="directed-acyclic-graphs">Directed Acyclic Graphs</h2>
<p>We can visualize the statistical dependencies between the three variables using a graph. A graph is a mathematical object that consists of nodes and edges. In the case of <em>Directed Acyclic Graphs</em> (DAGs), these edges are directed and do not form cycles. We take our variables $(X, Y, Z)$ to be nodes in such a DAG and we draw (or omit) edges between these nodes so that the conditional (in)dependence structure in the data is reflected in the graph. For now, let’s focus on the relationship between the three variables. We have seen that $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. It turns out that we can draw <em>three</em> DAGs that encode this fact; these are the first three DAGs in the figure below. $X$ and $Y$ are dependent through $Z$ in these graphs, and conditioning on $Z$ <em>blocks</em> the path between $X$ and $Y$. We state this more formally shortly.</p>
<center>
<img src="../assets/img/Seeing-II.png" align="center" style="padding: 0px 0px 0px 0px;" width="750" height="375" />
</center>
<p>While it is natural to interpret the arrows causally, we do not do so here. For now, the arrows are merely tools that help us describe associations between variables.</p>
<p>The figure above also shows a fourth DAG, which encodes a different set of conditional (in)dependence relations between $X$, $Y$, and $Z$. The figure below illustrates this: looking at the aggregated data we do not find a relation between $X$ and $Y$ — they are <em>marginally independent</em> — but we do find one when looking at the disaggregated data — $X$ and $Y$ are <em>conditionally dependent</em> given $Z$.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>A real-world example might help build intuition: Looking at people who are single and people who are in a relationship together as one group, being attractive ($X$) and being intelligent ($Y$) are two independent traits. This is what we see in the left panel in the figure above. Let’s make the reasonable assumption that both being attractive and being intelligent are positively related to being in a relationship. What does this imply? First, it implies that, on average, single people are less attractive and less intelligent (see red data points). Second, and perhaps counter-intuitively, it implies that in the population of single people (and people in a relationship, respectively), being attractive and being intelligent are <em>negatively correlated</em>. After all, if the handsome person you met at the bar were also intelligent, then he would most likely be in a relationship!</p>
<p>In this example, visualized in the fourth DAG, $Z$ is commonly called a <em>collider</em>. Suppose we want to estimate the association between $X$ and $Y$ in the whole population. Conditioning on a collider (for example, by only analyzing data from people who are not in a relationship) while computing the association between $X$ and $Y$ will lead to a different estimate, and the induced bias is known as <em>collider bias</em>. It is a serious issue not only in dating, but also for example in medicine.</p>
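<p>Collider bias is easy to reproduce in a small simulation. The sketch below assumes that attractiveness and intelligence are independent standard normal variables and that a person is in a relationship whenever the sum of both traits plus some noise exceeds zero; this rule and the function names are hypothetical and only serve to illustrate the mechanism.</p>

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_dating(n=20000, seed=2):
    """Hypothetical collider structure: attractive -> relationship <- intelligent."""
    rng = random.Random(seed)
    people = []
    for _ in range(n):
        attractive = rng.gauss(0, 1)
        intelligent = rng.gauss(0, 1)
        in_relationship = attractive + intelligent + rng.gauss(0, 1) > 0
        people.append((attractive, intelligent, in_relationship))
    return people


def correlation(people):
    return pearson([p[0] for p in people], [p[1] for p in people])


people = simulate_dating()
overall = correlation(people)                              # whole population
singles = correlation([p for p in people if not p[2]])     # conditioning on the collider
couples = correlation([p for p in people if p[2]])

print("overall:", round(overall, 2))
print("singles:", round(singles, 2), "couples:", round(couples, 2))
```

<p>In the whole population the two traits are essentially uncorrelated, whereas within singles and within couples the correlation turns negative, which is the pattern described above.</p>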
<p>The simple graphs shown above are the building blocks of more complicated graphs. In the next section, we describe a tool that can help us find (conditional) independencies between sets of variables.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h2 id="d-separation">$d$-separation</h2>
<p>For large graphs, it is not obvious how to conclude that two nodes are (conditionally) independent. <em>d</em>-separation is a tool that allows us to check this algorithmically. We need to define some concepts:</p>
<ul>
<li>A <em>path</em> from $X$ to $Y$ is a sequence of nodes and edges such that the start and end nodes are $X$ and $Y$, respectively.</li>
<li>A conditioning set $\mathcal{L}$ is the set of nodes we condition on (it can be empty).</li>
<li>Conditioning on a non-collider along a path blocks that path.</li>
<li>A collider along a path blocks that path. However, conditioning on a collider (or any of its descendants) unblocks that path.</li>
</ul>
<p>With these definitions out of the way, we call two nodes $X$ and $Y$ $d$-separated by $\mathcal{L}$ if conditioning on all members in $\mathcal{L}$ blocks all paths between the two nodes.</p>
<p>If this is your first encounter with $d$-separation, then this is a lot to wrap your head around. To get some practice, look at the graph on the left side. First, note that there are no <em>marginal</em> independencies; this means that without conditioning or blocking nodes, any two nodes are connected by a path. For example, there is a path going from $X$ to $Y$ through $Z$, and there is a path from $V$ to $U$ going through $Y$ and $W$.</p>
<div style="float: left;">
<center>
<img src="../assets/img/Large-DAG.png" align="center" style="margin-right: 10px;" width="350" height="125" />
</center>
</div>
<p>However, there are a number of <em>conditional</em> independencies. For example, $X$ and $Y$ are conditionally independent given $Z$. Why? There are two paths from $X$ to $Y$: one through $Z$ and one through $W$. However, since $W$ is a collider on the path from $X$ to $Y$, the path is already blocked. The only unblocked path from $X$ to $Y$ is through $Z$, and conditioning on it therefore blocks all remaining open paths. Additionally conditioning on $W$ would unblock one path, and $X$ and $Y$ would again be associated.</p>
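<p>The reasoning above can be sketched as a small program that enumerates all paths between two nodes and checks whether each one is blocked. This is a brute-force illustration that is only suitable for small graphs, not an efficient algorithm. The edge list contains only the edges named in the text; the node $V$ is left out because the direction of its edge is not stated here, and all function names are hypothetical.</p>

```python
# Directed edges (parent, child) named in the text; V is omitted (see lead-in).
EDGES = [("X", "Z"), ("Z", "Y"), ("X", "W"), ("Y", "W"), ("W", "U")]


def descendants(node, edges):
    """All nodes reachable from `node` by following directed edges."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found


def simple_paths(start, end, edges):
    """All paths between start and end in the undirected skeleton, without repeated nodes."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    paths, stack = [], [[start]]
    while stack:
        path = stack.pop()
        if path[-1] == end:
            paths.append(path)
            continue
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                stack.append(path + [nxt])
    return paths


def is_blocked(path, conditioning, edges):
    """Apply the blocking rules to every inner node of the path."""
    edge_set = set(edges)
    for left, mid, right in zip(path, path[1:], path[2:]):
        is_collider = (left, mid) in edge_set and (right, mid) in edge_set
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is conditioned on.
            opened = mid in conditioning or bool(descendants(mid, edges) & conditioning)
            if not opened:
                return True
        elif mid in conditioning:
            # Conditioning on a non-collider blocks the path.
            return True
    return False


def d_separated(x, y, conditioning, edges=EDGES):
    conditioning = set(conditioning)
    return all(is_blocked(p, conditioning, edges) for p in simple_paths(x, y, edges))


print(d_separated("X", "Y", []))          # no conditioning
print(d_separated("X", "Y", ["Z"]))       # conditioning on Z
print(d_separated("X", "Y", ["Z", "W"]))  # additionally conditioning on the collider W
```

<p>For this edge list, $X$ and $Y$ are $d$-separated by $Z$ alone, but not by the empty set and not once the collider $W$ (or its descendant $U$) is added to the conditioning set.</p>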
<p>So far, we have implicitly assumed that conditional (in)dependencies in the graph correspond to conditional (in)dependencies between variables. We make this assumption explicit now. In particular, note that <em>d</em>-separation provides us with an independence model $\perp_{\mathcal{G}}$ defined on graphs. To connect this to our standard probabilistic independence model $\perp_{\mathcal{P}}$ defined on random variables, we assume the following <em>Markov property</em>:</p>
<script type="math/tex; mode=display">X \perp_{\mathcal{G}} Y \mid Z \implies X \perp_{\mathcal{P}} Y \mid Z \enspace .</script>
<p>In words, we assume that if the nodes $X$ and $Y$ are <em>d</em>-separated by $Z$ in the graph $\mathcal{G}$, the corresponding random variables $X$ and $Y$ are conditionally independent given $Z$. This implies that all conditional independencies encoded in the graph also hold in the data.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> Moreover, the statement above implies (and is implied by) the following factorization:</p>
<script type="math/tex; mode=display">p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,</script>
<p>where $\text{pa}^{\mathcal{G}}(X_i)$ denotes the parents of the node $X_i$ in graph $\mathcal{G}$ (see Peters, Janzing, & Schölkopf, 2017, p. 101). A node is a parent of another node if it has an outgoing arrow to that node; for example, $X$ is a parent of $Z$ and $W$ in the graph above. The above factorization implies that a node $X$ is independent of its non-descendants given its parents.</p>
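<p>As a short worked example, consider two of the three-node DAGs from the previous section. For the fork $X \leftarrow Z \rightarrow Y$, the factorization reads $p(X, Y, Z) = p(Z) \, p(X \mid Z) \, p(Y \mid Z)$; for the collider $X \rightarrow Z \leftarrow Y$, it reads $p(X, Y, Z) = p(X) \, p(Y) \, p(Z \mid X, Y)$. Dividing by $p(Z)$ in the first case, and summing over $Z$ in the second, recovers the (in)dependencies we found earlier:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{fork:} \quad p(X, Y \mid Z) &= \frac{p(Z) \, p(X \mid Z) \, p(Y \mid Z)}{p(Z)} = p(X \mid Z) \, p(Y \mid Z) \\[.5em]
\text{collider:} \quad p(X, Y) &= \sum_{z} p(X) \, p(Y) \, p(Z = z \mid X, Y) = p(X) \, p(Y) \enspace .
\end{aligned} %]]></script>
<p>The first line says that $X$ and $Y$ are conditionally independent given $Z$ in the fork; the second line says that $X$ and $Y$ are marginally independent in the collider structure.</p>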
<p><em>d</em>-separation is an extremely powerful tool. Until now, however, we have only looked at DAGs to visualize (conditional) independencies. In the next section, we go beyond <em>seeing</em> to <em>doing</em>.</p>
<h1 id="doing">Doing</h1>
<p>We do not merely want to see the world, but also change it. From this section on, we are willing to interpret DAGs causally. As Dawid (2009a) warns, this is a serious step. In merely describing conditional independencies — <em>seeing</em> — the arrows in the DAG played a somewhat minor role, being nothing but “incidental construction features supporting the $d$-separation semantics” (Dawid, 2009a, p. 66). In this section, we endow the DAG with a causal meaning and interpret the arrows as denoting <em>direct causal effects</em>.</p>
<p>What is a causal effect? Following Pearl and others, we take an <em>interventionist</em> position and say that a variable $X$ has a causal influence on $Y$ if changing $X$ leads to changes in $Y$. This position is a very useful one in practice, but not everybody agrees with it (e.g., Cartwright, 2007).</p>
<p>The figure below shows the observational DAGs from above (top row) as well as the manipulated DAGs (bottom row) where we have intervened on the variable $X$, that is, set the value of the random variable $X$ to a constant $x$. Note that setting the value of $X$ cuts all incoming causal arrows since its value is thereby determined only by the intervention, not by any other factors.</p>
<center>
<img src="../assets/img/Seeing-vs-Doing-II.png" align="center" style="padding: 00px 00px 00px 00px;" width="750" height="500" />
</center>
<p>As is easily verified with $d$-separation, the first three graphs in the top row encode the same conditional independence structure. This implies that we cannot distinguish them using only observational data. Interpreting the edges causally, however, we see that the DAGs have a starkly different interpretation. The bottom row makes this apparent by showing the result of an intervention on $X$. In the leftmost causal DAG, $Z$ is on the causal path from $X$ to $Y$, and intervening on $X$ therefore influences $Y$ through $Z$. In the DAG next to it, $Z$ is on the causal path from $Y$ to $X$, and so intervening on $X$ does not influence $Y$. In the third DAG, $Z$ is a common cause and — since there is no other path from $X$ to $Y$ — intervening on $X$ does not influence $Y$. For the collider structure in the rightmost DAG, intervening on $X$ does not influence $Y$ because there is no unblocked path from $X$ to $Y$.</p>
<p>To formalize the distinction between seeing and doing, Pearl introduced the <em>do</em>-operator. While $p(Y \mid X = x)$ denotes the <em>observational</em> distribution, which corresponds to the process of seeing, $p(Y \mid do(X = x))$ denotes the <em>interventional</em> distribution, which corresponds to the process of doing. The former describes what values $Y$ would likely take on when $X$ <em>happened to be</em> $x$, while the latter describes what values $Y$ would likely take on if $X$ <em>were set to</em> $x$.</p>
<h2 id="computing-causal-effects">Computing causal effects</h2>
<p>$p(Y \mid do(X = x))$ describes the causal effect of $X$ on $Y$, but how do we compute it? Actually <em>doing</em> the intervention might be infeasible or unethical — side-stepping actual interventions and still getting at causal effects is the whole point of this approach to causal inference. We want to learn causal effects from observational data, and so all we have is the observational DAG. The causal quantity, however, is defined on the manipulated DAG. We need to build a bridge between the observational DAG and the manipulated DAG, and we do this by making two assumptions.</p>
<p>First, we assume that <em>interventions are local</em>. This means that if I set $X = x$, then this only influences the variable $X$, with no other direct influence on any other variable. Of course, intervening on $X$ will influence other variables, but only through $X$, not directly through us intervening. In colloquial terms, we do not have a “fat hand”, but act like a surgeon precisely targeting only a very specific part of the DAG; we say that the DAG is composed of <em>modular</em> parts. We can encode this using the factorization property above:</p>
<script type="math/tex; mode=display">p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,</script>
<p>which we now interpret causally. The factors in the product are sometimes called <em>causal Markov kernels</em>; they constitute the modular parts of the system.</p>
<p>Second, we assume that the mechanisms by which variables interact do not change through interventions; that is, the mechanism by which a cause brings about its effects does not change whether this occurs naturally or by intervention (see e.g., Pearl, Glymour, & Jewell, 2016, p. 56).</p>
<p>With these two assumptions in hand, further note that $p(Y \mid do(X = x))$ can be understood as the <em>observational</em> distribution in the manipulated DAG — $p_m(Y \mid X = x)$ — that is, the DAG where we set $X = x$. This is because after <em>doing</em> the intervention (which catapults us into the manipulated DAG), all that is left for us to do is to <em>see</em> its effect. Observe that the leftmost and rightmost DAGs above remain the same under intervention on $X$, and so the interventional distribution $p(Y \mid do(X = x))$ is just the conditional distribution $p(Y \mid X = x)$. The middle DAGs require a bit more work:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(Y = y \mid do(X = x)) &= p_{m}(Y = y \mid X = x) \\[.5em]
&= \sum_{z} p_{m}(Y = y, Z = z \mid X = x) \\[.5em]
&= \sum_{z} p_{m}(Y = y \mid X = x, Z = z) \, p_m(Z = z) \\[.5em]
&= \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \enspace .
\end{aligned} %]]></script>
<p>The first equality follows by definition. The second and third equalities follow from the <em>sum</em> and <em>product</em> rule of probability, together with the fact that $X$ and $Z$ are independent in the manipulated DAG (the intervention has cut the arrow from $Z$ into $X$), so that $p_m(Z = z \mid X = x) = p_m(Z = z)$. The last line follows from the assumption that the mechanism through which $X$ influences $Y$ is independent of whether we set $X$ or whether $X$ naturally occurs, that is, $p_{m}(Y = y \mid X = x, Z = z) = p(Y = y \mid X = x, Z = z)$, and the assumption that interventions are local, that is, $p_m(Z = z) = p(Z = z)$. Thus, the interventional distribution we care about is equal to the conditional distribution of $Y$ given $X$ when we adjust for $Z$. Graphically speaking, this blocks the path $X \leftarrow Z \leftarrow Y$ in the left middle graph and the path $X \leftarrow Z \rightarrow Y$ in the right middle graph. If there were a path $X \rightarrow Y$ in these two latter graphs, and if we did not adjust for $Z$, then the causal effect of $X$ on $Y$ would be <em>confounded</em>. For these simple DAGs, however, it is already clear from the fact that $X$ is independent of $Y$ given $Z$ that $X$ cannot have a causal effect on $Y$. In the next section, we study a more complicated graph and look at confounding more closely.</p>
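<p>The adjustment formula can be checked numerically on a toy example. The sketch below assumes the fork $X \leftarrow Z \rightarrow Y$ with binary variables and made-up conditional probability tables; the numbers and function names (<code>seeing</code>, <code>doing</code>) are illustrative assumptions. Because $Y$ depends only on $Z$, the interventional probability should not change with $x$, even though the observational one does.</p>

```python
from itertools import product

# Hypothetical conditional probability tables for the fork X <- Z -> Y.
P_Z = {0: 0.5, 1: 0.5}            # p(Z = z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}   # p(X = 1 | Z = z)
P_Y1_GIVEN_Z = {0: 0.3, 1: 0.9}   # p(Y = 1 | Z = z); Y does not depend on X


def bernoulli(p_one, value):
    return p_one if value == 1 else 1 - p_one


def joint(x, y, z):
    """Observational joint distribution built from the DAG factorization."""
    return P_Z[z] * bernoulli(P_X1_GIVEN_Z[z], x) * bernoulli(P_Y1_GIVEN_Z[z], y)


def prob(**fixed):
    """Probability of the event given by keyword arguments, e.g. prob(x=1, z=0)."""
    total = 0.0
    for x, y, z in product((0, 1), repeat=3):
        values = {"x": x, "y": y, "z": z}
        if all(values[name] == v for name, v in fixed.items()):
            total += joint(x, y, z)
    return total


def seeing(y, x):
    """Observational distribution p(Y = y | X = x)."""
    return prob(y=y, x=x) / prob(x=x)


def doing(y, x):
    """Adjustment formula: sum over z of p(Y = y | X = x, Z = z) p(Z = z)."""
    return sum(prob(y=y, x=x, z=z) / prob(x=x, z=z) * prob(z=z) for z in (0, 1))


print("seeing:", round(seeing(1, 1), 2), round(seeing(1, 0), 2))
print("doing: ", round(doing(1, 1), 2), round(doing(1, 0), 2))
```

<p>With these tables, $p(Y = 1 \mid X = 1) = 0.78$ and $p(Y = 1 \mid X = 0) = 0.42$, whereas $p(Y = 1 \mid do(X = x)) = 0.60$ for both values of $x$: seeing $X$ is informative about $Y$, but setting $X$ changes nothing.</p>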
<h2 id="confounding">Confounding</h2>
<p>Confounding has been given various definitions over the decades, but usually denotes the situation where a (possibly unobserved) common cause obscures the causal relationship between two or more variables. Here, we are slightly more precise and call a causal effect of $X$ on $Y$ confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$, which also implies that collider bias is a type of confounding. This occurred in the middle two DAGs in the example above, as well as in the chocolate consumption and Nobel Laureates example at the beginning of the blog post. Confounding is the bane of observational data analysis. Helpfully, causal DAGs provide us with a tool to describe multivariate relations between variables. Once we have stated our assumptions clearly, the <em>do</em>-calculus further provides us with a means to know what variables we need to adjust for so that causal effects are unconfounded.</p>
<p>We follow Pearl, Glymour, & Jewell (2016, p. 61) and define the <em>backdoor criterion</em>:</p>
<blockquote>
<p>Given two nodes $X$ and $Y$, an adjustment set $\mathcal{L}$ fulfills the backdoor criterion if no member in $\mathcal{L}$ is a descendant of $X$ and members in $\mathcal{L}$ block all paths between $X$ and $Y$ that contain an arrow into $X$. Adjusting for $\mathcal{L}$ thus yields the causal effect of $X$ on $Y$.</p>
</blockquote>
<p>The key observation is that this (a) blocks all spurious, that is, non-causal paths between $X$ and $Y$, (b) leaves all directed paths from $X$ to $Y$ unblocked, and (c) creates no spurious paths.</p>
<div style="float: left;">
<center>
<img src="../assets/img/Large-DAG.png" align="center" style="margin-right: 10px;" width="350" height="125" />
</center>
</div>
<p>To see this in action, let’s again look at the DAG on the left. The causal effect of $Z$ on $U$ is confounded by $X$, because in addition to the legitimate causal path $Z \rightarrow Y \rightarrow W \rightarrow U$, there is also an unblocked path $Z \leftarrow X \rightarrow W \rightarrow U$ which confounds the causal effect. The backdoor criterion would have us condition on $X$, which blocks the spurious path and renders the causal effect of $Z$ on $U$ unconfounded. Note that conditioning on $W$ would also block this spurious path; however, it would also block the causal path $Z \rightarrow Y \rightarrow W \rightarrow U$.</p>
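<p>A simulation can illustrate what adjusting for the right and the wrong variable does. The sketch below assumes a linear Gaussian version of the DAG in which every arrow named in the text has coefficient one (so the true effect of $Z$ on $U$ is $1 \times 1 \times 1 = 1$); this parametrization, the omission of $V$, and the helper names are assumptions made purely for illustration. Adjustment for a single covariate is done by regressing on residuals, which gives the same coefficient as a multiple regression.</p>

```python
import random


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def residuals(xs, ys):
    """Residuals of ys after regressing on xs (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = slope(xs, ys)
    return [(y - my) - b * (x - mx) for x, y in zip(xs, ys)]


def adjusted_slope(cause, effect, covariate):
    """Slope of effect on cause after partialling out one covariate."""
    return slope(residuals(covariate, cause), residuals(covariate, effect))


def simulate(n=20000, seed=3):
    """Hypothetical linear SCM following the arrows named in the text."""
    rng = random.Random(seed)
    data = {name: [] for name in "XZYWU"}
    for _ in range(n):
        x = rng.gauss(0, 1)
        z = x + rng.gauss(0, 1)        # X -> Z
        y = z + rng.gauss(0, 1)        # Z -> Y
        w = x + y + rng.gauss(0, 1)    # X -> W <- Y
        u = w + rng.gauss(0, 1)        # W -> U
        for name, value in zip("XZYWU", (x, z, y, w, u)):
            data[name].append(value)
    return data


data = simulate()
naive = slope(data["Z"], data["U"])                               # no adjustment
adjusted_for_x = adjusted_slope(data["Z"], data["U"], data["X"])  # backdoor adjustment
adjusted_for_w = adjusted_slope(data["Z"], data["U"], data["W"])  # blocks the causal path

print("naive:", round(naive, 2))
print("adjusted for X:", round(adjusted_for_x, 2))
print("adjusted for W:", round(adjusted_for_w, 2))
```

<p>Under these assumptions the unadjusted slope overstates the effect (it is close to $1.5$), adjusting for $X$ recovers a value close to the true effect of $1$, and adjusting for $W$ pushes the estimate towards zero because it also blocks the causal path.</p>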
<p>Before moving on, let’s catch a quick breath. We have already discussed a number of very important concepts. At the lowest level of the causal hierarchy — association — we have discovered DAGs and $d$-separation as a powerful tool to reason about conditional (in)dependencies between variables. Moving to intervention, the second level of the causal hierarchy, we have satisfied our need to interpret the arrows in a DAG causally. Doing so required strong assumptions, but it allowed us to go beyond <em>seeing</em> and model the outcome of interventions. This hopefully clarified the notion of confounding. In particular, collider bias is a type of confounding, which has important practical implications: we should not blindly enter all variables into a regression in order to “control” for them, but think carefully about what the underlying causal DAG could look like. Otherwise, we might induce spurious associations.</p>
<p>The concepts from causal inference can help us understand methodological phenomena that have been discussed for decades. In the next section, we apply the concepts we have seen so far to make sense of one such phenomenon: <em>Simpson’s Paradox</em>.</p>
<h1 id="example-application-simpsons-paradox">Example Application: Simpson’s Paradox</h1>
<p>This section follows the example given in Pearl, Glymour, & Jewell (2016, Ch. 1) with slight modifications. Suppose you observe $N = 700$ patients who either <em>choose</em> to take a drug or not; note that this is not a randomized controlled trial. The table below shows the number of recovered patients split across sex.</p>
<center>
<img src="../assets/img/Simpsons-Data-I.png" align="center" style="margin-top: 20px; margin-bottom: 20px;" width="600" height="400" />
</center>
<p>We observe that a higher proportion of men as well as of women recover when taking the drug (93% and 73%) compared to when not taking the drug (87% and 69%). And yet, when taken together, <em>fewer</em> patients who took the drug recovered (78%) compared to patients who did not take the drug (83%). This is puzzling — should a doctor prescribe the drug or not?</p>
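<p>The reversal can be reproduced from the raw counts. In the sketch below, the group sizes and the recovery counts under the drug are the ones used in the calculations later in this section; the recovery counts without the drug (234 men and 55 women) are an assumption taken from the example in Pearl, Glymour, & Jewell (2016) and are consistent with the 87% and 69% reported above.</p>

```python
# (recovered, total) per sex and treatment.
# The "no drug" recovery counts (234 and 55) are assumed, see the lead-in.
COUNTS = {
    ("men", "drug"): (81, 87),
    ("men", "no drug"): (234, 270),
    ("women", "drug"): (192, 263),
    ("women", "no drug"): (55, 80),
}


def recovery_rate(groups):
    """Pooled recovery rate across the given (sex, treatment) groups."""
    recovered = sum(COUNTS[g][0] for g in groups)
    total = sum(COUNTS[g][1] for g in groups)
    return recovered / total


stratified = {
    (sex, treatment): recovery_rate([(sex, treatment)])
    for sex in ("men", "women")
    for treatment in ("drug", "no drug")
}
aggregated = {
    treatment: recovery_rate([("men", treatment), ("women", treatment)])
    for treatment in ("drug", "no drug")
}

for key, rate in stratified.items():
    print(key, round(rate, 2))
for key, rate in aggregated.items():
    print("all patients,", key, round(rate, 2))
```

<p>Within each sex the recovery rate is higher with the drug, yet the pooled rate is lower with the drug, because women, who recover less often overall, make up most of the patients who took it.</p>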
<p>To answer this question, we need to compute the causal effect that taking the drug has on the probability of recovery. As a first step, we draw the causal DAG. Suppose we know that women are more likely to take the drug, that being a woman has an effect on recovery more generally, and that the drug has an effect on recovery. Moreover, we know that the <em>treatment cannot cause sex</em>. This is a trivial yet crucial observation — it is impossible to express this in purely statistical language. Causal DAGs provide us with a tool to make such an assumption explicit; the graph below makes explicit that sex ($S$) is a common cause of both drug taking ($D$) and recovery ($R$). We denote $S = 1$ as being female, $D = 1$ as having chosen the drug, and $R = 1$ as having recovered. The left DAG is observational while the right DAG indicates the intervention $do(D = d)$, that is, forcing every patient to either take the drug ($d = 1$) or to not take the drug ($d = 0$).</p>
<center>
<img src="../assets/img/Simpsons-DAG-I.png" align="center" style="margin-right: 10px;" width="600" height="300" />
</center>
<p>We are interested in the probability of recovery if we were to force everybody to take, or not take, the drug; we call the difference between these two probabilities the <em>average causal effect</em>. This is key: the <em>do</em>-operator is about populations, not individuals. Using it, we cannot make statements that pertain to the recovery of an individual patient; we can only refer to the probability of recovery as defined on populations of patients. We will discuss <em>individual causal effects</em> in the section on counterfactuals at the end of the blog post.</p>
<p>Computing the average causal effect requires knowledge about the interventional distributions $p(R \mid do(D = 0))$ and $p(R \mid do(D = 1))$. As discussed above, these correspond to the conditional distribution in the manipulated DAG which is shown above on the right. The backdoor criterion tells us that the conditional distribution in the observational DAG will correspond to the interventional distribution when blocking the spurious path $D \leftarrow S \rightarrow R$. Using the adjustment formula we have derived above, we expand:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(R = 1 \mid do(D = 1)) &= \sum_{s} p(R = 1\mid D = 1, S = s) \, p(S = s) \\[.5em]
&= p(R = 1\mid D = 1, S = 0) \, p(S = 0) + p(R = 1\mid D = 1, S = 1) \, p(S = 1) \\[.5em]
&= \frac{81}{87} \times \frac{87 + 270}{700} + \frac{192}{263} \times \frac{263 + 80}{700} \\[.5em]
&\approx 0.833 \enspace .
\end{aligned} %]]></script>
<p>In words, we first compute the benefit of taking the drug separately for men and women, and then we average the result by weighting it with the fraction of men and women in the population. This tells us that, if we force everybody to take the drug, about $83\%$ of people will recover. We can similarly compute the probability of recovery given we force all patients to not take the drug:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(R = 1\mid do(D = 0)) &= \sum_{s} p(R = 1\mid D = 0, S = s) \, p(S = s) \\[.5em]
&= p(R = 1\mid D = 0, S = 0) \, p(S = 0) + p(R = 1\mid D = 0, S = 1) \, p(S = 1) \\[.5em]
&= \frac{234}{270} \times \frac{87 + 270}{700} + \frac{55}{80} \times \frac{263 + 80}{700} \\[.5em]
&\approx 0.779 \enspace .
\end{aligned} %]]></script>
<p>Therefore, taking the drug does indeed have a positive effect on recovery on average, and the doctor should prescribe the drug.</p>
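<p>The adjustment formula is easy to check numerically. The following is a minimal sketch in R, not part of the original analysis: the object names (<code>recovered</code>, <code>total</code>, <code>p_do</code>) are illustrative, and the counts are the ones implied by the recovery rates quoted above (93%, 87%, 73%, and 69%).</p>

```r
# Recovered patients and group sizes, by sex (rows) and drug choice (columns)
recovered <- matrix(
  c(234,  81,
     55, 192),
  nrow = 2, byrow = TRUE,
  dimnames = list(S = c("men", "women"), D = c("no_drug", "drug"))
)
total <- matrix(
  c(270,  87,
     80, 263),
  nrow = 2, byrow = TRUE,
  dimnames = dimnames(recovered)
)

# p(R = 1 | D = d, S = s): recovery rate within each sex-by-drug cell
p_r_given_ds <- recovered / total

# p(S = s): share of men and women among all 700 patients
p_s <- rowSums(total) / sum(total)

# Backdoor adjustment: sum_s p(R = 1 | D = d, S = s) p(S = s)
# (p_s is recycled down the columns, so each row is weighted by its own p(S = s))
p_do <- colSums(p_r_given_ds * p_s)

# Average causal effect of taking the drug
ace <- unname(p_do["drug"] - p_do["no_drug"])

round(p_do, 3)
round(ace, 3)
```

<p>The sex-adjusted recovery probability is roughly 0.83 under $do(D = 1)$ and roughly 0.78 under $do(D = 0)$, so the adjusted effect is positive even though the aggregate comparison points the other way.</p>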
<p>Note that this conclusion heavily depended on the causal graph. While graphs are wonderful tools in that they make our assumptions explicit, these assumptions are — of course — not at all guaranteed to be correct. These assumptions are strong, stating that the graph must encode all causal relations between variables, and that there is no unmeasured confounding, something we can never guarantee in observational data.</p>
<p>Let’s look at a different example but with the exact same data. In particular, instead of the variable sex we look at the <em>post-treatment</em> variable blood pressure. This means we have measured blood pressure after the patients have taken the drug. Should a doctor prescribe the drug or not?</p>
<center>
<img src="../assets/img/Simpsons-Data-II.png" align="center" style="margin-top: 20px; margin-bottom: 20px;" width="700" height="550" />
</center>
<p>Since blood pressure is a post-treatment variable, it cannot influence a patient’s decision to take the drug or not. We draw the following causal DAG, which makes clear that the drug has an indirect effect on recovery through blood pressure, in addition to having a direct causal effect.<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
<center>
<img src="../assets/img/Simpsons-DAG-II.png" align="center" style="margin-right: 10px;" width="600" height="300" />
</center>
<p>From this DAG, we find that the causal effect of $D$ on $R$ is unconfounded. Therefore, the two causal quantities of interest are given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(R = 1 \mid do(D = 1)) &= p(R = 1 \mid D = 1) = 0.78 \\[.5em]
p(R = 1 \mid do(D = 0)) &= p(R = 1 \mid D = 0) = 0.83 \enspace .
\end{aligned} %]]></script>
<p>This means that the drug is indeed harmful: in the general population (combined data), it has a negative effect. To see how this can square with the stratified data, suppose that the drug has a direct positive effect on recovery, but an indirect negative effect through blood pressure. If we only look at patients with a particular blood pressure, then only the drug’s direct positive effect on recovery remains. However, since the drug does influence recovery negatively through blood pressure, it would be misleading to take the association between $D$ and $R$ conditional on blood pressure ($Z$) as our estimate for the causal effect. In contrast to the previous example, using the aggregate data is the correct way to analyze these data in order to estimate the average causal effect.</p>
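<p>Because no adjustment is needed in this second DAG, the interventional probabilities are simply the aggregate recovery rates. As a small illustrative sketch (object names are again hypothetical, and the counts are the same ones used in the first example, summed over the two strata):</p>

```r
# Aggregate over the stratifying variable: recovered patients and group sizes
recovered <- c(no_drug = 234 + 55, drug = 81 + 192)
total     <- c(no_drug = 270 + 80, drug = 87 + 263)

# With no confounding, p(R = 1 | do(D = d)) = p(R = 1 | D = d)
p_r_given_d <- recovered / total

# Average causal effect of the drug, now negative
ace <- unname(p_r_given_d["drug"] - p_r_given_d["no_drug"])

round(p_r_given_d, 2)
round(ace, 3)
```

<p>The same 700 patients therefore support opposite recommendations, depending only on which causal graph we believe generated the data.</p>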
<p>So far, our treatment has been entirely model-agnostic. In the next section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. This will unify the previous two levels of the causal hierarchy — <em>seeing</em> and <em>doing</em> — as well as open up the third and final level: counterfactuals.</p>
<h1 id="structural-causal-models">Structural Causal Models</h1>
<p>In this section, we discuss Structural Causal Models (SCM) as the fundamental building block of causal inference. SCMs relate causal and probabilistic statements. As an example, we specify:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
X &:= \epsilon_X \\[.5em]
Y &:= f(X, \epsilon_Y) \enspace .
\end{aligned} %]]></script>
<p>$X$ is a direct cause of $Y$, which it influences through the function $f()$, and the noise variables $\epsilon_X$ and $\epsilon_Y$ are assumed to be independent. In an SCM, we take each equation to be a causal statement, and we stress this by using the assignment symbol $:=$ instead of the equality sign $=$. Note that this is in stark contrast to standard regression models; here, we explicitly state our causal assumptions.</p>
<p>As we will see below, Structural Causal Models imply observational distributions (<em>seeing</em>), interventional distributions (<em>doing</em>), as well as counterfactuals (<em>imagining</em>). Thus, they can be seen as the fundamental building block of this approach to causal inference. In the following, we restrict the class of Structural Causal Models by allowing only linear relationships between variables and assuming independent Gaussian error terms.<sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup> As an example, take the following SCM (Peters, Janzing, & Schölkopf, 2017, p. 90):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
X &:= \epsilon_X \\[.5em]
Y &:= X + \epsilon_Y \\[.5em]
Z &:= Y + \epsilon_Z \enspace ,
\end{aligned} %]]></script>
<p>where $\epsilon_X, \epsilon_Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1)$ and $\epsilon_Z \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 0.1)$. Again, each line explicates the causal link between variables. For example, we assume that $X$ has a direct causal effect on $Y$, that this effect is linear, and that it is obscured by independent Gaussian noise.</p>
<p>The assumption of Gaussian errors induces a multivariate Gaussian distribution on $(X, Y, Z)$ whose independence structure is visualized in the leftmost DAG below. The middle DAG shows an intervention on $Z$, while the rightmost DAG shows an intervention on $X$. Recall that, as discussed above, intervening on a variable cuts all incoming arrows.</p>
<center>
<img src="../assets/img/Prediction-vs-Intervention.png" align="center" style="margin-right: 10px;" width="700" height="400" />
</center>
<p>At the first level of the causal hierarchy (association), we might ask ourselves: does $X$ or $Z$ predict $Y$ better? To illustrate the answer for our example, we simulate $n = 1000$ observations from the Structural Causal Model:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1000</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span></code></pre></figure>
<p>The figure below shows that $Y$ has a much stronger association with $Z$ than with $X$; this is because the standard deviation of the error $\epsilon_Z$ is only a tenth of the standard deviation of the error $\epsilon_Y$, so that $Z$ is an almost noiseless copy of $Y$, whereas $Y$ is a much noisier function of $X$. For prediction, therefore, $Z$ is the more relevant variable.</p>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
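<p>The strength of these associations can also be quantified directly. The sketch below repeats the simulation so that it is self-contained and compares the sample correlations with the values implied by the SCM: since $\text{Var}(Y) = 2$ and $\text{Cov}(X, Y) = 1$, we have $\text{Cor}(X, Y) = 1 / \sqrt{2} \approx 0.71$, and since $\text{Var}(Z) = 2.01$ and $\text{Cov}(Y, Z) = 2$, we have $\text{Cor}(Y, Z) = \sqrt{2 / 2.01} \approx 0.998$. The object names are illustrative.</p>

```r
set.seed(1)
n <- 1000

# Simulate from the SCM: X -> Y -> Z
x <- rnorm(n, 0, 1)
y <- x + rnorm(n, 0, 1)
z <- y + rnorm(n, 0, 0.1)

# Sample correlations of Y with its cause (X) and with its effect (Z)
cor_xy <- cor(x, y)
cor_zy <- cor(z, y)

# Correlations implied by the SCM
cor_xy_theory <- 1 / sqrt(2)
cor_zy_theory <- sqrt(2 / 2.01)

round(c(cor_xy = cor_xy, cor_zy = cor_zy), 3)
round(c(cor_xy_theory = cor_xy_theory, cor_zy_theory = cor_zy_theory), 3)
```

<p>The effect $Z$ predicts $Y$ far better than the cause $X$ does, which is exactly why predictive strength alone says nothing about causal direction.</p>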
<p>But does $Z$ actually have a causal effect on $Y$? This is a question about intervention, which is squarely located at the second level of the causal hierarchy. With the knowledge of the underlying Structural Causal Model, we can easily simulate interventions in R and visualize their outcomes:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># Simulate data from the SCM where do(Z = z)</span><span class="w">
</span><span class="n">intervene_z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">z</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Simulate data from the SCM where do(X = x)</span><span class="w">
</span><span class="n">intervene_x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">z</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0.1</span><span class="p">)</span><span class="w">
</span><span class="n">cbind</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">z</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">datz</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">intervene_z</span><span class="p">(</span><span class="n">z</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">datx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">intervene_x</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">par</span><span class="p">(</span><span class="n">mfrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">))</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y)'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">datz</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">30</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y | do(Z = 2))'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">hist</span><span class="p">(</span><span class="w">
</span><span class="n">datx</span><span class="p">[,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'Y'</span><span class="p">,</span><span class="w"> </span><span class="n">breaks</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w">
</span><span class="n">xlim</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-6</span><span class="p">,</span><span class="w"> </span><span class="m">6</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray76'</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'P(Y | do(X = 2))'</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-11-30-Causal-Inference.Rmd/unnamed-chunk-7-1.png" title="plot of chunk unnamed-chunk-7" alt="plot of chunk unnamed-chunk-7" style="display: block; margin: auto;" /></p>
<p>The leftmost histogram above shows the marginal distribution of $Y$ when no intervention takes place. The histogram in the middle shows the marginal distribution of $Y$ in the manipulated DAG where we set $Z = 2$. Observe that, as indicated by the causal graph, $Z$ does not have a causal effect on $Y$ such that $p(Y \mid do(Z = 2)) = p(Y)$. The histogram on the right shows the marginal distribution of $Y$ in the manipulated DAG where we set $X = 2$.</p>
<p>Clearly, then, $X$ has a causal effect on $Y$. While we have touched on it already when discussing Simpson’s paradox, we now formally define the <em>Average Causal Effect</em>:</p>
<script type="math/tex; mode=display">\text{ACE}(X \rightarrow Y) = \mathbb{E}\left[Y \mid do(X = x + 1)\right] - \mathbb{E}\left[Y \mid do(X = x)\right] \enspace ,</script>
<p>which in our case equals one, as can also be seen from the Structural Causal Model. Thus, SCMs allow us to model the outcome of interventions.<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup> However, note again that this is strictly about populations, not individuals. In the next section, we see how SCMs can allow us to climb up to the final level of the causal hierarchy, moving beyond the average to define individual causal effects.</p>
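<p>Both claims, that the average causal effect of $X$ on $Y$ equals one and that intervening on $Z$ leaves $Y$ untouched, can be checked by simulation. The following is a minimal sketch: the helper names <code>simulate_do_x</code> and <code>simulate_do_z</code> are hypothetical stand-ins that mirror the intervention functions used earlier, and the larger sample size is chosen here only to reduce Monte Carlo error.</p>

```r
# Simulate from the SCM under do(X = x): X is fixed, Y and Z follow their equations
simulate_do_x <- function(x, n = 1e5) {
  y <- x + rnorm(n, 0, 1)
  z <- y + rnorm(n, 0, 0.1)
  cbind(x = x, y = y, z = z)
}

# Simulate from the SCM under do(Z = z): the arrow Y -> Z is cut, Z is fixed
simulate_do_z <- function(z, n = 1e5) {
  x <- rnorm(n, 0, 1)
  y <- x + rnorm(n, 0, 1)
  cbind(x = x, y = y, z = z)
}

set.seed(1)

# ACE(X -> Y) = E[Y | do(X = x + 1)] - E[Y | do(X = x)], here with x = 0
ace_x <- mean(simulate_do_x(1)[, "y"]) - mean(simulate_do_x(0)[, "y"])

# The analogous contrast for Z, which should be close to zero
ace_z <- mean(simulate_do_z(1)[, "y"]) - mean(simulate_do_z(0)[, "y"])

round(c(ace_x = ace_x, ace_z = ace_z), 2)
```

<p>Up to simulation noise, the first contrast is one and the second is zero, in line with the structural equations.</p>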
<h1 id="counterfactuals">Counterfactuals</h1>
<p>In <em>The Unbearable Lightness of Being</em>, Milan Kundera has Tomáš ask himself:</p>
<blockquote>
<p>“Was it better to be with Tereza or to remain alone?”</p>
</blockquote>
<p>To which he answers:</p>
<blockquote>
<p>“There is no means of testing which decision is better, because there is no basis for comparison. We live everything as it comes, without warning, like an actor going on cold. And what can life be worth if the first rehearsal for life is life itself?”</p>
</blockquote>
<p>Kundera is describing, as Holland (1986, p. 947) put it, the “fundamental problem of causal inference”, namely that we only ever observe one realization. If Tomáš chooses to stay with Tereza, then he cannot also observe what would have happened had he chosen otherwise. He cannot go back in time and reverse his decision, living instead “everything as it comes, without warning”. This does not mean, however, that Tomáš cannot assess afterwards whether his choice has been wise. As a matter of fact, humans constantly evaluate mutually exclusive options, only one of which ever comes true; that is, humans reason <em>counterfactually</em>.</p>
<p>To do this formally requires strong assumptions. The <em>do</em>-operator, introduced above, is too weak to model counterfactuals. This is because it operates on distributions that are defined on populations, not on individuals. We can define an average causal effect using the <em>do</em>-operator, but — unsurprisingly — it only ever refers to averages. Structural Causal Models allow counterfactual reasoning on the level of the individual. To see this, we use a simple example.</p>
<p>Suppose we want to study the causal effect of grandma’s treatment for the common cold ($T$) on the speed of recovery ($R$). Usually, people recover from the common cold in <a href="https://en.wikipedia.org/wiki/Common_cold">seven to ten days</a>, but grandma swears she can do better with a simple intervention — we agree on doing an experiment. Assume we have the following SCM:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T &:= \epsilon_T \\[.5em]
R &:= \mu + \beta T + \epsilon \enspace ,
\end{aligned} %]]></script>
<p>where $\mu$ is an intercept, $\epsilon_T \sim \text{Bern}(0.50)$ indicates random assignment to either receive the treatment ($T = 1$) or not receive it ($T = 0$), and $\epsilon \stackrel{\text{iid}}{\sim} \mathcal{N}(0, \sigma)$. The SCM tells us that the direct causal effect of the treatment on how quickly patients recover from the common cold is $\beta$. This causal effect is obscured by individual error terms for each patient $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_N)$, which are aggregate terms for all the things left unmodelled (see <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">this</a> blog post for some history). In particular, $\epsilon_k$ summarizes all the things that have an effect on the speed of recovery for patient $k$.</p>
<p>Once we have collected the data, suppose we find that $\mu = 5$, $\beta = -2$, and $\sigma = 2$. This does speak for grandma’s treatment, since it shortens the recovery time by 2 days on average:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ACE}(T \rightarrow R) &= \mathbb{E}\left[R \mid do(T = 1)\right] - \mathbb{E}\left[R \mid do(T = 0)\right] \\[.5em]
&= \mathbb{E}\left[\mu + \beta + \epsilon\right] - \mathbb{E}\left[\mu + \epsilon\right] \\[.5em]
&= \left(\mu + \beta\right) - \mu \\[.5em]
&= \beta \enspace .
\end{aligned} %]]></script>
<p>Given the value for $\epsilon_k$, the Structural Causal Model is fully determined, and we may write $R(\epsilon_k)$ for the speed of recovery for patient $k$. To make this example more concrete, we simulate some data in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="c1"># Structural Causal Model</span><span class="w">
</span><span class="n">e_T</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">)</span><span class="w">
</span><span class="n">e</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="nb">T</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">e_T</span><span class="w">
</span><span class="n">R</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="nb">T</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">e</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nb">T</span><span class="p">,</span><span class="w"> </span><span class="n">R</span><span class="p">,</span><span class="w"> </span><span class="n">e</span><span class="p">)</span><span class="w">
</span><span class="n">head</span><span class="p">(</span><span class="n">dat</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## T R e
## [1,] 0 5.7962118 0.7962118
## [2,] 0 3.7759472 -1.2240528
## [3,] 1 3.6822394 0.6822394
## [4,] 1 0.7412738 -2.2587262
## [5,] 0 7.8660474 2.8660474
## [6,] 1 6.9607998 3.9607998</code></pre></figure>
<p>We see that the first patient did not receive the treatment ($T = 0$), took about $R = 5.80$ days to recover from the common cold, and has a unique value $\epsilon_1 = 0.80$. Would this particular patient have recovered more quickly if we had given him grandma’s treatment even though we did not? We denote this quantity of interest as $R_{T = 1}(\epsilon_1)$ to contrast it with the actually observed $R_{T = 0}(\epsilon_1)$. To compute this seemingly otherworldly quantity, we simply plug the value $T = 1$ and $\epsilon_1 = 0.80$ into our Structural Causal Model, which yields:</p>
<script type="math/tex; mode=display">R_{T = 1}(\epsilon_1) = 5 - 2 + 0.80 = 3.80 \enspace .</script>
<!-- There is one remaining complication. Since $\epsilon_1 \sim \mathcal{N}(0, \sigma)$, that is, there is uncertainty as to the effect of unmodelled factors, the counterfactual quantity $R_{T = 1}(\epsilon_1)$ is not deterministic but stochastic. We can average over this uncertainty by taking the expectation, which yields the expected duration of the recovery for patient $k = 1$ when given the treatment, even though the patient did not receive the treatment and had an actual recovery speed of $5.80$. Formally, this is: -->
<!-- $$ -->
<!-- \mathbb{E}\left[R_{T = 1} \mid T = 0, R = 5.80\right] = \mathbb{E}\left[5 - 2 + \epsilon_1\right] = 3 \enspace . -->
<!-- $$ -->
<p>Using this, we can define the <em>individual causal effect</em> as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ICE}(T \rightarrow R) &= R_{T = 1}(\epsilon_1) - R_{T = 0}(\epsilon_1) \\[.5em]
&= 3.80 - 5.80 \\[.5em]
&= -2 \enspace ,
\end{aligned} %]]></script>
<p>which in this example is equal to the average causal effect due to the <a href="https://stats.stackexchange.com/a/385558">linearity of the underlying SCM</a> (Pearl, Glymour, & Jewell 2016, p. 106). In general, individual causal effects are not identified, and we have to resort to average causal effects.<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup></p>
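<p>The same computation can be carried out for every patient at once. The sketch below is one way to do this under the assumption that the structural equation $R := 5 - 2T + \epsilon$ is known exactly: we first recover each patient's noise term from the observed data, then set the treatment to the value of interest, and finally recompute the outcome. The object names are illustrative, and the treatment is stored as <code>treat</code> rather than <code>T</code> to avoid masking R's shorthand for <code>TRUE</code>.</p>

```r
set.seed(1)
n <- 100

# Simulate from the SCM: T := e_T, R := 5 - 2 * T + e
treat <- rbinom(n, 1, .5)
e     <- rnorm(n, 0, 2)
R     <- 5 - 2 * treat + e

# Step 1: recover each patient's noise term from the observed (T, R)
e_hat <- R - (5 - 2 * treat)

# Steps 2 and 3: set T to the value of interest and recompute the outcome
R_T1 <- 5 - 2 * 1 + e_hat  # recovery time had the patient been treated
R_T0 <- 5 - 2 * 0 + e_hat  # recovery time had the patient not been treated

# Individual causal effect for every patient
ice <- R_T1 - R_T0

head(cbind(treat, R, R_T0, R_T1, ice))
```

<p>For each patient, one of the two potential outcomes coincides with the observed recovery time and the other is counterfactual. Because the SCM is linear with additive noise, the individual causal effect is $-2$ for everyone.</p>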
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{ICE}(T \rightarrow R) &= \mathbb{E}\left[R_{T = 1} \mid T = 1, R = 5.80\right] - \mathbb{E}\left[R_{T = 1} \mid T = 0, R = 5.80\right] \\[.5em] -->
<!-- &= \mathbb{E}\left[5 - 2 + \epsilon_1\right] - \mathbb{E}\left[5 + \epsilon_1\right] \\[.5em] -->
<!-- &= 3 \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<p>Answering the question of whether a particular patient would have recovered more quickly had we given him the treatment even though we did not give him the treatment seems almost fantastical. It is a <em>cross-world</em> statement: given what we have observed, we ask about what would have been if things had turned out differently. It may strike you as a bit eerie to speak about different worlds. Peters, Janzing, & Schölkopf (2017, p. 106) state that it is “debatable whether this additional [counterfactual] information [encoded in the SCM] is useful.” It certainly requires strong assumptions. More broadly, Dawid (2000) argues in favour of causal inference without counterfactuals, and he does not seem to have shifted his position in <a href="https://twitter.com/fdabl/status/1110944752571158528">recent years</a>. Yet if we want to design machines that can achieve human-level reasoning, we need to endow them with counterfactual thinking (Pearl, 2019a). Moreover, many concepts that are relevant in legal and ethical domains, such as fairness (Kusner et al., 2017), require counterfactuals.</p>
<p>Before we end, note that the graphical approach to causal inference outlined in this blog post is not the only game in town. The <em>potential outcome</em> framework for causal inference developed by <a href="https://en.wikipedia.org/wiki/Rubin_causal_model">Donald Rubin</a> and others avoids graphical models and takes counterfactual quantities as primary. However, although it starts from counterfactual statements that are defined at the individual level, it is my understanding that most work that uses potential outcomes focuses on <em>average causal effects</em>. As outlined above, these only require the second level of the causal hierarchy (<em>doing</em>) and are therefore much less contentious than <em>individual causal effects</em>, which sit at the top of the causal hierarchy.</p>
<p>The graphical approach outlined in this blog post and the potential outcome framework are logically equivalent (Peters, Janzing, & Schölkopf, 2017, p. 125), and although there is quite some debate surrounding the two approaches, it is probably wise to be pragmatic and simply choose the tool that works best for a particular application. As Lauritzen (2004, p. 189) put it, he sees the</p>
<blockquote>
<p>“different formalisms as different ‘languages’. The French language may be best for making love whereas the Italian may be more suitable for singing, but both are indeed possible, and I have no difficulty accepting that potential responses, structural equations, and graphical models coexist as languages expressing causal concepts each with their virtues and vices.<sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup>”</p>
</blockquote>
<p>For further reading, I wholeheartedly recommend the textbooks by Pearl, Glymour, & Jewell (<a href="http://bayes.cs.ucla.edu/PRIMER/">2016</a>) as well as Peters, Janzing, & Schölkopf (<a href="https://mitpress.mit.edu/books/elements-causal-inference">2017</a>). For bedtime reading, I can recommend Pearl & McKenzie (<a href="https://www.goodreads.com/book/show/36204378-the-book-of-why">2018</a>). Miguel Hernán teaches an excellent introductory online course on causal diagrams <a href="https://www.edx.org/course/causal-diagrams-draw-your-assumptions-before-your">here</a>.</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have touched on several key concepts of causal inference. We have started with the puzzling observation that chocolate consumption and the number of Nobel Laureates are strongly positively related. At the lowest level of the causal ladder — association — we have seen how directed acyclic graphs can help us visualize conditional independencies, and how <em>d</em>-separation provides us with an algorithmic tool to check such independencies.</p>
<p>Moving up to the second level (intervention), we have seen how the <em>do</em>-operator models populations under interventions. This helped us define <em>confounding</em>, the bane of observational data analysis, as occurring when $p(Y \mid X = x) \neq p(Y \mid do(X = x))$. This comes with the important observation that entering all variables into a regression in order to “control” for them is misguided; rather, we need to think carefully about the underlying causal relations, lest we introduce bias by, for example, conditioning on a collider. The <em>backdoor criterion</em> provided us with a graphical way to assess whether an effect is confounded or not.</p>
<p>Finally, we have seen that Structural Causal Models (SCMs) provide the building block from which observational and interventional distributions follow. SCMs further imply counterfactual statements, which sit at the top of the causal hierarchy. These allow us to move beyond the <em>do</em>-operator and average causal effects: they enable us to answer questions about what would have been if things had been different.</p>
<hr />
<p><em>I would like to thank <a href="https://ryanoisin.github.io/">Oisín Ryan</a> and <a href="https://cruwell.com/">Sophia Crüwell</a> for very helpful comments on this blog.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Bollen, K. A., & Pearl, J. (<a href="https://link.springer.com/chapter/10.1007/978-94-007-6094-3_15">2013</a>). Eight myths about causality and structural equation models. In <em>Handbook of Causal Analysis for Social Research</em> (pp. 301-328). Springer, Dordrecht.</li>
<li>Cartwright, N. (2007). <em>Hunting Causes and Using them: Approaches in Philosophy and Economics</em>. Cambridge University Press.</li>
<li>Dawid, A. P. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.2000.10474210">2000</a>). Causal inference without counterfactuals. <em>Journal of the American Statistical Association, 95</em>(450), 407-424.</li>
<li>Hernán, M. A., & Robins, J. M. (<a href="https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/">2020</a>). <em>Causal Inference: What If</em>. Boca Raton: Chapman & Hall/CRC.</li>
<li>Holland, P. W. (<a href="https://www.tandfonline.com/doi/abs/10.1080/01621459.1986.10478354">1986</a>). Statistics and causal inference. <em>Journal of the American Statistical Association, 81</em>(396), 945-960.</li>
<li>Kusner, M. J., Loftus, J., Russell, C., & Silva, R. (<a href="https://papers.nips.cc/paper/6995-counterfactual-fairness">2017</a>). Counterfactual fairness. In <em>Advances in Neural Information Processing Systems</em> (pp. 4066-4076).</li>
<li>Lauritzen, S. L., Aalen, O. O., Rubin, D. B., & Arjas, E. (<a href="https://www.jstor.org/stable/4616823?casa_token=aseDj2RNgjcAAAAA:iJpo1EhqcVN_89UT2AMMR0FynAC9mnake3YgBbFUoG81rNn8jbVQcQTs6NJdt3l3XDOQDRBreeILUOpvNrRglQ8CR6HQuHbg7x_F6CIIdaK_rTVfFfZMUg&seq=1#metadata_info_tab_contents">2004</a>). Discussion on Causality [with Reply]. <em>Scandinavian Journal of Statistics, 31</em>(2), 189-201.</li>
<li>Pearl, J. (<a href="https://dl.acm.org/citation.cfm?id=3241036">2019a</a>). The seven tools of causal inference, with reflections on machine learning. <em>Communications of the ACM, 62</em>(3), 54-60.</li>
<li>Pearl, J. (<a href="https://www.degruyter.com/view/j/jci.2019.7.issue-1/jci-2019-2002/jci-2019-2002.xml">2019b</a>). On the interpretation of do(x). <em>Journal of Causal Inference, 7</em>(1).</li>
<li>Pearl, J. (<a href="https://ftp.cs.ucla.edu/pub/stat_ser/r370.pdf">2012</a>). The Causal Foundations of Structural Equation Modeling.</li>
<li>Pearl, J., Glymour, M., & Jewell, N. P. (<a href="http://bayes.cs.ucla.edu/PRIMER/">2016</a>). <em>Causal Inference in Statistics: A Primer</em>. John Wiley & Sons.</li>
<li>Peters, J., Janzing, D., & Schölkopf, B. (<a href="https://mitpress.mit.edu/books/elements-causal-inference">2017</a>). <em>Elements of Causal Inference: Foundations and Learning Algorithms</em>. MIT Press.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The content of this blog post is an extended write-up of a one-hour lecture I gave to $3^{\text{rd}}$ year psychology undergraduate students at the University of Amsterdam. You can view the presentation, which includes exercises at the end, <a href="https://fabiandablander.com/assets/talks/Causal-Lecture">here</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Messerli (2012) was the first to look at this relationship. The data I present here are somewhat different. I include Nobel Laureates up to 2019, and I use the 2017 chocolate consumption data as reported <a href="https://www.statista.com/statistics/819288/worldwide-chocolate-consumption-by-country/">here</a>. You can download the data set <a href="https://fabiandablander.com/assets/data/nobel-chocolate.csv">here</a>. To get the data reported by Messerli (2012) into R, you can follow <a href="http://gforge.se/2012/12/chocolate-and-nobel-prize/">this</a> blogpost. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I can recommend <a href="https://www.edx.org/course/causal-diagrams-draw-your-assumptions-before-your">this</a> course on causal diagrams by Miguel Hernán to get more intuition for causal graphical models. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>If the converse implication, that is, the implication from the distribution to the graph holds, we say that the graph is <em>faithful</em> to the distribution. This is an important assumption in causal learning because it allows one to estimate causal relations from conditional independencies in the data. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>A causal effect is <em>direct</em> only at a particular level of abstraction. The drug works by inducing certain biochemical reactions that might themselves be described by DAGs. On a finer scale, then, the direct effect ceases to be direct. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
<li id="fn:6">
<p>Structural Causal Models are closely related to Structural Equation Models. The latter allow latent variables, but their causal content has been debated throughout the last century. For more information, see for example Pearl (2012) and Bollen & Pearl (2013). <a href="#fnref:6" class="reversefootnote">↩</a></p>
</li>
<li id="fn:7">
<p>For the interpretation of the <em>do</em>-operator for non-manipulable causes, see Pearl (2019b). <a href="#fnref:7" class="reversefootnote">↩</a></p>
</li>
<li id="fn:8">
<p>Here, we have focused on <em>deterministic</em> counterfactuals, assigning a single value to the counterfactual $R_{T = 1}(\epsilon_1)$. This is in contrast to <em>stochastic</em> or <em>non-deterministic</em> counterfactuals, which follow a distribution. This distinction does not matter for average causal effects, but it does for individual ones (Hernán & Robins, 2020, p. 10). <a href="#fnref:8" class="reversefootnote">↩</a></p>
</li>
<li id="fn:9">
<p>One can only hope that Bayesians and Frequentists become inspired by the pragmatism expressed here so poetically by Lauritzen. <a href="#fnref:9" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Fabian Dablander
<p>An extended version of this blog post is available from here.</p>
<p>Causal inference goes beyond prediction by modeling the outcome of interventions and formalizing counterfactual reasoning. In this blog post, I provide an introduction to the graphical approach to causal inference in the tradition of Sewall Wright, Judea Pearl, and others.</p>
<p>We first rehash the common adage that correlation is not causation. We then move on to climb what Pearl calls the “ladder of causal inference”, from association (seeing) to intervention (doing) to counterfactuals (imagining). We will discover how directed acyclic graphs describe conditional (in)dependencies; how the do-calculus describes interventions; and how Structural Causal Models allow us to imagine what could have been. This blog post is by no means exhaustive, but should give you a first appreciation of the concepts that surround causal inference; references to further readings are provided below. Let’s dive in!<sup><a href="#fn:1" class="footnote">1</a></sup></p>
<h1>Correlation and Causation</h1>
<p>Messerli (2012) published a paper entitled “Chocolate Consumption, Cognitive Function, and Nobel Laureates” in The New England Journal of Medicine showing a strong positive relationship between chocolate consumption and the number of Nobel Laureates. I have found an even stronger relationship using updated data<sup><a href="#fn:2" class="footnote">2</a></sup>, as visualized in the figure below.</p>
<p>Now, except for people in the chocolate business, it would be quite a stretch to suggest that increasing chocolate consumption would increase the number of Nobel Laureates. Correlation does not imply causation because it does not constrain the possible causal relations enough. Hans Reichenbach (1956) formulated the common cause principle, which speaks to this fact:</p>
<blockquote>
<p>If two random variables $X$ and $Y$ are statistically dependent ($X \not \perp Y$), then either (a) $X$ causes $Y$, (b) $Y$ causes $X$, or (c) there exists a third variable $Z$ that causes both $X$ and $Y$. Further, $X$ and $Y$ become independent given $Z$, i.e., $X \perp Y \mid Z$.</p>
</blockquote>
<p>An in principle straightforward way to break this uncertainty is to conduct an experiment: we could, for example, force the citizens of Austria to consume more chocolate, and study whether this increases the number of Nobel Laureates in the following years. Such experiments are clearly unfeasible, but even in less extreme settings it is frequently unethical, impractical, or impossible (think of smoking and lung cancer) to study an association experimentally.</p>
<p>Causal inference provides us with tools that license causal statements even in the absence of a true experiment. This comes with strong assumptions. In the next section, we discuss the “causal hierarchy”.</p>
<h1>The Causal Hierarchy</h1>
<p>Pearl (2019a) introduces a causal hierarchy with three levels (association, intervention, and counterfactuals) as well as three prototypical actions corresponding to each level (seeing, doing, and imagining). In the remainder of this blog post, we will tackle each level in turn.</p>
<h1>Seeing</h1>
<p>Association is on the most basic level; it makes us see that two or more things are somehow related. Importantly, we need to distinguish between marginal associations and conditional associations. The latter are the key building block of causal inference. The figure below illustrates these two concepts.</p>
<p>If we look at the whole, aggregated data on the left, we see that the continuous variables $X$ and $Y$ are positively correlated: an increase in values for $X$ co-occurs with an increase in values for $Y$. This relation describes the marginal association of $X$ and $Y$ because we do not care whether $Z = 0$ or $Z = 1$. On the other hand, if we condition on the binary variable $Z$, we find that there is no relation: $X \perp Y \mid Z$. (For more on marginal and conditional associations in the case of Gaussian distributions, see this blog post).</p>
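<p>To make the distinction concrete, here is a minimal R sketch. It assumes one possible data-generating process, in which a binary $Z$ shifts both $X$ and $Y$; the shift of 2, the unit-variance noise, and the sample size are all chosen purely for illustration. It compares the marginal correlation with the correlations within each level of $Z$.</p>

```r
set.seed(1)
n <- 10000

# Assumed data-generating process (illustrative only)
z <- rbinom(n, size = 1, prob = 0.5)   # binary variable Z
x <- 2 * z + rnorm(n)                  # X depends on Z only
y <- 2 * z + rnorm(n)                  # Y depends on Z only

# Marginal association: pooled over both values of Z
cor_marginal <- cor(x, y)

# Conditional association: computed separately within Z = 0 and Z = 1
cor_z0 <- cor(x[z == 0], y[z == 0])
cor_z1 <- cor(x[z == 1], y[z == 1])

round(c(marginal = cor_marginal, z0 = cor_z0, z1 = cor_z1), 2)
```

<p>Under these assumed settings, the marginal correlation should be close to 0.5, while both conditional correlations should be close to zero.</p>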
<p>In the next section, we discuss a powerful tool that allows us to visualize such dependencies.</p>
<h2>Directed Acyclic Graphs</h2>
<p>We can visualize the statistical dependencies between the three variables using a graph. A graph is a mathematical object that consists of nodes and edges. In the case of Directed Acyclic Graphs (DAGs), these edges are directed. We take our variables $(X, Y, Z)$ to be nodes in such a DAG and we draw (or omit) edges between these nodes so that the conditional (in)dependence structure in the data is reflected in the graph. We will explain this more formally shortly. For now, let’s focus on the relationship between the three variables. We have seen that $X$ and $Y$ are marginally dependent but conditionally independent given $Z$. It turns out that we can draw three DAGs that encode this fact; these are the first three DAGs in the figure below. $X$ and $Y$ are dependent through $Z$ in these graphs, and conditioning on $Z$ blocks the path between $X$ and $Y$.</p>
<p>While it is natural to interpret the arrows causally, we do not do so here. For now, the arrows are merely tools that help us describe associations between variables.</p>
<p>The figure above also shows a fourth DAG, which encodes a different set of conditional (in)dependence relations between $X$, $Y$, and $Z$. The figure below illustrates this: looking at the aggregated data we do not find a relation between $X$ and $Y$ (they are marginally independent), but we do find one when looking at the disaggregated data ($X$ and $Y$ are conditionally dependent given $Z$).</p>
<p>A real-world example might help build intuition: Looking at people who are single and people who are in a relationship together, as a single group, being attractive ($X$) and being intelligent ($Y$) are two independent traits. This is what we see in the left panel in the figure above. Let’s make the reasonable assumption that both being attractive and being intelligent are positively related with being in a relationship. What does this imply? First, it implies that, on average, single people are less attractive and less intelligent (see red data points). Second, and perhaps counter-intuitively, it implies that in the population of single people (and people in a relationship, respectively), being attractive and being intelligent are negatively correlated. After all, if the handsome person you met at the bar were also intelligent, then he would most likely be in a relationship!</p>
<p>In this example, visualized in the fourth DAG, $Z$ is commonly called a collider. Suppose we want to estimate the association between $X$ and $Y$ in the whole population. Conditioning on a collider (for example, by only analyzing data from people who are not in a relationship) while computing the association between $X$ and $Y$ will lead to a different estimate, and the induced bias is known as collider bias. It is a serious issue not only in dating, but also for example in medicine.</p>
<p>The simple graphs shown above are the building blocks of more complicated graphs. In the next section, we describe a tool that can help us find (conditional) independencies between sets of variables.<sup><a href="#fn:3" class="footnote">3</a></sup></p>
<h2>$d$-separation</h2>
<p>For large graphs, it is not obvious how to conclude that two nodes are (conditionally) independent. d-separation is a tool that allows us to check this algorithmically. We need to define some concepts:</p>
<ul>
<li>A path from $X$ to $Y$ is a sequence of nodes and edges such that the start and end nodes are $X$ and $Y$, respectively.</li>
<li>A conditioning set $\mathcal{L}$ is the set of nodes we condition on (it can be empty).</li>
<li>Conditioning on a non-collider along a path blocks that path.</li>
<li>A collider along a path blocks that path. However, conditioning on a collider (or any of its descendants) unblocks that path.</li>
</ul>
<p>With these definitions out of the way, we call two nodes $X$ and $Y$ $d$-separated by $\mathcal{L}$ if conditioning on all members in $\mathcal{L}$ blocks all paths between the two nodes. If this is your first encounter with $d$-separation, then this is a lot to wrap your head around. To get some practice, look at the graph on the left side.</p>
<p>First, note that there are no marginal independencies; this means that without conditioning or blocking nodes, any two nodes are connected by a path. For example, there is a path going from $X$ to $Y$ through $Z$, and there is a path from $V$ to $U$ going through $Y$ and $W$.</p>
<p>However, there are a number of conditional independencies. For example, $X$ and $Y$ are conditionally independent given $Z$. Why? There are two paths from $X$ to $Y$: one through $Z$ and one through $W$. However, since $W$ is a collider on the path from $X$ to $Y$, that path is already blocked. The only unblocked path from $X$ to $Y$ is through $Z$, and conditioning on $Z$ therefore blocks all remaining open paths. Additionally conditioning on $W$ would unblock one path, and $X$ and $Y$ would again be associated.</p>
<p>So far, we have implicitly assumed that conditional (in)dependencies in the graph correspond to conditional (in)dependencies between variables. We make this assumption explicit now. In particular, note that d-separation provides us with an independence model $\perp_{\mathcal{G}}$ defined on graphs. To connect this to our standard probabilistic independence model $\perp_{\mathcal{P}}$ defined on random variables, we assume the following Markov property:</p>
<script type="math/tex; mode=display">X \perp_{\mathcal{G}} Y \mid Z \implies X \perp_{\mathcal{P}} Y \mid Z \enspace .</script>
<p>In words, we assume that if the nodes $X$ and $Y$ are d-separated by $Z$ in the graph $\mathcal{G}$, the corresponding random variables $X$ and $Y$ are conditionally independent given $Z$. This implies that all conditional independencies encoded in the graph hold in the data.<sup><a href="#fn:4" class="footnote">4</a></sup> Moreover, the statement above implies (and is implied by) the following factorization:</p>
<script type="math/tex; mode=display">p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,</script>
<p>where $\text{pa}^{\mathcal{G}}(X_i)$ denotes the parents of the node $X_i$ in graph $\mathcal{G}$ (see Peters, Janzing, & Schölkopf, 2017, p. 101). A node is a parent of another node if it has an outgoing arrow to that node; for example, $X$ is a parent of $Z$ and $W$ in the graph above.</p>
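<p>The $d$-separation statements about $X$ and $Y$ can be checked in a small simulation. The R sketch below rests on several assumptions of mine: it uses only the edges $X \rightarrow Z$, $Z \rightarrow Y$, $X \rightarrow W$, $Y \rightarrow W$, and $W \rightarrow U$ (my reading of the graph discussed here, with $V$ left out), it takes all relations to be linear with unit coefficients and standard normal noise, and it uses partial correlations as a stand-in for conditional independence, which is only justified in this Gaussian setting. The helper <code>partial_cor</code> is a hypothetical name introduced for illustration.</p>

```r
set.seed(123)
n <- 100000

# Assumed edges: X -> Z, Z -> Y, X -> W, Y -> W, W -> U
x <- rnorm(n)
z <- x + rnorm(n)
y <- z + rnorm(n)
w <- x + y + rnorm(n)
u <- w + rnorm(n)

# Partial correlation of a and b given the columns of `given`,
# computed by correlating the residuals of two linear regressions
partial_cor <- function(a, b, given) {
  cor(residuals(lm(a ~ given)), residuals(lm(b ~ given)))
}

cor_xy     <- cor(x, y)                       # open path X -> Z -> Y
pcor_xy_z  <- partial_cor(x, y, cbind(z))     # Z blocks that path; W stays a blocking collider
pcor_xy_zw <- partial_cor(x, y, cbind(z, w))  # conditioning on W opens X -> W <- Y

round(c(marginal = cor_xy, given_z = pcor_xy_z, given_zw = pcor_xy_zw), 2)
```

<p>In this sketch, $X$ and $Y$ should be clearly correlated marginally, approximately uncorrelated given $Z$, and negatively correlated once we additionally condition on the collider $W$.</p>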
<p>The factorization above implies that a node $X$ is independent of its non-descendants given its parents. d-separation is an extremely powerful tool. Until now, however, we have only looked at DAGs to visualize (conditional) independencies. In the next section, we go beyond seeing to doing.</p>
<h1>Doing</h1>
<p>We do not merely want to see the world, but also change it. From this section on, we are willing to interpret DAGs causally. As Dawid (2009a) warns, this is a serious step. In merely describing conditional independencies (seeing), the arrows in the DAG played a somewhat minor role, being nothing but “incidental construction features supporting the $d$-separation semantics” (Dawid, 2009a, p. 66). In this section, we endow the DAG with a causal meaning and interpret the arrows as denoting direct causal effects.</p>
<p>What is a causal effect? Following Pearl and others, we take an interventionist position and say that a variable $X$ has a causal influence on $Y$ if changing $X$ leads to changes in $Y$. This position is a very useful one in practice, but not everybody agrees with it (e.g., Cartwright, 2007).</p>
<p>The figure below shows the observational DAGs from above (top row) as well as the manipulated DAGs (bottom row) where we have intervened on the variable $X$, that is, set the value of the random variable $X$ to a constant $x$. Note that setting the value of $X$ cuts all incoming causal arrows since its value is thereby determined only by the intervention, not by any other factors.</p>
<p>As is easily verified with $d$-separation, the first three graphs in the top row encode the same conditional independence structure. This implies that we cannot distinguish them using only observational data. Interpreting the edges causally, however, we see that the DAGs have a starkly different interpretation. The bottom row makes this apparent by showing the result of an intervention on $X$. In the leftmost causal DAG, $Z$ is on the causal path from $X$ to $Y$, and intervening on $X$ therefore influences $Y$ through $Z$. In the DAG next to it, $Z$ is on the causal path from $Y$ to $X$, and so intervening on $X$ does not influence $Y$. In the third DAG, $Z$ is a common cause and, since there is no other path from $X$ to $Y$, intervening on $X$ does not influence $Y$. For the collider structure in the rightmost DAG, intervening on $X$ does not influence $Y$ because there is no unblocked path from $X$ to $Y$.</p>
<p>To make the distinction between seeing and doing, Pearl introduced the do-operator. While $p(Y \mid X = x)$ denotes the observational distribution, which corresponds to the process of seeing, $p(Y \mid do(X = x))$ corresponds to the interventional distribution, which corresponds to the process of doing. The former describes what values $Y$ would likely take on when $X$ happened to be $x$, while the latter describes what values $Y$ would likely take on if $X$ were set to $x$.</p>
<h2>Computing causal effects</h2>
<p>$p(Y \mid do(X = x))$ describes the causal effect of $X$ on $Y$, but how do we compute it? Actually doing the intervention might be unfeasible or unethical; side-stepping actual interventions and still getting at causal effects is the whole point of this approach to causal inference. We want to learn causal effects from observational data, and so all we have is the observational DAG. The causal quantity, however, is defined on the manipulated DAG. We need to build a bridge between the observational DAG and the manipulated DAG, and we do this by making two assumptions.</p>
<p>First, we assume that interventions are local. This means that if I set $X = x$, then this only influences the variable $X$, with no other direct influence on any other variable. Of course, intervening on $X$ will influence other variables, but only through $X$, not directly through us intervening. In colloquial terms, we do not have a “fat hand”, but act like a surgeon precisely targeting only a very specific part of the DAG; we say that the DAG is composed of modular parts. We can encode this using the factorization property above:</p>
<script type="math/tex; mode=display">p(X_1, X_2, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace ,</script>
<p>which we now interpret causally. The factors in the product are sometimes called causal Markov kernels; they constitute the modular parts of the system.</p>
<p>Second, we assume that the mechanisms by which variables interact do not change through interventions; that is, the mechanism by which a cause brings about its effects does not change whether this occurs naturally or by intervention (see e.g., Pearl, Glymour, & Jewell, 2016, p. 56).</p>
<p>With these two assumptions in hand, further note that $p(Y \mid do(X = x))$ can be understood as the observational distribution in the manipulated DAG, $p_m(Y \mid X = x)$, that is, in the DAG where we set $X = x$. This is because after doing the intervention (which catapults us into the manipulated DAG), all that is left for us to do is to see its effect. Observe that the leftmost and rightmost DAG above remain the same under intervention on $X$, and so the interventional distribution $p(Y \mid do(X = x))$ is just the conditional distribution $p(Y \mid X = x)$. The middle DAGs require a bit more work:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(Y = y \mid do(X = x)) &= p_m(Y = y \mid X = x) \\[.5em]
&= \sum_{z} p_m(Y = y, Z = z \mid X = x) \\[.5em]
&= \sum_{z} p_m(Y = y \mid X = x, Z = z) \, p_m(Z = z) \\[.5em]
&= \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \enspace .
\end{aligned} %]]></script>
<p>The first equality follows by definition. The second and third equality follow from the sum and product rule of probability (noting that $Z$ is independent of $X$ in the manipulated DAG). The last line follows from the assumption that the mechanism through which $X$ influences $Y$ is independent of whether we set $X$ or whether $X$ naturally occurs, that is, $p_{m}(Y = y \mid X = x, Z = z) = p(Y = y \mid X = x, Z = z)$, and the assumption that interventions are local, that is, $p_m(Z = z) = p(Z = z)$. Thus, the interventional distribution we care about is equal to the conditional distribution of $Y$ given $X$ when we adjust for $Z$.</p>
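<p>A small worked example may help to see the difference between conditioning and adjusting. The R sketch below assumes the common-cause DAG $X \leftarrow Z \rightarrow Y$ with binary variables; all probabilities are made-up numbers for illustration, and the function names <code>seeing</code> and <code>doing</code> are hypothetical. Since there is no arrow from $X$ to $Y$, the adjusted quantity should not depend on $x$.</p>

```r
# Assumed toy probabilities (illustrative only)
p_z    <- c(0.5, 0.5)   # p(Z = 0), p(Z = 1)
p_x1_z <- c(0.2, 0.8)   # p(X = 1 | Z = 0), p(X = 1 | Z = 1)

# p(Y = 1 | X = x, Z = z): rows index x = 0, 1; columns index z = 0, 1.
# Both rows are equal because there is no arrow from X to Y in this DAG.
p_y1_xz <- rbind(c(0.1, 0.7),
                 c(0.1, 0.7))

# Seeing: p(Y = 1 | X = x) = sum_z p(Y = 1 | x, z) p(z | x)
seeing <- function(x) {
  p_x_z <- if (x == 1) p_x1_z else 1 - p_x1_z   # p(X = x | Z = z)
  p_z_x <- p_x_z * p_z / sum(p_x_z * p_z)       # Bayes' rule: p(Z = z | X = x)
  sum(p_y1_xz[x + 1, ] * p_z_x)
}

# Doing: p(Y = 1 | do(X = x)) = sum_z p(Y = 1 | x, z) p(z)
doing <- function(x) {
  sum(p_y1_xz[x + 1, ] * p_z)
}

c(seeing_x0 = seeing(0), seeing_x1 = seeing(1))   # 0.22 0.58
c(doing_x0  = doing(0),  doing_x1  = doing(1))    # 0.40 0.40
```

<p>With these numbers, observing $X = 1$ rather than $X = 0$ raises the probability of $Y = 1$ from 0.22 to 0.58, purely because $X$ carries information about $Z$. Setting $X$ leaves it at 0.40 in both cases.</p>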
<p>Graphically speaking, adjusting for $Z$ blocks the path $X \leftarrow Z \leftarrow Y$ in the left middle graph and the path $X \leftarrow Z \rightarrow Y$ in the right middle graph. If there were a path $X \rightarrow Y$ in these two graphs, and if we did not adjust for $Z$, then the causal effect of $X$ on $Y$ would be confounded. For these simple DAGs, however, it is already clear from the fact that $X$ is independent of $Y$ given $Z$ that $X$ cannot have a causal effect on $Y$. In the next section, we study a more complicated graph and look at confounding more closely.</p>
<h2>Confounding</h2>
<p>Confounding has been given various definitions over the decades, but usually denotes the situation where a (possibly unobserved) common cause obscures the causal relationship between two or more variables. Here, we are slightly more precise and call a causal effect of $X$ on $Y$ confounded if $p(Y \mid X = x) \neq p(Y \mid do(X = x))$, which also implies that collider bias is a type of confounding. This occurred in the middle two DAGs in the example above, as well as in the chocolate consumption and Nobel Laureates example at the beginning of the blog post.</p>
<p>Confounding is the bane of observational data analysis. Helpfully, causal DAGs provide us with a tool to describe multivariate relations between variables. Once we have stated our assumptions clearly, the do-calculus further provides us with a means to know what variables we need to adjust for so that causal effects are unconfounded. We follow Pearl, Glymour, & Jewell (2016, p. 61) and define the backdoor criterion:</p>
<blockquote>
<p>Given two nodes $X$ and $Y$, an adjustment set $\mathcal{L}$ fulfills the backdoor criterion if no member in $\mathcal{L}$ is a descendant of $X$ and members in $\mathcal{L}$ block all paths between $X$ and $Y$ that contain an arrow into $X$. Adjusting for $\mathcal{L}$ thus yields the causal effect of $X \rightarrow Y$.</p>
</blockquote>
<p>The key observation is that adjusting for such a set (a) blocks all spurious, that is, non-causal paths between $X$ and $Y$, (b) leaves all directed paths from $X$ to $Y$ unblocked, and (c) creates no spurious paths.</p>
<p>To see this in action, let’s again look at the DAG on the left. The causal effect of $Z$ on $U$ is confounded by $X$, because in addition to the legitimate causal path $Z \rightarrow Y \rightarrow W \rightarrow U$, there is also an unblocked path $Z \leftarrow X \rightarrow W \rightarrow U$ which confounds the causal effect. The backdoor criterion would have us condition on $X$, which blocks the spurious path and renders the causal effect of $Z$ on $U$ unconfounded. Note that conditioning on $W$ would also block this spurious path; however, it would also block the causal path $Z \rightarrow Y \rightarrow W \rightarrow U$.</p>
<p>Before moving on, let’s catch a quick breath. We have already discussed a number of very important concepts. At the lowest level of the causal hierarchy (association), we have discovered DAGs and $d$-separation as a powerful tool to reason about conditional (in)dependencies between variables. Moving to intervention, the second level of the causal hierarchy, we have satisfied our need to interpret the arrows in a DAG causally. Doing so required strong assumptions, but it allowed us to go beyond seeing and model the outcome of interventions. This hopefully clarified the notion of confounding. In particular, collider bias is a type of confounding, which has important practical implications: we should not blindly enter all variables into a regression in order to “control” for them, but think carefully about what the underlying causal DAG could look like. Otherwise, we might induce spurious associations.</p>
<p>The concepts from causal inference can help us understand methodological phenomena that have been discussed for decades. In the next section, we apply the concepts we have seen so far to make sense of one such phenomenon: Simpson’s Paradox.</p>
<h2>Example Application: Simpson’s Paradox</h2>
<p>This section follows the example given in Pearl, Glymour, & Jewell (2016, Ch. 1) with slight modifications. Suppose you observe $N = 700$ patients who either choose to take a drug or not; note that this is not a randomized controlled trial. The table below shows the number of recovered patients split across sex.</p>
A brief primer on Variational Inference
2019-10-30T12:00:00+00:00
https://fabiandablander.com/r/Variational-Inference
<link rel="stylesheet" href="../highlight/styles/default.css" />
<script src="../highlight/highlight.pack.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<script>$('pre.stan code').each(function(i, block) {hljs.highlightBlock(block);});</script>
<p>Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let’s start unpacking!</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>Bayes’ rule states that</p>
<script type="math/tex; mode=display">\underbrace{p(\mathbf{z} \mid \mathbf{x})}_{\text{Posterior}} = \underbrace{p(\mathbf{z})}_{\text{Prior}} \times \frac{\overbrace{p(\mathbf{x} \mid \mathbf{z})}^{\text{Likelihood}}}{\underbrace{\int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}}_{\text{Marginal Likelihood}}} \enspace ,</script>
<p>where $\mathbf{z}$ denotes latent parameters we want to infer and $\mathbf{x}$ denotes data.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> Bayes’ rule is, in general, difficult to apply because it requires dealing with a potentially high-dimensional integral: the marginal likelihood. Optimization, which involves taking derivatives instead of integrating, is much <a href="https://xkcd.com/2117/">easier</a> and generally faster than integration, and so our goal will be to reframe this integration problem as one of optimization.</p>
<h1 id="variational-objective">Variational objective</h1>
<p>We want to get at the posterior distribution, but instead of sampling we simply try to find a density $q^\star(\mathbf{z})$ from a family of densities $\mathrm{Q}$ that best approximates the posterior distribution:</p>
<script type="math/tex; mode=display">q^\star(\mathbf{z}) = \underbrace{\text{argmin}}_{q(\mathbf{z}) \in \mathrm{Q}} \text{ KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \enspace ,</script>
<p>where $\text{KL}(. \lvert \lvert.)$ denotes the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler divergence</a>:</p>
<script type="math/tex; mode=display">\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) = \int q(\mathbf{z}) \, \text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})} \mathrm{d}\mathbf{z} \enspace .</script>
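<p>To get a feel for this quantity, consider a toy case in which both distributions are known and discrete, so that the integral becomes a sum. The R sketch below is only an illustration under that assumption; the probability vectors are made up, and the helper <code>kl</code> is a hypothetical name. It shows that the divergence is zero for identical distributions, positive otherwise, and not symmetric in its arguments.</p>

```r
# KL(q || p) for two discrete distributions on the same support,
# assuming all probabilities are strictly positive
kl <- function(q, p) sum(q * log(q / p))

# Made-up distributions over three outcomes (illustrative only)
q <- c(0.5, 0.3, 0.2)
p <- c(0.8, 0.1, 0.1)

kl(q, q)   # 0: a distribution has zero divergence from itself
kl(q, p)   # roughly 0.233
kl(p, q)   # roughly 0.197: the divergence is not symmetric
```

<p>The asymmetry matters for variational inference: the objective above puts the approximation $q$ in the first argument, so the expectation is taken with respect to $q$ rather than the posterior.</p>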
<p>We cannot compute the Kullback-Leibler divergence between $q(\mathbf{z})$ and the posterior directly because it still depends on the nasty integral $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) \, p(\mathbf{z}) \, \mathrm{d}\mathbf{z}$. To see this dependency, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z} \mid \mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z} \mid \mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z}, \mathbf{x})}{p(\mathbf{x})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x})\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \int q(\mathbf{z}) \, \text{log } p(\mathbf{x}) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x})}_{\text{Nemesis}} \enspace ,
\end{aligned} %]]></script>
<p>where we have expanded the expectation to more clearly behold our nemesis. In doing so, we have seen that $\text{log } p(\mathbf{x})$ is actually a constant with respect to $q(\mathbf{z})$; this means that we can ignore it in our optimization problem. Moreover, minimizing a quantity means maximizing its negative, and so we maximize the following quantity:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x}) \right) \\[.5em]
&= -\left(\mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] + \underbrace{\text{log } p(\mathbf{x}) - \text{log } p(\mathbf{x})}_{\text{Nemesis perishes}}\right) \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \enspace .
\end{aligned} %]]></script>
<p>We can expand the joint probability to get more insight into this equation:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \underbrace{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z})\right]}_{\mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]} - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] + \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{p(\mathbf{z})}{q(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } \frac{q(\mathbf{z})}{p(\mathbf{z})}\right] \\[.5em]
&= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{x} \mid \mathbf{z})\right] - \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z})\right) \enspace .
\end{aligned} %]]></script>
<p>This is cool. It says that maximizing the ELBO finds an approximate distribution $q(\mathbf{z})$ for the latent quantities $\mathbf{z}$ that predicts the data well, i.e., leads to a high expected log likelihood, while incurring a penalty if $q(\mathbf{z})$ strays far from the prior $p(\mathbf{z})$. This mirrors the usual balance in Bayesian inference between likelihood and prior (Blei, Kucukelbir, &amp; McAuliffe, 2017).</p>
<p>ELBO stands for <em>evidence lower bound</em>. The marginal likelihood is sometimes called evidence, and we see that ELBO is indeed a lower bound for the evidence:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= -\left(\text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) - \text{log } p(\mathbf{x})\right) \\[.5em]
\text{log } p(\mathbf{x}) &= \text{ELBO}(q) + \text{KL}\left(q(\mathbf{z}) \, \lvert\lvert \, p(\mathbf{z} \mid \mathbf{x}) \right) \\[.5em]
\text{log } p(\mathbf{x}) &\geq \text{ELBO}(q) \enspace ,
\end{aligned} %]]></script>
<p>since the Kullback-Leibler divergence is non-negative. Heuristically, one might then use the ELBO as a way to select between models. For more on predictive model selection, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this</a> and <a href="https://fabiandablander.com/r/Bayes-Potter.html">this</a> blog post.</p>
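<p>To make the bound concrete, here is a minimal sketch for a toy model in which everything is available in closed form. The model ($z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$), the Gaussian form of $q$, and the function name <code>elbo_toy</code> are assumptions made for illustration only. In this model the evidence is $\mathcal{N}(x; 0, 2)$ and the posterior is $\mathcal{N}(x/2, 1/2)$.</p>

```r
# Toy model (assumed for illustration): z ~ N(0, 1), x | z ~ N(z, 1),
# with a Gaussian variational density q(z) = N(m, s2).
elbo_toy <- function(m, s2, x) {
  exp_loglik   <- -0.5 * log(2 * pi) - 0.5 * ((x - m)^2 + s2)  # E_q[log p(x | z)]
  exp_logprior <- -0.5 * log(2 * pi) - 0.5 * (m^2 + s2)        # E_q[log p(z)]
  entropy_q    <- 0.5 * log(2 * pi * exp(1) * s2)              # -E_q[log q(z)]
  exp_loglik + exp_logprior + entropy_q
}

x <- 1.3
log_evidence <- dnorm(x, mean = 0, sd = sqrt(2), log = TRUE)

elbo_rough <- elbo_toy(m = 0, s2 = 1, x = x)          # an arbitrary q
elbo_best  <- elbo_toy(m = x / 2, s2 = 0.5, x = x)    # q equal to the posterior

c(log_evidence = log_evidence, elbo_rough = elbo_rough, elbo_best = elbo_best)
```

<p>The arbitrary choice of $q$ gives a value strictly below the log evidence, and the gap closes when $q$ equals the posterior, since the Kullback-Leibler divergence is then zero.</p>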
<h1 id="why-variational">Why variational?</h1>
<p>Our optimization problem is about finding $q^\star(\mathbf{z})$ that best approximates the posterior distribution. This is in contrast to more familiar optimization problems such as maximum likelihood estimation where one wants to find, for example, the <em>single best value</em> that maximizes the log likelihood. For such a problem, one can use standard calculus (see for example <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">this</a> blog post). In our setting, we do not want to find a single best value but rather a <em>single best function</em>. To do this, we can use <em>variational calculus</em> from which variational inference derives its name (Bishop, 2006, p. 462).</p>
<p>A function takes an input value and returns an output value. We can define a <em>functional</em> which takes a whole function and returns an output value. The <em>entropy</em> of a probability distribution is a widely used functional:</p>
<script type="math/tex; mode=display">\text{H}[p] = -\int p(x) \, \text{log } p(x) \, \mathrm{d} x \enspace ,</script>
<p>which takes as input the probability distribution $p(x)$ and returns a single value, its entropy. In variational inference, we want to find the function that maximizes the ELBO, which is a functional.</p>
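<p>In R, a functional is simply a function that takes another function as its argument. The following sketch implements the entropy with the usual sign convention, $\text{H}[p] = -\int p(x) \, \text{log } p(x) \, \mathrm{d}x$; the name <code>entropy</code> and the use of numerical integration are illustrative choices.</p>

```r
# A functional: takes a density function p and returns a single number
entropy <- function(p, lower = -Inf, upper = Inf) {
  integrand <- function(x) {
    px <- p(x)
    # Treat 0 * log(0) as 0 where the density underflows
    ifelse(px > 0, -px * log(px), 0)
  }
  integrate(integrand, lower, upper)$value
}

h_normal <- entropy(dnorm)                          # standard normal
h_wide   <- entropy(function(x) dnorm(x, sd = 3))   # a wider normal
h_exact  <- 0.5 * log(2 * pi * exp(1))              # known value for N(0, 1)

c(h_normal = h_normal, h_exact = h_exact, h_wide = h_wide)
```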
<p>In order to make this optimization problem more manageable, we need to constrain the functions in some way. One could, for example, assume that $q(\mathbf{z})$ is a Gaussian distribution with parameter vector $\omega$. The ELBO then becomes a function of $\omega$, and we employ standard optimization methods to solve this problem. Instead of restricting the parametric form of the variational distribution $q(\mathbf{z})$, in the next section we use an independence assumption to manage the inference problem.</p>
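<p>The parametric route mentioned above can be sketched in a few lines. The toy model ($z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$) and the parameterization $\omega = (m, \text{log } s)$ of a Gaussian $q(z) = \mathcal{N}(m, s^2)$ are assumptions for illustration; they are not the model studied later in this post.</p>

```r
# Toy model (assumed for illustration): z ~ N(0, 1), x | z ~ N(z, 1),
# with variational family q(z) = N(m, s^2) and omega = (m, log s).
x <- 1.3

negative_elbo <- function(omega) {
  m  <- omega[1]
  s2 <- exp(2 * omega[2])  # optimize log s so that s stays positive
  # E_q[log p(x | z)] + E_q[log p(z)]
  exp_logjoint <- -log(2 * pi) - 0.5 * ((x - m)^2 + s2) - 0.5 * (m^2 + s2)
  # -E_q[log q(z)]
  entropy_q <- 0.5 * log(2 * pi * exp(1) * s2)
  -(exp_logjoint + entropy_q)
}

# optim() minimizes, so we hand it the negative ELBO
fit <- optim(par = c(0, 0), fn = negative_elbo, method = "BFGS")

q_mean <- fit$par[1]            # exact posterior mean is x / 2
q_var  <- exp(2 * fit$par[2])   # exact posterior variance is 1 / 2

c(q_mean = q_mean, q_var = q_var)
```

<p>Because the true posterior in this toy model is itself Gaussian, the optimized $q$ recovers it; for models with non-Gaussian posteriors the same procedure would only return the closest Gaussian in the Kullback-Leibler sense.</p>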
<h1 id="mean-field-variational-family">Mean-field variational family</h1>
<p>A frequently used approximation is to assume that the latent variables $z_j$ for $j \in \{1, \ldots, m\}$ are mutually independent, each governed by its own variational density:</p>
<script type="math/tex; mode=display">q(\mathbf{z}) = \prod_{j=1}^m q_j(z_j) \enspace .</script>
<p>Note that this <em>mean-field variational family</em> cannot model correlations in the posterior distribution; by construction, the latent parameters are mutually independent. Observe that we do not make any parametric assumption about the individual $q_j(z_j)$. Instead, their parametric form is derived for every particular inference problem.</p>
<p>We start from our definition of the ELBO and apply the mean-field assumption:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\mathbf{z})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathbb{E}_{q(\mathbf{z})}\left[\text{log } q(\mathbf{z}) \right] \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z}\enspace .
\end{aligned} %]]></script>
<p>In the following, we optimize the ELBO with respect to a single variational density $q_j(z_j)$ and assume that all others are fixed:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q_j) &= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int \prod_{i=1}^m q_i(z_i) \, \text{log}\prod_{i=1}^m q_i(z_i) \, \mathrm{d}\mathbf{z} \\[.5em]
&= \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j - \underbrace{\int \prod_{i\neq j}^m q_i(z_i) \, \text{log} \prod_{i\neq j}^m q_i(z_i) \, \mathrm{d}\mathbf{z}_{-j}}_{\text{Constant with respect to } q_j(z_j)} \\[.5em]
&\propto \int \prod_{i=1}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z} - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \left(\int \prod_{i\neq j}^m q_i(z_i) \, \text{log } p(\mathbf{z}, \mathbf{x}) \, \mathrm{d}\mathbf{z}_{-j}\right)\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right]\mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \enspace .
\end{aligned} %]]></script>
<p>One could use variational calculus to derive the optimal variational density $q_j^\star(z_j)$; instead, we follow Bishop (2006, p. 465) and define the distribution</p>
<script type="math/tex; mode=display">\text{log } \tilde{p}{(\mathbf{x}, z_j)} = \mathbb{E}_{q(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{z}, \mathbf{x})\right] - \mathcal{Z} \enspace ,</script>
<p>where we need to make sure that it integrates to one by subtracting the (log) normalizing constant $\mathcal{Z}$. With this in mind, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q_j) &\propto \int q_j(z_j) \, \text{log } \tilde{p}{(\mathbf{x}, z_j)} \, \mathrm{d}z_j - \int q_j(z_j) \, \text{log } q_j(z_j) \, \mathrm{d}z_j \\[.5em]
&= \int q_j(z_j) \, \text{log } \frac{\tilde{p}{(\mathbf{x}, z_j)}}{q_j(z_j)} \, \mathrm{d}z_j \\[.5em]
&= -\int q_j(z_j) \, \text{log } \frac{q_j(z_j)}{\tilde{p}{(\mathbf{x}, z_j)}} \, \mathrm{d}z_j \\[.5em]
&= -\text{KL}\left(q_j(z_j) \, \lvert\lvert \, \tilde{p}(\mathbf{x}, z_j)\right) \enspace .
\end{aligned} %]]></script>
<p>Thus, maximizing the ELBO with respect to $q_j(z_j)$ amounts to minimizing the Kullback-Leibler divergence between $q_j(z_j)$ and $\tilde{p}(\mathbf{x}, z_j)$; this divergence is zero when the two distributions are equal. Therefore, under the mean-field assumption, the optimal variational density $q_j^\star(z_j)$ is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q_j^\star(z_j) &= \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right] - \mathcal{Z}\right) \\[.5em]
&= \frac{\text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right)}{\int \text{exp}\left(\mathbb{E}_{q_{-j}(\mathbf{z}_{-j})}\left[\text{log } p(\mathbf{x}, \mathbf{z}) \right]\right) \mathrm{d}z_j} \enspace ,
\end{aligned} %]]></script>
<p>see also Bishop (2006, p. 466). This is not an explicit solution, however, since each optimal variational density depends on all others. This calls for an iterative solution in which we first initialize all factors $q_j(z_j)$ and then cycle through them, updating each one conditional on the current values of the others. Such a procedure is known as <em>Coordinate Ascent Variational Inference</em> (CAVI). Further, note that</p>
<script type="math/tex; mode=display">p(z_j \mid \mathbf{z}_{-j}, \mathbf{x}) = \frac{p(z_j, \mathbf{z}_{-j}, \mathbf{x})}{p(\mathbf{z}_{-j}, \mathbf{x})} \propto p(z_j, \mathbf{z}_{-j}, \mathbf{x}) \enspace ,</script>
<p>which allows us to write the updates in terms of the conditional posterior distribution of $z_j$ given all other factors $\mathbf{z}_{-j}$. This looks <em>a lot</em> like Gibbs sampling, which we discussed in detail in a <a href="https://fabiandablander.com/r/Spike-and-Slab.html">previous</a> blog post. In the next section, we implement CAVI for a simple linear regression problem.</p>
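<p>Before turning to regression, the cycling scheme can be illustrated on a target for which the updates are short. The sketch below assumes, purely for illustration, that the "posterior" is a correlated bivariate Gaussian with mean <code>mu</code> and precision matrix <code>Lambda</code>. Applying the formula for $q_j^\star(z_j)$ to this target yields Gaussian factors with variance $1 / \Lambda_{jj}$ and a mean that depends on the current mean of the other factor.</p>

```r
# Target (assumed for illustration): a correlated bivariate Gaussian
mu     <- c(1, -1)
Sigma  <- matrix(c(1, 0.8, 0.8, 1), nrow = 2)
Lambda <- solve(Sigma)  # precision matrix

m <- c(0, 0)  # initial means of q_1 and q_2
for (iter in 1:100) {
  m_old <- m
  # Update q_1 given the current q_2, then q_2 given the updated q_1
  m[1] <- mu[1] - Lambda[1, 2] / Lambda[1, 1] * (m[2] - mu[2])
  m[2] <- mu[2] - Lambda[2, 1] / Lambda[2, 2] * (m[1] - mu[1])
  if (max(abs(m - m_old)) < 1e-10) break
}

q_var <- 1 / diag(Lambda)  # variances of the optimal factors

rbind(q_mean = m, q_var = q_var, true_var = diag(Sigma))
```

<p>The means converge to the true means, but the factor variances are smaller than the true marginal variances. This is the price of the mean-field assumption noted above: the approximation cannot represent the correlation, and it compensates by being too narrow.</p>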
<h1 id="application-linear-regression">Application: Linear regression</h1>
<p>In a <a href="https://fabiandablander.com/r/Curve-Fitting-Gaussian.html">previous</a> blog post, we traced the history of least squares and applied it to the most basic problem: fitting a straight line to a number of points. Here, we study the same problem but swap optimization procedure: instead of least squares or maximum likelihood, we use variational inference. Our linear regression setup is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
y &\sim \mathcal{N}(\beta x, \sigma^2) \\[.5em]
\beta &\sim \mathcal{N}(0, \sigma^2 \tau^2) \\[.5em]
p(\sigma^2) &\propto \frac{1}{\sigma^2} \enspace ,
\end{aligned} %]]></script>
<p>where we assume that the population mean of $y$ is zero (i.e., $\beta_0 = 0$); and we assign the error variance $\sigma^2$ an improper Jeffreys’ prior and $\beta$ a Gaussian prior with variance $\sigma^2\tau^2$. We scale the prior of $\beta$ by the error variance to reason in terms of a standardized effect size $\beta / \sigma$ since with this specification:</p>
<script type="math/tex; mode=display">\text{Var}\left[\frac{\beta}{\sigma}\right] = \frac{1}{\sigma^2} \text{Var}[\beta] = \frac{\sigma^2 \tau^2}{\sigma^2} = \tau^2 \enspace .</script>
<p>As a heads up, we have to do a surprising amount of calculations to implement variational inference even for this simple problem. In the next section, we start our journey by deriving the variational density for $\sigma^2$.</p>
<h2 id="variational-density-for-sigma2">Variational density for $\sigma^2$</h2>
<p>Our optimal variational density $q^\star(\sigma^2)$ is given by:</p>
<script type="math/tex; mode=display">q^\star(\sigma^2) \propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \enspace .</script>
<p>To get started, we need to derive the conditional posterior distribution $p(\sigma^2 \mid \mathbf{y}, \beta)$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\sigma^2 \mid \mathbf{y}, \beta) &\propto p(\mathbf{y} \mid \sigma^2, \beta) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= \prod_{i=1}^n (2\pi)^{-\frac{1}{2}} \left(\sigma^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2} \left(y_i - \beta x_i\right)^2\right) \underbrace{(2\pi)^{-\frac{1}{2}} \left(\sigma^2\tau^2\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}_{p(\beta)} \underbrace{\left(\sigma^2\right)^{-1}}_{p(\sigma^2)} \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{2\sigma^2} \underbrace{\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)}_{A}\right) \enspace ,
\end{aligned} %]]></script>
<p>which is proportional to an inverse Gamma distribution. Moving on, we exploit the linearity of the expectation and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q^\star(\sigma^2) &\propto \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } p(\sigma^2 \mid \mathbf{y}, \beta) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\beta)}\left[\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1}\right] - \mathbb{E}_{q(\beta)}\left[\frac{1}{2\sigma^2}A \right]\right) \\[.5em]
&= \text{exp}\left(\text{log } \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} - \frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \\[.5em]
&= \left(\sigma^2\right)^{-\frac{n+1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2}\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]\right) \enspace .
\end{aligned} %]]></script>
<p>This, too, looks like an inverse Gamma distribution! Plugging in the normalizing constant, we arrive at:</p>
<script type="math/tex; mode=display">q^\star(\sigma^2)= \frac{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \underbrace{\mathbb{E}_{q(\beta)}\left[\frac{1}{2}A \right]}_{\nu}\right) \enspace .</script>
<p>Note that this density depends on $q(\beta)$ through the expectation $\nu$. In the next section, we derive the variational density for $\beta$.</p>
<h2 id="variational-density-for-beta">Variational density for $\beta$</h2>
<p>Our optimal variational density $q^\star(\beta)$ is given by:</p>
<script type="math/tex; mode=display">q^\star(\beta) \propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \enspace ,</script>
<p>and so we again have to derive the conditional posterior distribution $p(\beta \mid \mathbf{y}, \sigma^2)$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\beta \mid \mathbf{y}, \sigma^2) &\propto p(\mathbf{y} \mid \beta, \sigma^2) \, p(\beta) \, p(\sigma^2) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&= (2\pi)^{-\frac{n + 1}{2}} \left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \left(\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \left(\sum_{i=1}^ny_i^2- 2 \beta \sum_{i=1}^n y_i x_i + \beta^2 \sum_{i=1}^n x_i^2 + \frac{\beta^2}{\tau^2}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{1}{2\sigma^2} \left( \beta^2 \left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) - 2 \beta \sum_{i=1}^n y_i x_i\right)\right) \\[.5em]
&=\text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta^2 - 2 \beta \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)\right) \\[.5em]
&\propto \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned} %]]></script>
<p>where we have “completed the square” (see also <a href="https://fabiandablander.com/statistics/Two-Properties.html">this</a> blog post) and realized that the conditional posterior is Gaussian. We continue by taking expectations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
q^\star(\beta) &\propto \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\beta \mid \mathbf{y}, \sigma^2) \right]\right) \\[.5em]
&= \text{exp}\left(\mathbb{E}_{q(\sigma^2)}\left[-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2\sigma^2} \left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right]\right) \\[.5em]
&= \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left( \beta - \frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}\right)^2\right) \enspace ,
\end{aligned} %]]></script>
<p>which is again proportional to a Gaussian distribution! Plugging in the normalizing constant yields:</p>
<script type="math/tex; mode=display">q^\star(\beta) = \left(2\pi\underbrace{\frac{\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}}_{\sigma^2_{\beta}}\right)^{-\frac{1}{2}} \text{exp}\left(-\frac{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}{2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right]\left(\beta - \underbrace{\frac{\sum_{i=1}^n y_i x_i}{\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)}}_{\mu_{\beta}}\right)^2\right) \enspace .</script>
<p>Note that while the variance of this distribution, $\sigma^2_\beta$, depends on $q(\sigma^2)$, its mean $\mu_\beta$ does not.</p>
<p>To recap, instead of assuming a parametric form for the variational densities, we have derived the optimal densities under the mean-field assumption, that is, under the assumption that the parameters are independent: $q(\beta, \sigma^2) = q(\beta) \, q(\sigma^2)$. Assigning $\beta$ a Gaussian prior and $\sigma^2$ Jeffreys’ prior, we have found that the variational density for $\sigma^2$ is an inverse Gamma distribution and that the variational density for $\beta$ is a Gaussian distribution. We noted that these variational densities depend on each other. However, this is not the end of the manipulation of symbols; both distributions still feature an expectation we need to remove. In the next section, we expand the remaining expectations.</p>
<h2 id="removing-expectations">Removing expectations</h2>
<p>Now that we know the parametric form of both variational densities, we can expand the terms that involve an expectation. In particular, to remove the expectation in the variational density for $\sigma^2$, we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta)}\left[A \right] &= \mathbb{E}_{q(\beta)}\left[\left(\sum_{i=1}^n\left(y_i - \beta x_i\right)^2 + \frac{\beta^2}{\tau^2}\right)\right] \\[.5em]
&= \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mathbb{E}_{q(\beta)}\left[\beta\right] + \sum_{i=1}^n x_i^2 \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] + \frac{1}{\tau^2} \, \mathbb{E}_{q(\beta)}\left[\beta^2\right] \enspace .
\end{aligned} %]]></script>
<p>Noting that $\mathbb{E}_{q(\beta)}[\beta] = \mu_{\beta}$ and using the fact that:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}[\beta^2] = \text{Var}_{q(\beta)}\left[\beta\right] + \mathbb{E}_{q(\beta)}[\beta]^2
= \sigma^2_{\beta} + \mu_{\beta}^2 \enspace ,</script>
<p>the expectation becomes:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}\left[A\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right) \enspace .</script>
<p>For the expectation which features in the variational distribution for $\beta$, things are slightly less elaborate, although the result also looks unwieldy. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] &= \int \frac{1}{\sigma^2}\frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)}\left(\sigma^2\right)^{-\frac{n + 1}{2} - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2\\[0.50em]
&= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \int \left(\sigma^2\right)^{-\left(\frac{n + 1}{2} + 1\right) - 1} \text{exp}\left(-\frac{1}{\sigma^2} \nu\right) \mathrm{d}\sigma^2 \\[0.50em]
&= \frac{\nu^{\frac{n + 1}{2}}}{\Gamma\left(\frac{n + 1}{2}\right)} \frac{\Gamma\left(\frac{n + 1}{2} + 1\right)}{\nu^{\frac{n + 1}{2} + 1}} \\[0.50em]
&= \frac{n + 1}{2} \left(\frac{1}{2}\mathbb{E}_{q(\beta)}\left[A \right]\right)^{-1} \\[.5em]
&= \frac{n + 1}{2} \left(\frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right)\right)^{-1} \enspace .
\end{aligned} %]]></script>
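<p>With both expectations expanded, the two updates can be iterated directly. The following is a minimal sketch on simulated data; the data-generating values, the starting value, and the variable names are assumptions for illustration. As a simplification, it stops when $\sigma^2_\beta$ no longer changes rather than monitoring the ELBO.</p>

```r
set.seed(1)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)  # simulated data with beta = 0.5 and sigma = 1
tau2 <- 1

sum_y2 <- sum(y^2)
sum_x2 <- sum(x^2)
sum_yx <- sum(x * y)
precision_term <- sum_x2 + 1 / tau2

beta_mu  <- sum_yx / precision_term  # mu_beta does not depend on q(sigma^2)
beta_var <- 1                        # arbitrary starting value for sigma^2_beta

for (iter in 1:1000) {
  # nu = 0.5 * E_q(beta)[A]
  nu <- 0.5 * (sum_y2 - 2 * sum_yx * beta_mu +
               (beta_var + beta_mu^2) * precision_term)
  # E_q(sigma^2)[1 / sigma^2] = ((n + 1) / 2) / nu
  exp_inv_sigma2 <- ((n + 1) / 2) / nu
  # sigma^2_beta = 1 / (E[1 / sigma^2] * (sum x^2 + 1 / tau^2))
  beta_var_new <- 1 / (exp_inv_sigma2 * precision_term)
  converged <- abs(beta_var_new - beta_var) < 1e-12
  beta_var <- beta_var_new
  if (converged) break
}

c(beta_mu = beta_mu, beta_sd = sqrt(beta_var), nu = nu, iterations = iter)
```

<p>Because $\mu_\beta$ is fixed, only $\nu$ and $\sigma^2_\beta$ need to be cycled, and the loop settles after a handful of iterations.</p>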
<h2 id="monitoring-convergence">Monitoring convergence</h2>
<p>The algorithm works by first specifying initial values for the parameters of the variational densities and then iteratively updating them until the ELBO does not change anymore. This requires us to compute the ELBO, which we still need to derive, on each update. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{ELBO}(q) &= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y}, \beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2) \right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\beta, \sigma^2)\right] - \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } q(\beta, \sigma^2)\right] \\[.5em]
&= \mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] + \underbrace{\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right]}_{-\text{KL}\left(q(\beta, \sigma^2) \, \lvert\lvert \, p(\beta, \sigma^2)\right)}\enspace .
\end{aligned} %]]></script>
<p>Let’s take a deep breath and tackle the second term first:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } \frac{p(\beta, \sigma^2)}{q(\beta, \sigma^2)}\right] &= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{p(\beta \mid \sigma^2)}{q(\beta)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\text{log } \frac{\left(2\pi\sigma^2\tau^2\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2\tau^2} \beta^2\right)}{\left(2\pi\sigma^2_\beta\right)^{-\frac{1}{2}}\text{exp}\left(-\frac{1}{2\sigma^2_\beta} (\beta - \mu_\beta)^2\right)}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\mathbb{E}_{q(\beta)}\left[\frac{1}{2}\text{log } \frac{\sigma^2_\beta}{\sigma^2\tau^2} - \frac{\beta^2}{2\sigma^2\tau^2} + \frac{(\beta - \mu_\beta)^2}{2\sigma^2_\beta}\right] + \text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \mathbb{E}_{q(\sigma^2)}\left[\frac{1}{2}\text{log}\frac{\sigma^2_\beta}{\sigma^2\tau^2} - \frac{\sigma^2_\beta + \mu_\beta^2}{2\sigma^2\tau^2} + \frac{1}{2}\right] + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \frac{1}{2}\text{log}\frac{\sigma^2_\beta}{\tau^2} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } \frac{p(\sigma^2)}{q(\sigma^2)}\right] \\[.5em]
&= \frac{1}{2}\text{log}\frac{\sigma^2_\beta}{\tau^2} - \frac{1}{2}\mathbb{E}_{q(\sigma^2)}\left[\text{log }\sigma^2\right] - \frac{\sigma^2_\beta + \mu_\beta^2}{2\tau^2}\mathbb{E}_{q(\sigma^2)}\left[\frac{1}{\sigma^2}\right] + \frac{1}{2} + \mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right] - \mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]\enspace .
\end{aligned} %]]></script>
<p>Note that there are three expectations left. However, we really deserve a break, and so instead of analytically deriving the expectations we compute $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$ and $\mathbb{E}_{q(\sigma^2)}\left[\text{log } p(\sigma^2)\right]$ numerically using Gaussian quadrature. This fails for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$, which we compute using Monte Carlo integration:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right] = \int q(\sigma^2) \, \text{log } q(\sigma^2) \, \mathrm{d}\sigma^2 \approx \frac{1}{N} \sum_{i=1}^N \text{log } q(\sigma^2_i) \enspace , \quad \sigma^2_i \sim q(\sigma^2) \enspace .</script>
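<p>A minimal sketch of this Monte Carlo estimate is given below. The values of the shape and of $\nu$ are assumptions for illustration, and the draws are generated as reciprocals of Gamma draws with base R's <code>rgamma</code> rather than with a dedicated inverse Gamma sampler. The closed-form negative entropy of the inverse Gamma distribution is included only to check the estimate.</p>

```r
set.seed(2020)
# Assumed values for illustration: shape a = (n + 1) / 2 with n = 100, and nu = 60
a  <- (100 + 1) / 2
nu <- 60

# Log density of the inverse Gamma distribution with shape a and scale nu
log_q <- function(sigma2) {
  a * log(nu) - lgamma(a) - (a + 1) * log(sigma2) - nu / sigma2
}

# If g ~ Gamma(shape = a, rate = nu), then 1 / g follows the inverse Gamma above
N <- 1e5
sigma2_draws <- 1 / rgamma(N, shape = a, rate = nu)
mc_estimate  <- mean(log_q(sigma2_draws))

# Closed-form value (negative entropy of the inverse Gamma), used as a check
exact <- -(a + log(nu) + lgamma(a) - (1 + a) * digamma(a))

c(mc_estimate = mc_estimate, exact = exact)
```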
<p>We are left with the expected log likelihood. Instead of filling this blog post with more equations, we again resort to numerical methods. However, we refactor the expression so that numerical integration is more efficient:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbb{E}_{q(\beta, \sigma^2)}\left[\text{log } p(\mathbf{y} \mid \beta, \sigma^2)\right] &= \int \int q(\beta) \, q(\sigma^2) \, \text{log } p(\mathbf{y} \mid \beta, \sigma^2) \, \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em]
&=\int q(\beta) \int q(\sigma^2) \, \text{log} \left(\left(2\pi\sigma^2\right)^{-\frac{n}{2}}\text{exp}\left(-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i\beta)^2\right)\right) \, \mathrm{d}\sigma^2 \mathrm{d}\beta \\[.5em]
&= -\frac{n}{2} \text{log}\left(2\pi\right) - \frac{n}{2} \int q(\sigma^2) \, \text{log} \left(\sigma^2\right) \mathrm{d}\sigma^2 - \frac{1}{2} \int q(\sigma^2) \, \frac{1}{\sigma^2} \, \mathrm{d}\sigma^2 \int q(\beta) \left(\sum_{i=1}^n (y_i - x_i\beta)^2\right) \mathrm{d}\beta \enspace .
\end{aligned} %]]></script>
<p>Since we have solved a similar problem already above, we evaluate the expectation with respect to $q(\beta)$ analytically:</p>
<script type="math/tex; mode=display">\mathbb{E}_{q(\beta)}\left[\sum_{i=1}^n (y_i - x_i\beta)^2\right] = \sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2\right) \enspace .</script>
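<p>As a sanity check on the expected log likelihood, the sketch below combines the analytic expectation with respect to $q(\beta)$, the analytic $\mathbb{E}_{q(\sigma^2)}\left[1 / \sigma^2\right]$, and a numerically integrated $\mathbb{E}_{q(\sigma^2)}\left[\text{log } \sigma^2\right]$, using that the Gaussian log likelihood is $-\frac{n}{2}\text{log}(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i\beta)^2$. It then compares the result against a brute-force Monte Carlo average over draws from $q(\beta) \, q(\sigma^2)$. The data and the variational parameters are assumptions chosen only for illustration.</p>

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)  # simulated data

# Assumed variational parameters, chosen only for illustration
beta_mu  <- 0.45
beta_var <- 0.02
a  <- (n + 1) / 2   # shape of the inverse Gamma q(sigma^2)
nu <- 25            # scale of the inverse Gamma q(sigma^2)

# E[log sigma^2] by numerical integration over the inverse Gamma density
q_sigma2 <- function(s) {
  exp(a * log(nu) - lgamma(a) - (a + 1) * log(s) - nu / s)
}
exp_log_sigma2 <- integrate(function(s) q_sigma2(s) * log(s), 0, Inf)$value

exp_inv_sigma2 <- a / nu  # E[1 / sigma^2], available analytically
exp_sq_resid <- sum(y^2) - 2 * sum(x * y) * beta_mu +
  (beta_var + beta_mu^2) * sum(x^2)

exp_loglik <- -n / 2 * log(2 * pi) - n / 2 * exp_log_sigma2 -
  0.5 * exp_inv_sigma2 * exp_sq_resid

# Brute-force check: average the log likelihood over draws from q
N <- 2e4
beta_draws   <- rnorm(N, beta_mu, sqrt(beta_var))
sigma2_draws <- 1 / rgamma(N, shape = a, rate = nu)
mc_loglik <- mean(sapply(seq_len(N), function(i) {
  sum(dnorm(y, beta_draws[i] * x, sqrt(sigma2_draws[i]), log = TRUE))
}))

c(exp_loglik = exp_loglik, mc_loglik = mc_loglik)
```

<p>The two numbers should agree up to Monte Carlo error; the first requires only a single one-dimensional numerical integral.</p>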
<p>In the next section, we implement the algorithm for our linear regression problem in R.</p>
<h1 id="implementation-in-r">Implementation in R</h1>
<p>Now that we have derived the optimal densities, we know how they are parameterized. Therefore, the ELBO is a function of these variational parameters and the parameters of the priors, which in our case is just $\tau^2$. We write a function that computes the ELBO:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'MCMCpack'</span><span class="p">)</span><span class="w">
</span><span class="cd">#' Computes the ELBO for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
</span><span class="cd">#' @param beta_mu mean of the variational density for \beta</span><span class="w">
</span><span class="cd">#' @param beta_sd standard deviation of the variational density for \beta</span><span class="w">
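</span><span class="cd">#' @param nu scale parameter of the variational density for \sigma^2</span><span class="w">
</span><span class="cd">#' @param tau2 prior variance scale \tau^2 of the standardized effect size</span><span class="w">
</span><span class="cd">#' @param nr_samples number of samples for the Monte Carlo integration</span><span class="w">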
</span><span class="cd">#' @returns ELBO</span><span class="w">
</span><span class="n">compute_elbo</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e4</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\beta)</span><span class="w">
</span><span class="n">E_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="o">-</span><span class="kc">Inf</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Takes a function and computes its expectation with respect to q(\sigma^2)</span><span class="w">
</span><span class="n">E_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">fn</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">fn</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">},</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="kc">Inf</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\sigma^2)</span><span class="w">
</span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="c1"># Compute expectations of log p(\beta \mid \sigma^2)</span><span class="w">
</span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="w">
</span><span class="nf">log</span><span class="p">(</span><span class="n">tau2</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w">
</span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">tau2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="c1"># Compute expectations of the log variational densities q(\beta)</span><span class="w">
</span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_beta</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">beta</span><span class="p">)</span><span class="w"> </span><span class="n">dnorm</span><span class="p">(</span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">log</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">TRUE</span><span class="p">))</span><span class="w">
</span><span class="c1"># Compute the expectation of the log variational density q(\sigma^2) by Monte Carlo,</span><span class="w">
</span><span class="c1"># because numerical integration fails here:</span><span class="w">
</span><span class="c1"># E_log_q_sigma2 <- E_q_sigma2(function(x) log(dinvgamma(x, (n + 1)/2, nu)))</span><span class="w">
</span><span class="n">sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rinvgamma</span><span class="p">(</span><span class="n">nr_samples</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_q_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">log</span><span class="p">(</span><span class="n">dinvgamma</span><span class="p">(</span><span class="n">sigma2</span><span class="p">,</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">/</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">)))</span><span class="w">
</span><span class="c1"># Compute the expected log likelihood</span><span class="w">
</span><span class="n">E_log_y_b</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="n">sum_x2</span><span class="w">
</span><span class="n">E_log_y_sigma2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_q_sigma2</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">sigma</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">E_log_y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">n</span><span class="o">/</span><span class="m">4</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">log</span><span class="p">(</span><span class="m">2</span><span class="o">*</span><span class="nb">pi</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_y_b</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_log_y_sigma2</span><span class="w">
</span><span class="c1"># Compute and return the ELBO</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E_log_y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_beta</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">E_log_p_sigma2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_beta</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">E_log_q_sigma2</span><span class="w">
</span><span class="n">ELBO</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
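<p>The Monte Carlo step for $\mathbb{E}_{q(\sigma^2)}\left[\text{log } q(\sigma^2)\right]$ can be cross-checked against closed-form results for the inverse-gamma distribution. The sketch below is not part of the implementation above. It assumes the shape-scale parameterization with shape $\frac{n + 1}{2}$ and scale $\nu$ and uses only base R; the helper names <code>invgamma_moments</code> and <code>log_dinvgamma</code> and the example values are hypothetical:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Closed-form expectations under an inverse-gamma density with the given
# shape and scale (hypothetical helper, base R only)
invgamma_moments <- function(shape, scale) {
  list(
    E_inv   = shape / scale,                # E[1 / sigma^2]
    E_log   = log(scale) - digamma(shape),  # E[log sigma^2]
    # E[log q(sigma^2)], i.e. the negative entropy of the inverse gamma
    E_log_q = -(shape + log(scale) + lgamma(shape) - (1 + shape) * digamma(shape))
  )
}

# Log density of the inverse gamma, written out so that no package is needed
log_dinvgamma <- function(x, shape, scale) {
  shape * log(scale) - lgamma(shape) - (shape + 1) * log(x) - scale / x
}

set.seed(1)
shape <- (100 + 1) / 2  # (n + 1) / 2 with n = 100, chosen for illustration
scale <- 46             # an arbitrary illustrative value for nu

# If G ~ Gamma(shape, rate = scale), then 1 / G is inverse gamma(shape, scale)
sigma2 <- 1 / rgamma(1e5, shape = shape, rate = scale)

exact <- invgamma_moments(shape, scale)
mc <- c(
  E_inv   = mean(1 / sigma2),
  E_log   = mean(log(sigma2)),
  E_log_q = mean(log_dinvgamma(sigma2, shape, scale))
)

# The two rows should agree up to Monte Carlo error
round(rbind(exact = unlist(exact), monte_carlo = mc), 3)</code></pre></figure>
<p>Under these assumptions, the expectations of $1 / \sigma^2$ and $\text{log } \sigma^2$ could also be obtained without numerical integration.</p>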
<p>We now implement coordinate ascent mean-field variational inference (CAVI) for our simple linear regression problem. Recall that the variational parameters are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\nu &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2- 2 \sum_{i=1}^n y_i x_i \, \mu_{\beta} + \left(\sigma^2_{\beta} + \mu_{\beta}^2\right)\left(\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}\right)\right) \\[.5em]
\mu_\beta &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\sigma^2_\beta &= \frac{\left(\frac{n + 1}{2}\right) \nu^{-1}}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \enspace .
\end{aligned} %]]></script>
<p>The following function implements the iterative updating of these variational parameters until the ELBO has converged.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="cd">#' Implements CAVI for the linear regression example</span><span class="w">
</span><span class="cd">#' </span><span class="w">
</span><span class="cd">#' @param y univariate outcome variable</span><span class="w">
</span><span class="cd">#' @param x univariate predictor variable</span><span class="w">
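</span><span class="cd">#' @param tau2 prior variance for the standardized effect size</span><span class="w">
</span><span class="cd">#' @param nr_samples number of Monte Carlo samples passed on to compute_elbo</span><span class="w">
</span><span class="cd">#' @param epsilon convergence threshold for the absolute change in the ELBO</span><span class="w">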
</span><span class="cd">#' @returns parameters for the variational densities and ELBO</span><span class="w">
</span><span class="n">lmcavi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e5</span><span class="p">,</span><span class="w"> </span><span class="n">epsilon</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1e-2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="n">sum_y2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">y</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_x2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">sum_yx</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">x</span><span class="o">*</span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="c1"># beta_mu does not depend on the other variational parameters, so it is computed once and not updated</span><span class="w">
</span><span class="n">beta_mu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_yx</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">5</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_mu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">beta_mu</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">has_converged</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="nf">abs</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="n">epsilon</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">5</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="c1"># while the ELBO has not converged</span><span class="w">
</span><span class="k">while</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="n">has_converged</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]][</span><span class="n">j</span><span class="p">],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">nu_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="n">beta_sd_prev</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]][</span><span class="n">j</span><span class="p">]</span><span class="w">
</span><span class="c1"># used in the update of beta_sd and nu</span><span class="w">
</span><span class="n">E_qA</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sum_y2</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="m">2</span><span class="o">*</span><span class="n">sum_yx</span><span class="o">*</span><span class="n">beta_mu</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="p">(</span><span class="n">beta_sd_prev</span><span class="o">^</span><span class="m">2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta_mu</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">)</span><span class="w">
</span><span class="c1"># update the variational parameters for sigma2 and beta</span><span class="w">
</span><span class="n">nu</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">E_qA</span><span class="w">
</span><span class="n">beta_sd</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(((</span><span class="n">n</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">E_qA</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="p">(</span><span class="n">sum_x2</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="o">/</span><span class="n">tau2</span><span class="p">))</span><span class="w">
</span><span class="c1"># update results object</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'nu'</span><span class="p">]],</span><span class="w"> </span><span class="n">nu</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'beta_sd'</span><span class="p">]],</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">res</span><span class="p">[[</span><span class="s1">'ELBO'</span><span class="p">]],</span><span class="w"> </span><span class="n">ELBO</span><span class="p">)</span><span class="w">
</span><span class="c1"># compute new ELBO</span><span class="w">
</span><span class="n">j</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">j</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="n">ELBO</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">compute_elbo</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">beta_mu</span><span class="p">,</span><span class="w"> </span><span class="n">beta_sd</span><span class="p">,</span><span class="w"> </span><span class="n">nu</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="p">,</span><span class="w"> </span><span class="n">nr_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_samples</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">res</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Let’s run this on a simulated data set of size $n = 100$ with a true coefficient of $\beta = 0.30$ and a true error variance of $\sigma^2 = 1$. We assign $\beta$ a Gaussian prior with variance $\tau^2 = 0.25$ so that values of $\lvert \beta \rvert$ within one standard deviation ($0.50$) <a href="https://en.wikipedia.org/wiki/68%E2%80%9395%E2%80%9399.7_rule">receive about $0.68$</a> prior probability.</p>
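<p>The prior mass quoted above can be checked with the normal distribution function. A minimal sketch, assuming $\sigma = 1$ so that the prior standard deviation of $\beta$ equals $\tau = 0.50$ (the variable names are illustrative):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">tau <- sqrt(0.25)  # prior standard deviation of beta when sigma = 1

# Prior probability that |beta| lies within one standard deviation of zero
p_within <- pnorm(tau, mean = 0, sd = tau) - pnorm(-tau, mean = 0, sd = tau)

# Prior probability that |beta| exceeds one standard deviation
p_beyond <- 1 - p_within

round(c(within = p_within, beyond = p_beyond), 3)  # 0.683 and 0.317</code></pre></figure>
<p>The data are then simulated and passed to <code>lmcavi</code>:</p>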
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">gen_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">beta</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">0</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">beta</span><span class="o">*</span><span class="n">x</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">gen_dat</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">lmcavi</span><span class="p">(</span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">tau2</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">mc</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## $nu
## [1] 5.00000 88.17995 45.93875 46.20205 46.19892 46.19895
##
## $beta_mu
## [1] 0.2800556
##
## $beta_sd
## [1] 1.00000000 0.08205605 0.11368572 0.11336132 0.11336517 0.11336512
##
## $ELBO
## [1] 0.0000 -297980.0495 493.4807 -281.4578 -265.1289 -265.3197</code></pre></figure>
<p>From the output, we see that the ELBO and the variational parameters have converged. In the next section, we compare these results to results obtained with Stan.</p>
<h2 id="comparison-with-stan">Comparison with Stan</h2>
<p>Whenever one goes down a rabbit hole of calculations, it is good to sanity check one’s results. Here, we use Stan’s variational inference scheme to check whether our results are comparable. It assumes a Gaussian variational density for each parameter after transforming them to the real line and automates inference in a “black-box” way so that no problem-specific calculations are required (see Kucukelbir, Ranganath, Gelman, & Blei, 2015). Subsequently, we compare our results to the exact posteriors obtained with Markov chain Monte Carlo. The simple linear regression model in Stan is:</p>
<figure class="highlight"><pre><code class="language-stan" data-lang="stan">data {
int<lower=0> n;
vector[n] y;
vector[n] x;
real tau;
}
parameters {
real b;
real<lower=0> sigma;
}
model {
target += -log(sigma);
target += normal_lpdf(b | 0, sigma*tau);
target += normal_lpdf(y | b*x, sigma);
}</code></pre></figure>
<p>We use Stan’s black-box variational inference scheme:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'rstan'</span><span class="p">)</span><span class="w">
</span><span class="c1"># save the above model to a file and compile it</span><span class="w">
</span><span class="n">model</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">stan_model</span><span class="p">(</span><span class="n">file</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'stan-compiled/variational-regression.stan'</span><span class="p">)</span><span class="w">
</span><span class="n">stan_dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'n'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">dat</span><span class="p">),</span><span class="w"> </span><span class="s1">'x'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="s1">'y'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">dat</span><span class="o">$</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="s1">'tau'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">)</span><span class="w">
</span><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">vb</span><span class="p">(</span><span class="w">
</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">output_samples</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20000</span><span class="p">,</span><span class="w"> </span><span class="n">adapt_iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10000</span><span class="p">,</span><span class="w">
</span><span class="n">init</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">list</span><span class="p">(</span><span class="s1">'b'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.30</span><span class="p">,</span><span class="w"> </span><span class="s1">'sigma'</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>This gives estimates similar to ours:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Inference for Stan model: variational-regression.
## 1 chains, each with iter=20000; warmup=0; thin=1;
## post-warmup draws per chain=20000, total post-warmup draws=20000.
##
## mean sd 2.5% 25% 50% 75% 97.5%
## b 0.28 0.13 0.02 0.19 0.28 0.37 0.54
## sigma 0.99 0.09 0.82 0.92 0.99 1.05 1.18
## lp__ 0.00 0.00 0.00 0.00 0.00 0.00 0.00
##
## Approximate samples were drawn using VB(meanfield) at Thu Mar 19 10:45:28 2020.</code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## We recommend genuine 'sampling' from the posterior distribution for final inferences!</code></pre></figure>
<p>Their recommendation is prudent. If you run the code with different seeds, you can get quite different results. For example, the posterior mean of $\beta$ can range from $0.12$ to $0.45$, and the posterior standard deviation can be as low as $0.03$; in all these settings, Stan indicates that the ELBO has converged, but it seems to have converged to a different local optimum on each run. (For seed = 3, Stan gives completely nonsensical results.) Stan warns that the algorithm is experimental and may be unstable, and it is probably wise not to use it in production.</p>
<p><em>Update</em>: As Ben Goodrich points out in the comments, there is some cool work on providing diagnostics for variational inference; see <a href="https://statmodeling.stat.columbia.edu/2018/06/27/yes-work-evaluating-variational-inference/">this</a> blog post and the paper by Yao, Vehtari, Simpson, & Gelman (<a href="https://arxiv.org/abs/1802.02538">2018</a>) as well as the paper by Huggins, Kasprzak, Campbell, & Broderick (<a href="https://arxiv.org/abs/1910.04102">2019</a>).</p>
<p>Although the posterior distribution for $\beta$ and $\sigma^2$ is available in closed form (see the <em>Post Scriptum</em>), we check our results against exact inference using Markov chain Monte Carlo by visual inspection.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">sampling</span><span class="p">(</span><span class="n">model</span><span class="p">,</span><span class="w"> </span><span class="n">data</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">stan_dat</span><span class="p">,</span><span class="w"> </span><span class="n">iter</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">8000</span><span class="p">,</span><span class="w"> </span><span class="n">refresh</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w"> </span><span class="n">seed</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<p>The Figure below overlays our closed-form results on the histogram of posterior samples obtained using Stan.</p>
<p><img src="/assets/img/2019-10-30-Variational-Inference.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>Note that the posterior variance of $\beta$ is slightly <em>overestimated</em> when using our variational scheme. This is in contrast to the fact that variational inference generally <em>underestimates</em> variances. Note also that Bayesian inference using Markov chain Monte Carlo is very fast on this simple problem. However, the comparative advantage of variational inference becomes clear when increasing the sample size: for sample sizes as large as $n = 100000$, our variational inference scheme takes less than a tenth of a second!</p>
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have seen how to turn an integration problem into an optimization problem using variational inference. Assuming that the variational densities are independent, we have derived the optimal variational densities for a simple linear regression problem with one predictor. While using variational inference for this problem is unnecessary since everything is available in closed-form, I have focused on such a simple problem so as to not confound this introduction to variational inference by the complexity of the model. Still, the derivations were quite lengthy. They were also entirely specific to our particular problem, and thus generic “black-box” algorithms which avoid problem-specific calculations hold great promise.</p>
<p>We also implemented coordinate ascent mean-field variational inference (CAVI) in R and compared our results to results obtained via variational and exact inference using Stan. We have found that one probably should not trust Stan’s variational inference implementation, and that our results closely correspond to the exact procedure. For more on variational inference, I recommend the excellent review article by Blei, Kucukelbir, and McAuliffe (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>).</p>
<hr />
<p><em>I would like to thank Don van den Bergh for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="normal-inverse-gamma-distribution">Normal-inverse-gamma Distribution</h3>
<p>The posterior distribution is a <a href="https://en.wikipedia.org/wiki/Normal-inverse-gamma_distribution">Normal-inverse-gamma distribution</a>:</p>
<script type="math/tex; mode=display">p(\beta, \sigma^2 \mid \mathbf{y}) = \frac{\sqrt{\lambda}}{\sqrt{2\pi\sigma^2}} \frac{\gamma^{\alpha}}{\Gamma\left(\alpha\right)} \left(\sigma^2\right)^{-\alpha - 1} \text{exp}\left(-\frac{2\gamma + \lambda\left(\beta - \mu\right)^2}{2\sigma^2}\right) \enspace ,</script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mu &= \frac{\sum_{i=1}^n y_i x_i}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}} \\[.5em]
\lambda &= \sum_{i=1}^n x_i^2 + \frac{1}{\tau^2} \\[.5em]
\alpha &= \frac{n}{2} \\[.5em]
\gamma &= \frac{1}{2}\left(\sum_{i=1}^n y_i^2 - \frac{\left(\sum_{i=1}^n y_i x_i\right)^2}{\sum_{i=1}^n x_i^2 + \frac{1}{\tau^2}}\right) \enspace .
\end{aligned} %]]></script>
<p>Note that the marginal posterior distribution for $\beta$ is actually a Student-t distribution, contrary to what we assume in our variational inference scheme.</p>
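<p>The Student-t claim can be checked numerically. The following is a minimal sketch on simulated data; the data-generating values, the grid limits, and all variable names are assumptions made for illustration. It assumes that $\lambda$ is built from the sum of <em>squared</em> predictors, as completing the square in $\beta$ requires, and that integrating $\sigma^2$ out of the joint posterior leaves a Student-t distribution with $n$ degrees of freedom, location $\mu$, and squared scale $2\gamma / (n\lambda)$. The check compares this density against a brute-force marginalization of the unnormalized joint posterior on a grid.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1)
n <- 100
tau <- 0.50
x <- rnorm(n)
y <- 0.30 * x + rnorm(n)  # assumed data-generating process

# Closed-form quantities (lambda uses the sum of squared predictors)
lambda <- sum(x^2) + 1 / tau^2
mu <- sum(x * y) / lambda
gamma_post <- 0.5 * (sum(y^2) - sum(x * y)^2 / lambda)

# Student-t marginal for beta: n degrees of freedom, location mu
scale_beta <- sqrt(2 * gamma_post / (n * lambda))
beta_grid <- seq(mu - 5 * scale_beta, mu + 5 * scale_beta, length.out = 401)
dens_t <- dt((beta_grid - mu) / scale_beta, df = n) / scale_beta

# Unnormalized log joint posterior: likelihood, Gaussian prior with
# variance sigma^2 * tau^2, and Jeffreys' prior 1 / sigma^2
log_joint <- function(b, sigma2) {
  ss <- sum(y^2) - 2 * b * sum(x * y) + b^2 * lambda
  -(n / 2 + 3 / 2) * log(sigma2) - ss / (2 * sigma2)
}

# Brute-force marginalization over sigma^2 on an equally spaced grid
sigma2_grid <- seq(0.20, 4, length.out = 800)
lp <- outer(beta_grid, sigma2_grid, log_joint)
marg <- rowSums(exp(lp - max(lp)))
dens_grid <- marg / (sum(marg) * diff(beta_grid)[1])

# Largest pointwise discrepancy between the two densities
max(abs(dens_grid - dens_t))</code></pre></figure>
<p>If the closed-form expressions are right, the reported discrepancy should be small relative to the peak of the density.</p>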
<h2 id="references">References</h2>
<ul>
<li>Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (<a href="https://www.tandfonline.com/doi/full/10.1080/01621459.2017.1285773">2017</a>). Variational inference: A review for statisticians. <em>Journal of the American Statistical Association, 112</em>(518), 859-877.</li>
<li>Huggins, J. H., Kasprzak, M., Campbell, T., & Broderick, T. (<a href="https://arxiv.org/abs/1910.04102">2019</a>). Practical Posterior Error Bounds from Variational Objectives. <em>arXiv preprint</em> arXiv:1910.04102.</li>
<li>Kucukelbir, A., Ranganath, R., Gelman, A., & Blei, D. (<a href="http://papers.nips.cc/paper/5758-automatic-variational-inference-in-stan">2015</a>). Automatic variational inference in Stan. In <em>Advances in Neural Information Processing Systems</em> (pp. 568-576).</li>
<li>Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., & Blei, D. M. (<a href="http://www.jmlr.org/papers/volume18/16-107/16-107.pdf">2017</a>). Automatic differentiation variational inference. <em>The Journal of Machine Learning Research, 18</em>(1), 430-474.</li>
<li>Yao, Y., Vehtari, A., Simpson, D., & Gelman, A. (<a href="https://arxiv.org/abs/1802.02538">2018</a>). Yes, but did it work?: Evaluating variational inference. <em>arXiv preprint</em> arXiv:1802.02538.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The first part of this blog post draws heavily on the excellent review article by Blei, Kucukelbir, and McAuliffe (2017), and so I use their (machine learning) notation. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian Dablander
Bayesian inference using Markov chain Monte Carlo methods can be notoriously slow. In this blog post, we reframe Bayesian inference as an optimization problem using variational inference, markedly speeding up computation. We derive the variational objective function, implement coordinate ascent mean-field variational inference for a simple linear regression example in R, and compare our results to results obtained via variational and exact inference using Stan. Sounds like word salad? Then let’s start unpacking!
Harry Potter and the Power of Bayesian Constrained Inference2019-09-28T08:00:00+00:002019-09-28T08:00:00+00:00https://fabiandablander.com/r/Bayes-Potter<p>If you are reading this, you are probably a Ravenclaw. Or a Hufflepuff. Certainly not a Slytherin … but maybe a Gryffindor?</p>
<p>In this blog post, we let three subjective Bayesians predict the outcome of ten coin flips. We will derive prior predictions, evaluate their accuracy, and see how fortune favours the bold. We will also discover a neat trick that allows one to easily compute Bayes factors for models with parameter restrictions compared to models without such restrictions, and use it to answer a question we truly care about: are Slytherins really the bad guys?</p>
<h1 id="preliminaries">Preliminaries</h1>
<p>As in a <a href="https://fabiandablander.com/r/Regularization.html">previous blog post</a>, we start by studying coin flips. Let $\theta \in [0, 1]$ be the bias of the coin and let $y$ denote the number of heads out of $n$ coin flips. We use the Binomial likelihood</p>
<script type="math/tex; mode=display">p(y \mid \theta) = {n \choose y} \theta^y (1 - \theta)^{n - y} \enspace ,</script>
<p>and a Beta prior for $\theta$:</p>
<script type="math/tex; mode=display">p(\theta) = \frac{1}{\text{B}(a, b)} \theta^{a - 1} (1 - \theta)^{b - 1} \enspace .</script>
<p>This prior is <em>conjugate</em> for this likelihood which means that the posterior is again a Beta distribution. The Figure below shows two examples of this.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>In this blog post, we will use a <em>prior predictive</em> perspective on model comparison by means of Bayes factors. For an extensive contrast with a perspective based on <em>posterior prediction</em>, see <a href="https://fabiandablander.com/r/Law-of-Practice.html">this blog post</a>. The Bayes factor indicates how much better a model $\mathcal{M}_1$ predicts the data $y$ <em>relative to another model</em> $\mathcal{M}_0$:</p>
<script type="math/tex; mode=display">\text{BF}_{10} = \frac{p(y \mid \mathcal{M}_1)}{p(y \mid \mathcal{M}_0)} \enspace ,</script>
<p>where we can write the <em>marginal likelihood</em> of a generic model $\mathcal{M}$ in expanded form to see its dependence on the model’s priors:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}) = \int_{\Theta} p(y \mid \theta, \mathcal{M}) \, p(\theta \mid \mathcal{M}) \, \mathrm{d}\theta \enspace .</script>
<p>After these preliminaries, in the next section, we visit Ron, Harry, and Hermione in Hogwarts.</p>
<h1 id="the-hogwarts-prediction-contest">The Hogwarts prediction contest</h1>
<p>Ron, Harry, and Hermione just came back from a straining adventure — Death Eaters and all. They deserve a break, and Hermione suggests a small prediction contest to relax. Ron is put off initially; relaxing by thinking? That’s not his style. Harry does not care either way; both are eventually convinced.</p>
<p>The goal of the contest is to accurately predict the outcome of $n = 10$ coin flips. Luckily, this is not a particularly complicated problem to model, and we can use the Binomial likelihood we have discussed above. In the next section, Ron, Harry, and Hermione — all subjective Bayesians — clearly state their prior beliefs, which are required to make predictions.</p>
<h2 id="prior-beliefs">Prior beliefs</h2>
<p>Ron is not big on thinking, and so trusts his previous intuitions that coins are usually unbiased; he specifies a point mass on $\theta = 0.50$ as his prior. Harry spreads his bets evenly, and believes that all chances governing the coin flip’s outcome are equally likely; he puts a uniform prior on $\theta$. Hermione, on the other hand, believes that the coin <em>cannot</em> be biased towards tails; instead, she believes that all values $\theta \in [0.50, 1]$ are equally likely. She thinks this because Dobby — the house elf — is the one who throws the coin, and she has previously observed him passing time by flipping coins, which strangely almost always landed heads up. To sum up, their priors are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\text{Ron} &: \theta = 0.50 \\[.5em]
\text{Harry} &: \theta \sim \text{Beta}(1, 1) \\[.5em]
\text{Hermione} &: \theta \sim \text{Beta}(1, 1)\mathbb{I}(0.50, 1) \enspace ,
\end{aligned} %]]></script>
<p>which are visualized in the Figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In the next section, the three use their beliefs to make probabilistic predictions.</p>
<h2 id="prior-predictions">Prior predictions</h2>
<p>Ron, Harry, and Hermione are subjective Bayesians and therefore evaluate their performance by their respective predictive accuracy. Each of the trio has a <em>prior predictive distribution</em>. For Ron, true to character, this is the easiest to derive. We associate model $\mathcal{M}_0$ with him and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_0) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_0) \, p(\theta \mid \mathcal{M}_0) \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} 0.50^y (1 - 0.50)^{n - y} \enspace ,
\end{aligned} %]]></script>
<p>where the integral — the sum! — is just over the value $\theta = 0.50$. Ron’s prior predictive distribution is simply a Binomial distribution. He is delighted by this fact, and enjoys a short rest while the others derive their predictions.</p>
<p>It is Harry’s turn, and he is a little put off by his integration problem. However, he realizes that the integrand is an unnormalized Beta distribution, and swiftly writes down its normalizing constant, the Beta function. Associating $\mathcal{M}_1$ with him, his steps are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_1) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \frac{1}{\text{B}(1, 1)} \theta^{1 - 1} (1 - \theta)^{1 - 1} \, \mathrm{d}\theta \\[.5em]
&= \int_{\Theta} {n \choose y} \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta \\[.5em]
&= {n \choose y} \text{B}(y + 1, n - y + 1) \enspace ,
\end{aligned} %]]></script>
<p>which is a <a href="https://en.wikipedia.org/wiki/Beta-binomial_distribution">Beta-Binomial distribution</a> with $\alpha = \beta = 1$.</p>
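<p>With a uniform prior, this expression simplifies considerably: the Binomial coefficient and the Beta function cancel to $1 / (n + 1)$, so every outcome from $0$ to $n$ heads receives the same prior predictive probability. A minimal sketch of this check for $n = 10$ (the variable names are chosen here for illustration):</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">n <- 10
y <- 0:n

# Harry's prior predictive probability for each possible number of heads
harry <- choose(n, y) * beta(y + 1, n - y + 1)
harry

# Compare against the constant 1 / (n + 1)
all.equal(harry, rep(1 / (n + 1), n + 1))</code></pre></figure>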
<p>Hermione’s integral is the most complicated of the three, but she is also the smartest of the bunch. She is a master of the wizardry that is computer programming, which allows her to solve the integral numerically.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> We associate $\mathcal{M}_r$, which stands for <em>restricted</em> model, with her and write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(y \mid \mathcal{M}_r) &= \int_{\Theta} p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta \\[.5em]
&= \int_{0.50}^1 {n \choose y} \theta^y (1 - \theta)^{n - y} \, 2 \, \mathrm{d}\theta \\[.5em]
&= 2{n \choose y}\int_{0.50}^1 \theta^y (1 - \theta)^{n - y} \mathrm{d}\theta \enspace .
\end{aligned} %]]></script>
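<p>Although Hermione solves her integral numerically, it can also be written using the Beta distribution function, since $\int_{0.50}^1 \theta^y (1 - \theta)^{n - y} \, \mathrm{d}\theta$ equals $\text{B}(y + 1, n - y + 1)$ times the probability that a $\text{Beta}(y + 1, n - y + 1)$ variable exceeds $0.50$. The sketch below compares the two routes for $n = 10$; the names <code>pred_numeric</code> and <code>pred_closed</code> are introduced here for illustration only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">n <- 10
y <- 0:n

# Numerical route: integrate the Binomial kernel over [0.50, 1] for each y
pred_numeric <- sapply(y, function(k) {
  int <- integrate(function(theta) theta^k * (1 - theta)^(n - k), 0.50, 1)
  2 * choose(n, k) * int$value
})

# Closed-form route: Beta function times the upper tail of a Beta distribution
pred_closed <- 2 * choose(n, y) * beta(y + 1, n - y + 1) *
  pbeta(0.50, y + 1, n - y + 1, lower.tail = FALSE)

# Largest discrepancy between the two routes
max(abs(pred_numeric - pred_closed))</code></pre></figure>
<p>The closed-form route avoids one call to <code>integrate</code> per outcome and is vectorized over $y$.</p>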
<p>We can draw from the prior predictive distributions by simulating from the prior and then making predictions through the likelihood. For Hermione, for example, this yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">nr_draws</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">20</span><span class="w">
</span><span class="n">theta_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">min</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="n">max</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">rbinom</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">nr_draws</span><span class="p">,</span><span class="w"> </span><span class="n">size</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="n">prob</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">theta_Hermione</span><span class="p">)</span><span class="w">
</span><span class="n">predictions_Hermione</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 10 10 10 3 7 10 8 9 6 9 9 6 8 9 8 10 6 10 5 7</code></pre></figure>
<p>Let’s visualize Ron’s, Harry’s, and Hermione’s prior predictive distributions to get a better feeling for what they believe are likely coin flip outcomes. First, we implement their prior predictions in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Ron</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">0.50</span><span class="o">^</span><span class="n">n</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Harry</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">beta</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">Hermione</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">10</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">int</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">theta</span><span class="o">^</span><span class="n">y</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">theta</span><span class="p">)</span><span class="o">^</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">y</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="m">2</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">choose</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">int</span><span class="o">$</span><span class="n">value</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Even though Ron believes that $\theta = 0.50$, this does not mean that his prior prediction puts all mass on $y = 5$; deviations from this value are plausible. Harry’s prior predictive distribution also makes sense: since he believes all values for $\theta$ to be equally likely, he should believe all outcomes are equally likely. Hermione, on the other hand, believes that $\theta \in [0.50, 1]$, so her prior probabilities for outcomes with few heads ($y < 5$) drastically decrease.</p>
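<p>As a quick sanity check, the sketch below restates the three functions so that it runs on its own and evaluates them over all eleven possible outcomes. The matrix name <code>pred</code> is ours and purely illustrative. Each prior predictive distribution should sum to one, and Harry's should assign probability $\frac{1}{11}$ to every outcome.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Prior predictive distributions, restated from above so the chunk is self-contained
Ron <- function(y, n = 10) choose(n, y) * 0.50^n
Harry <- function(y, n = 10) choose(n, y) * beta(y + 1, n - y + 1)
Hermione <- function(y, n = 10) {
  int <- integrate(function(theta) theta^y * (1 - theta)^(n - y), 0.50, 1)
  2 * choose(n, y) * int$value
}

y <- 0:10
pred <- cbind(
  Ron = Ron(y),
  Harry = Harry(y),
  Hermione = sapply(y, Hermione) # integrate() is not vectorised over y
)

colSums(pred)          # each column sums to (approximately) 1
range(pred[, 'Harry']) # every outcome has probability 1 / 11</code></pre></figure>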
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>After the three have clearly stated their prior beliefs and derived their prior predictions, Dobby throws a coin ten times. The coin comes up heads nine times. In the next section, we discuss the relative predictive performance of Ron, Harry, and Hermione based on these data.</p>
<h2 id="evaluating-predictions">Evaluating predictions</h2>
<p>To assess the relative predictive performance of Ron, Harry, and Hermione, we need to compute the probability mass of $y = 9$ for their respective prior predictive distributions. Compared to Ron, Hermione did roughly 19 times better:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 18.50909</code></pre></figure>
<p>Harry, on the other hand, did about 9 times better than Ron:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Ron</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 9.309091</code></pre></figure>
<p>With these two comparisons, we also know by how much Hermione outperformed Harry, since by transitivity we have:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_0)} \times \frac{p(y \mid \mathcal{M}_0)}{p(y \mid \mathcal{M}_1)} = \text{BF}_{r0} \times \frac{1}{\text{BF}_{10}} \approx 2 \enspace ,</script>
<p>which is indeed correct:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Hermione</span><span class="p">(</span><span class="m">9</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Harry</span><span class="p">(</span><span class="m">9</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>Note that this is also immediately apparent from the visualizations above, where Hermione’s allocated probability mass is about twice as large as Harry’s for the case where $y = 9$.</p>
<p>Hermione was bold in her prediction, and was rewarded with being favoured by a factor of two in predictive performance. Note that if her predictions had been even bolder, say restricting her prior to $\theta \in [0.80, 1]$, she would have reaped an even higher reward than a Bayes factor of two in her favour. Contrast this with Dobby throwing the coin ten times and only one heads showing. Then Harry’s marginal likelihood is still $\binom{10}{1} \, \text{Beta}(2, 10) = \frac{1}{11}$, since his prior predictive distribution is uniform over the eleven possible outcomes. However, Hermione’s is not twice as much; instead, it is a mere $0.001065$, which would result in a Bayes factor of about $85$ in favour of Harry!</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">Harry</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="n">Hermione</span><span class="p">(</span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 85.33333</code></pre></figure>
<p>This means that with bold predictions, one can also lose a lot. However, this is tremendously insightful, since Hermione would immediately realize where she went wrong. For a discussion that also points out the flexibility of Bayesian model comparison, see Etz, Haaf, Rouder, & Vandekerckhove (2018).</p>
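<p>To make the trade-off concrete, here is a minimal sketch that computes the Bayes factor of a uniform prior on $[\text{lower}, 1]$ against Harry's uniform prior on $[0, 1]$. The helper <code>bf_restricted</code> and its arguments are hypothetical names introduced for illustration only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper: uniform prior on [lower, 1] versus Harry's uniform prior on [0, 1]
bf_restricted <- function(y, lower, n = 10) {
  int <- integrate(function(theta) dbinom(y, n, theta), lower, 1)
  restricted <- int$value / (1 - lower) # prior density is 1 / (1 - lower) on [lower, 1]
  encompassing <- 1 / (n + 1)           # Harry's prior predictive is uniform over 0, ..., n
  restricted / encompassing
}

bf_restricted(9, lower = 0.50) # Hermione's actual prior
bf_restricted(9, lower = 0.80) # a bolder restriction, rewarded more when y = 9
bf_restricted(1, lower = 0.50) # Hermione's prior again, punished when y = 1</code></pre></figure>
<p>For $y = 9$, the bolder restriction yields a Bayes factor of about $3.4$ rather than $2$, while for $y = 1$ Hermione's prior yields roughly $\frac{1}{85}$.</p>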
<p>In the next section, we will discover a nice trick which simplifies the computation of the Bayes factor; we do not need to derive marginal likelihoods, but can simply look at the prior and the posterior distribution of the parameter of interest in the unrestricted model.</p>
<h1 id="prior--posterior-trick">Prior / Posterior trick</h1>
<p>As it turns out, the relative predictive performance of Hermione compared to Harry is given by the ratio of the purple area to the blue area in the figure below.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-10-1.png" title="plot of chunk unnamed-chunk-10" alt="plot of chunk unnamed-chunk-10" style="display: block; margin: auto;" /></p>
<p>In other words, the Bayes factor in favour of the <em>restricted</em> model (i.e., Hermione) compared to the <em>unrestricted</em> or <em>encompassing</em> model (i.e., Harry) is given by the posterior probability of $\theta$ being in line with the restriction compared to the prior probability of $\theta$ being in line with the restriction. We can check this numerically:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># (1 - pbeta(0.50, 10, 2)) / 0.50 would also work</span><span class="w">
</span><span class="n">integrate</span><span class="p">(</span><span class="k">function</span><span class="p">(</span><span class="n">theta</span><span class="p">)</span><span class="w"> </span><span class="n">dbeta</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">),</span><span class="w"> </span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="o">$</span><span class="n">value</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">0.50</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1.988281</code></pre></figure>
<p>This is a very cool result which, to my knowledge, was first described in Klugkist, Kato, & Hoijtink (2005). In the next section, we prove it.</p>
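<p>The same quantity can also be approximated by sampling, which is handy when the posterior has no closed form. The sketch below assumes we can draw from Harry's prior and posterior; the seed, the number of draws, and the variable names are arbitrary choices of ours.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(1) # arbitrary seed, only for reproducibility

n_draws <- 1e5
prior_draws <- runif(n_draws)            # Harry's prior, Beta(1, 1)
posterior_draws <- rbeta(n_draws, 10, 2) # Harry's posterior after 9 heads in 10 flips

# Proportion of draws that are in line with Hermione's restriction
prior_prop <- mean(prior_draws >= 0.50)
posterior_prop <- mean(posterior_draws >= 0.50)

bf_mc <- posterior_prop / prior_prop
bf_mc # close to the exact value 1.988281, up to Monte Carlo error</code></pre></figure>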
<h2 id="proof">Proof</h2>
<p>The proof uses two insights. First, note that we can write the priors in the restricted model, $\mathcal{M}_r$, as priors in the encompassing model, $\mathcal{M}_1$, subject to some constraints. In the Hogwarts prediction context, Hermione’s prior was a restricted version of Harry’s prior:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid \mathcal{M}_r) &= p(\theta \mid \mathcal{M}_1)\mathbb{I}(0.50, 1) \\[1em]
&= \begin{cases} \frac{p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} & \text{if} \hspace{1em} \theta \in [0.50, 1] \\[1em] 0 & \text{otherwise}\end{cases}
\end{aligned} %]]></script>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>We have to divide by the term</p>
<script type="math/tex; mode=display">K = \int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta = 0.50 \enspace ,</script>
<p>so that the restricted prior integrates to 1, as all proper probability distributions must. As a direct consequence, note that the density of a value $\theta = \theta^{\star}$ is given by:</p>
<script type="math/tex; mode=display">p(\theta^{\star} \mid \mathcal{M}_r) = p(\theta^{\star} \mid \mathcal{M}_1) \cdot \frac{1}{K} \enspace ,</script>
<p>where $K$ is the renormalization constant. This means that we can rewrite terms which include the restricted prior in terms of the unrestricted prior from the encompassing model. This also holds for the posterior!</p>
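<p>A small numerical sketch of this renormalization, using Harry's uniform prior; the function name <code>restricted_prior</code> is ours:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">K <- 1 - punif(0.50) # prior mass of [0.50, 1] under Harry's uniform prior

# Hermione's prior: Harry's prior divided by K inside the restriction, zero outside
restricted_prior <- function(theta) {
  ifelse(theta >= 0.50, dunif(theta) / K, 0)
}

restricted_prior(c(0.25, 0.75))            # 0 outside the restriction, 1 / K = 2 inside
integrate(restricted_prior, 0.50, 1)$value # the restricted prior integrates to 1</code></pre></figure>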
<p>To see that we can also write the restricted posterior in terms of the unrestricted posterior from the encompassing model, note that the likelihood is the same under both models and that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
p(\theta \mid y, \mathcal{M}_r) &= \frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r) \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K}}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \frac{1}{K} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int_{0.50}^1 p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \\[.5em]
&= \frac{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta}}{\int_{0.50}^1 \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{\int p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} \, \mathrm{d}\theta} \\[.5em]
&= \frac{p(\theta \mid y, \mathcal{M}_1)}{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta} \enspace ,
\end{aligned} %]]></script>
<p>where we have to renormalize by</p>
<script type="math/tex; mode=display">Z = \int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta \enspace ,</script>
<p>which is</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">pbeta</span><span class="p">(</span><span class="m">0.50</span><span class="p">,</span><span class="w"> </span><span class="m">10</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 0.9941406</code></pre></figure>
<p>The figure below visualizes Harry’s and Hermione’s posterior. Sensibly, since Hermione excluded all $\theta \in [0, 0.50)$ in her prior, such values receive zero credence in her posterior. However, the difference between Harry’s and Hermione’s posterior distributions is very small compared to the difference between their prior distributions. This is reflected in $Z$ being close to 1.</p>
<p><img src="/assets/img/2019-09-28-Bayes-Potter.Rmd/unnamed-chunk-14-1.png" title="plot of chunk unnamed-chunk-14" alt="plot of chunk unnamed-chunk-14" style="display: block; margin: auto;" /></p>
<p>Similar to the prior, we can write the density of a value $\theta = \theta^\star$ in terms of the encompassing model:</p>
<script type="math/tex; mode=display">p(\theta^{\star} \mid y, \mathcal{M}_r) = p(\theta^{\star} \mid y, \mathcal{M}_1) \cdot \frac{1}{Z} \enspace .</script>
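<p>We can check this numerically by computing Hermione's posterior in two ways: directly from her prior and the likelihood, and by truncating and renormalizing Harry's posterior. This is only a sketch; the helper names and the three evaluation points are arbitrary.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">y <- 9
n <- 10
Z <- 1 - pbeta(0.50, 10, 2)

# Route 1: Bayes' rule applied directly to Hermione's prior (density 2 on [0.50, 1])
unnormalised <- function(theta) dbinom(y, n, theta) * 2 * (theta >= 0.50)
evidence <- integrate(unnormalised, 0.50, 1)$value
direct <- function(theta) unnormalised(theta) / evidence

# Route 2: Harry's Beta(10, 2) posterior, truncated and renormalised by Z
truncated <- function(theta) dbeta(theta, 10, 2) * (theta >= 0.50) / Z

theta <- c(0.30, 0.60, 0.90)
cbind(direct = direct(theta), truncated = truncated(theta))</code></pre></figure>
<p>Both columns should agree up to numerical integration error, with zero density at $\theta = 0.30$.</p>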
<p>Now that we have established that we can write both the prior and the posterior density of parameters in the restricted model in terms of the parameters in the unrestricted model, as a second step, note that we can swap the posterior and the marginal likelihood terms in Bayes’ rule such that:</p>
<script type="math/tex; mode=display">p(y \mid \mathcal{M}_1) = \frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)} \enspace ,</script>
<p>from which it follows that:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{p(y \mid \mathcal{M}_r)}{p(y \mid \mathcal{M}_1)} = \frac{\frac{p(y \mid \theta, \mathcal{M}_r) \, p(\theta \mid \mathcal{M}_r)}{p(\theta \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta, \mathcal{M}_1) \, p(\theta \mid \mathcal{M}_1)}{p(\theta \mid y, \mathcal{M}_1)}} \enspace .</script>
<p>Now suppose that we have values that are in line with the restriction, i.e., $\theta = \theta^{\star}$. Then:</p>
<script type="math/tex; mode=display">\begin{aligned}
\text{BF}_{r1} = \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star\mid \mathcal{M}_r)}{p(\theta^\star \mid y, \mathcal{M}_r)}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, p(\theta^\star \mid \mathcal{M}_1) \, \frac{1}{K}}{p(\theta^\star \mid y, \mathcal{M}_1) \, \frac{1}{Z}}}{\frac{p(y \mid \theta^\star, \mathcal{M}_1) \, p(\theta^\star \mid \mathcal{M}_1)}{p(\theta^\star \mid y, \mathcal{M}_1)}}
= \frac{\frac{p(y \mid \theta^\star, \mathcal{M}_r) \, \frac{1}{K}}{\frac{1}{Z}}}{p(y \mid \theta^\star, \mathcal{M}_1)} = \frac{\frac{1}{K}}{\frac{1}{Z}} = \frac{Z}{K} \enspace ,
\end{aligned}</script>
<p>where we have used the previous insights and the fact that the likelihood under $\mathcal{M}_r$ and $\mathcal{M}_1$ is the same. If we expand the constants for our previous problem, we have:</p>
<script type="math/tex; mode=display">\text{BF}_{r1} = \frac{Z}{K} = \frac{\int_{0.50}^1 p(\theta \mid y, \mathcal{M}_1) \, \mathrm{d}\theta}{\int_{0.50}^1 p(\theta \mid \mathcal{M}_1) \, \mathrm{d}\theta} = \frac{p(\theta \in [0.50, 1] \mid y, \mathcal{M}_1)}{p(\theta \in [0.50, 1] \mid \mathcal{M}_1)} \enspace ,</script>
<p>which is, as claimed above, the posterior probability of values for $\theta$ that are in line with the restriction divided by the prior probability of values for $\theta$ that are in line with the restriction. Note that this holds for arbitrary restrictions of an arbitrary number of parameters (see Klugkist, Kato, & Hoijtink, 2005). In the limit where we take the restriction to be infinitesimally small, that is, constrain the parameter to be a point value, this results in the Savage-Dickey density ratio (Wetzels, Grasman, & Wagenmakers, 2010).</p>
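<p>The limiting case can be illustrated with a shrinking interval around Ron's point value $\theta = 0.50$. The sketch below is an illustration under Harry's uniform prior, not a proof; the half-widths in <code>eps</code> are arbitrary.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">theta0 <- 0.50
eps <- 10^(-(1:5)) # half-widths of the interval around theta0

# Posterior mass over prior mass for theta in [theta0 - eps, theta0 + eps],
# using Harry's Beta(10, 2) posterior and his uniform prior
posterior_mass <- pbeta(theta0 + eps, 10, 2) - pbeta(theta0 - eps, 10, 2)
prior_mass <- 2 * eps
interval_bf <- posterior_mass / prior_mass

# Savage-Dickey density ratio: posterior density over prior density at theta0
savage_dickey <- dbeta(theta0, 10, 2) / dbeta(theta0, 1, 1)

cbind(eps, interval_bf, savage_dickey)</code></pre></figure>
<p>As the interval shrinks, the ratio approaches the density ratio of about $0.107$, which is the Bayes factor in favour of Ron compared to Harry, that is, the reciprocal of the $9.309$ computed earlier.</p>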
<p>In the next section, we apply this idea to a data set that relates Hogwarts Houses to personality traits.</p>
<h1 id="hogwarts-houses-and-personality">Hogwarts Houses and personality</h1>
<p>So, are you a Slytherin, Hufflepuff, Ravenclaw, or Gryffindor? And what does this say about your personality?</p>
<p>Inspired by Crysel et al. (2015), Lea Jakob, Eduardo Garcia-Garzon, Hannes Jarke, and I analyzed self-reported personality data from 847 people as well as their self-reported Hogwarts House affiliation.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> We wanted to answer questions such as: do people who report belonging to Slytherin tend to score highest on Narcissism, Machiavellianism, and Psychopathy? Are Hufflepuffs the most agreeable, and Gryffindors the most extraverted? The Figure below visualizes the raw data.</p>
<div style="text-align:center;">
<img src="../assets/img/Potter-Personality.png" align="center" style="margin-top: -10px; padding-bottom: 0px;" width="680" height="540" />
</div>
<p>We used a between-subjects ANOVA as our model and, in the case of Agreeableness, for example, compared the following hypotheses:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathcal{H}_0&: \mu_H = \mu_G = \mu_R = \mu_S \\[.5em]
\mathcal{H}_r&: \mu_H > (\mu_G , \mu_R , \mu_S) \\[.5em]
\mathcal{H}_1&: \mu_H , \mu_G , \mu_R , \mu_S
\end{aligned} %]]></script>
<p>We used the BayesFactor R package to compute the Bayes factor in favour of $\mathcal{H}_1$ compared to $\mathcal{H}_0$. For the restricted hypotheses $\mathcal{H}_r$, we used the prior/posterior trick outlined above; and indeed, we found strong evidence in favour of the notion that, for example, Hufflepuffs score highest on Agreeableness. Curious about Slytherin and the other Houses? You can read the published paper with all the details <a href="https://www.collabra.org/article/10.1525/collabra.240/">here</a>.</p>
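<p>To give a flavour of how the trick works for an order restriction, here is a toy sketch. The draws below are simulated from made-up normal distributions and are <em>not</em> the data or the posterior from our study; in a real analysis, they would be prior and posterior draws of the four group means under the unrestricted model. The helper <code>in_line</code> is a hypothetical name.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">set.seed(2) # arbitrary seed

# Made-up stand-ins for draws of (mu_H, mu_G, mu_R, mu_S) under the unrestricted model
n_draws <- 1e5
prior_draws <- matrix(rnorm(4 * n_draws), ncol = 4)
posterior_draws <- cbind(
  H = rnorm(n_draws, 0.40, 0.10),
  G = rnorm(n_draws, 0.10, 0.10),
  R = rnorm(n_draws, 0.00, 0.10),
  S = rnorm(n_draws, -0.20, 0.10)
)

# Restriction: the first mean is larger than each of the other three
in_line <- function(draws) draws[, 1] > pmax(draws[, 2], draws[, 3], draws[, 4])

prior_prop <- mean(in_line(prior_draws)) # about 1 / 4 when the prior treats the groups alike
posterior_prop <- mean(in_line(posterior_draws))
posterior_prop / prior_prop              # Bayes factor of H_r against H_1</code></pre></figure>
<p>Since the posterior proportion cannot exceed one, the Bayes factor in favour of this restriction is bounded above by the reciprocal of the prior proportion, here about four.</p>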
<h1 id="conclusion">Conclusion</h1>
<p>Participating in a relaxing prediction contest, we saw how three subjective Bayesians named Ron, Harry, and Hermione formalized their beliefs and derived their predictions about the likely outcome of ten coin flips. By restricting her prior beliefs about the bias of the coin to exclude values smaller than $\theta = 0.50$, Hermione was the boldest in her predictions and was ultimately rewarded. However, if the outcome of the coin flips had turned out differently, say $y = 2$, then Hermione would have immediately realized how wrong her beliefs were. I think we as scientists need to be more like Hermione: we need to make more precise predictions, allowing us to construct more powerful tests and “fail” in insightful ways.</p>
<p>We also saw a neat trick by which one can compute the Bayes factor in favour of a restricted model compared to an unrestricted model by estimating the proportion of prior and posterior values of the parameter that are in line with the restriction — no painstaking computation of marginal likelihoods required! We used this trick to find evidence for what we all knew deep in our hearts already: Hufflepuffs are <em>so</em> agreeable.</p>
<hr />
<p><em>I would like to thank Sophia Crüwell and Lea Jakob for helpful comments on this blog post.</em></p>
<hr />
<h2 id="references">References</h2>
<ul>
<li>Klugkist, I., Kato, B., & Hoijtink, H. (<a href="https://onlinelibrary.wiley.com/doi/full/10.1111/j.1467-9574.2005.00279.x">2005</a>). Bayesian model selection using encompassing priors. <em>Statistica Neerlandica, 59</em>(1), 57-69.</li>
<li>Wetzels, R., Grasman, R. P., & Wagenmakers, E. J. (<a href="https://www.sciencedirect.com/science/article/pii/S0167947310001180">2010</a>). An encompassing prior generalization of the Savage–Dickey density ratio. <em>Computational Statistics & Data Analysis, 54</em>(9), 2094-2102.</li>
<li>Etz, A., Haaf, J. M., Rouder, J. N., & Vandekerckhove, J. (<a href="https://journals.sagepub.com/doi/full/10.1177/2515245918773087">2018</a>). Bayesian inference and testing any hypothesis you can specify. <em>Advances in Methods and Practices in Psychological Science, 1</em>(2), 281-295.</li>
<li>Crysel, L. C., Cook, C. L., Schember, T. O., & Webster, G. D. (<a href="https://www.sciencedirect.com/science/article/abs/pii/S0191886915002615">2015</a>). Harry Potter and the measures of personality: Extraverted Gryffindors, agreeable Hufflepuffs, clever Ravenclaws, and manipulative Slytherins. <em>Personality and Individual Differences, 83</em>, 174-179.</li>
<li>Jakob, L., Garcia-Garzon, E., Jarke, H., & Dablander, F. (<a href="https://www.collabra.org/article/10.1525/collabra.240/">2019</a>). The Science Behind the Magic? The Relation of the Harry Potter “Sorting Hat Quiz” to Personality and Human Values. <em>Collabra: Psychology, 5</em>(1).</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>The analytical solution is <a href="https://www.wolframalpha.com/input/?i=Integral%5Btheta%5Ey+*+%281+-+theta%29%5E%28n+-+y%29%2C+theta%2C+0.50%2C+1%5D">unpleasant</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>You can discover your Hogwarts House affiliation at <a href="https://www.pottermore.com/">https://www.pottermore.com/</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Love affairs and linear differential equations2019-08-29T11:00:00+00:002019-08-29T11:00:00+00:00https://fabiandablander.com/r/Linear-Love<p>Differential equations are a powerful tool for modeling how systems change over time, but they can be a little hard to get into. Love, on the other hand, is humanity’s perennial topic; some even claim it is all you need. In this blog post, inspired by Strogatz (1988, 2015), I will introduce linear differential equations as a means to study the types of love affairs two people might find themselves in.</p>
<p>Do opposites attract? What happens to a relationship if lovers are out of touch with their own feelings? We will answer these and other questions using two coupled linear differential equations. On our journey, we will use graphical as well as mathematical methods to classify the types of relationships this modeling framework can accommodate. In a follow-up blog post, we will also play around with non-linear terms and add a third wheel to the mix, which can lead to chaos — in the technical sense of the term, of course. Excited? Then let’s get started!</p>
<h1 id="introducing-romeo">Introducing Romeo</h1>
<blockquote>
A lovestruck Romeo sang the streets of serenade <br />
Laying everybody low with a love song that he made <br />
Finds a streetlight, steps out of the shade <br />
Says something like, "You and me, babe, how about it?"
</blockquote>
<p>Romeo is quite the emotional type. Let $R(t)$ denote his feelings at time point $t$. Following common practice, we will usually write $R$ instead of $R(t)$, making the time dependence implicit. The process which describes how Romeo’s feelings change is rather simple: it depends only on Romeo’s current feelings. We write:</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}R}{\mathrm{d}t} = aR \enspace ,</script>
<p>which is a linear differential equation. Note that this <em>implicitly</em> encodes how Romeo’s feelings change over time, since when we know $R$ at time point $t$, we can compute the direction and speed with which $R$ will change — the derivative denotes velocity. Our goal, however, is to find an explicit, closed-form expression for Romeo’s feelings at time point $t$. In this particular case, we can do this analytically:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR \\[.5em]
\frac{1}{aR}\mathrm{d}R &= \mathrm{d}t \\[.5em]
\frac{1}{a}\int \frac{1}{R}\mathrm{d}R &= \int \mathrm{d}t \\[.5em]
\frac{1}{a} \left[\text{log} \, R + C \right] &= t \\[.5em]
\text{log} \, R &= a t - C \\[.5em]
R &= e^{at - C} \enspace .
\end{aligned} %]]></script>
<p>A differential equation describes how something changes; to kickstart the process, we need an initial condition $R_0$. This allows us to find the constant of integration $C$. In particular, assume that $R = R_0$ at $t = 0$, which leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R_0 &= e^{-C} \\[.5em]
\text{log} \, R_0 &= -C \enspace ,
\end{aligned} %]]></script>
<p>such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R &= e^{at + \text{log} \, R_0} \\[.5em]
R &= R_0 e^{at} \enspace .
\end{aligned} %]]></script>
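<p>If a closed-form solution were not available, we could approximate the trajectory numerically. The sketch below uses Euler's method, a simple scheme we swap in for illustration, and compares it with the closed-form solution; the values of <code>a</code>, <code>R0</code>, and <code>dt</code> are arbitrary.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">a <- -0.50  # illustrative value
R0 <- 1     # initial condition
dt <- 0.001 # Euler step size
times <- seq(0, 5, by = dt)

# Euler's method: repeatedly step along the derivative dR/dt = a * R
R_euler <- numeric(length(times))
R_euler[1] <- R0
for (i in seq_along(times)[-1]) {
  R_euler[i] <- R_euler[i - 1] + dt * a * R_euler[i - 1]
}

R_exact <- R0 * exp(a * times) # closed-form solution
max(abs(R_euler - R_exact))    # small, and it shrinks as dt decreases</code></pre></figure>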
<p>The left two panels of the figure below visualize how Romeo’s feelings change over time for $a > 0$ with initial condition $R_0 = 1$ (top) or $R_0 = -1$ (bottom). The right two panels show how his feelings change for $a < 0$ with $R_0 = 100$ (top) or $R_0 = -100$ (bottom).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-1-1.png" title="plot of chunk unnamed-chunk-1" alt="plot of chunk unnamed-chunk-1" style="display: block; margin: auto;" /></p>
<p>We conclude: Romeo is a simple guy, albeit with an emotion regulation problem. When the object of his affection is such that $a > 0$, his feelings will either grow exponentially towards mad love if he starts out with a positive first impression ($R_0 > 0$), or grow exponentially towards mad hatred if he starts out with a negative first impression ($R_0 < 0$). On the other hand, if $a < 0$, then regardless of his initial feelings, they will decay exponentially towards indifference.</p>
<p>For $R_0 = 0$, Romeo’s feelings never change. For any other initial condition, we have unhindered, exponential growth when $a > 0$; it never stops. For any other initial condition and $a < 0$, his feelings decay exponentially towards zero. Thus $R = 0$ is a <em>fixed point</em> in both cases, which is <em>stable</em> for $a < 0$ but becomes <em>unstable</em> if $a > 0$. We can visualize this in <em>phase space</em> on a line. The phase space is filled with all possible trajectories because each point can serve as the initial condition.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-2-1.png" title="plot of chunk unnamed-chunk-2" alt="plot of chunk unnamed-chunk-2" style="display: block; margin: auto;" /></p>
<p>In the next section, a wonderful new episode in Romeo’s life begins: he meets Juliet.</p>
<h1 id="introducing-juliet">Introducing Juliet</h1>
<blockquote>
Juliet says, "Hey, it's Romeo, you nearly gave me a heart attack" <br />
He's underneath the window, she's singing, "Hey, la, my boyfriend's back <br />
You shouldn't come around here singing up at people like that <br />
Anyway, what you gonna do about it?"
</blockquote>
<p>Life becomes more complicated for Romeo now that Juliet is in his life. It is their first real relationship, and they have much to learn. We start simple. Let $J$ denote Juliet’s feelings for Romeo, and let $R$ denote Romeo’s feelings for Juliet. We can extend our single linear differential equation from above to a system of two linear differential equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= dJ \enspace .
\end{aligned} %]]></script>
<p>Using the results from above, the solutions to the two differential equations are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R(t) &= R_0 e^{at} \\[.5em]
J(t) &= J_0 e^{dt} \enspace ,
\end{aligned} %]]></script>
<p>where $R(t)$ and $J(t)$ give the trajectories of love for Romeo and Juliet, respectively, and $J_0$ is Juliet’s initial feeling towards Romeo at $t = 0$. In contrast to the one-dimensional phase diagram from above, we now have a two-dimensional picture which is known as a <em>vector field</em>.</p>
<p>Analogously to the case of a single differential equation, $a < 0$ and $d < 0$ imply exponential decay for Romeo and Juliet’s love, and $a > 0$ and $d > 0$ imply exponential growth. The left figure below visualizes decay: whatever the initial state of their love, it will crash into the origin of indifference. The figure on the right visualizes growth: whatever the initial state, except for indifference, their feelings will grow exponentially and eventually consume them.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-4-1.png" title="plot of chunk unnamed-chunk-4" alt="plot of chunk unnamed-chunk-4" style="display: block; margin: auto;" /></p>
<p>This can result in happy, ever increasing love if they start out liking each other (top right quadrant), but can also result in an increasingly violent feud if they start out disliking each other (bottom left quadrant). For asymmetric starts, one of them will be hopelessly in love with the other, while the other’s hate grows unboundedly. The fixed point (0, 0) is <em>stable</em> on the left, as any tiny perturbation will move the system towards it. In contrast, the fixed point on the right is <em>unstable</em>, as any ounce of love or hate, no matter how small, will make the system explode. One unfortunate subtlety arises, however: if Romeo loves Juliet, but Juliet is indifferent, then Juliet will forever stay indifferent even though Romeo’s love grows without bound.</p>
<p>Another interesting case occurs when their affection is asymmetric, i.e., $a \neq d$. The figure below on the left shows one such case for negative parameters: we see that whatever feelings Juliet has for Romeo, they decay faster than the feelings Romeo has for Juliet. Moreover, since $a < 0$ and $d < 0$, the origin is stable. The figure on the right shows a more impactful asymmetry: Romeo’s feelings decay ($a < 0$), but Juliet’s increase ($d > 0$). Regardless of what the initial feelings of Romeo are, he will always end up in a state of indifference with respect to Juliet (all arrows point to the y-axis). Juliet, on the other hand, will go increasingly mad with love or hate, depending on her initial feelings. The exception is if she starts out with indifference ($J_0 = 0$): then she will stay indifferent. This type of fixed point is called a <em>saddle point</em>, which occurs if there is one direction along which the system is stable (here the x-axis) and one direction along which the system is unstable (here the y-axis).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-5-1.png" title="plot of chunk unnamed-chunk-5" alt="plot of chunk unnamed-chunk-5" style="display: block; margin: auto;" /></p>
<p>What happens if Romeo’s feelings never change, i.e., $a = 0$? This is visualized as the figure on the left below: Romeo’s feelings will always stay at the initial point. Juliet’s feelings decrease ($d < 0$), so regardless of where she is, the system will end up at a stable fixed point on the x-axis. A similar situation occurs if Juliet’s feelings never change, and Romeo’s feelings decay ($a < 0$), which is visualized in the figure on the right: all points on the y-axis are stable fixed points. If the moving party’s feelings were to grow instead of decay, the fixed points would be unstable. The most boring case is $a = 0$ and $d = 0$, because then every point on the plane is a fixed point: however the two lovers start, they will never change.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
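<p>The cases discussed so far depend only on the signs of $a$ and $d$, which can be summarized in a few lines of R. This is a minimal sketch: the function name <code>classify_uncoupled</code> and the exact wording of the labels are choices made for this illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># classify the uncoupled system dR/dt = a * R, dJ/dt = d * J
# by the signs of a and d (hypothetical helper, labels follow the text)
classify_uncoupled <- function(a, d) {
  if (any(c(a, d) == 0)) return("line or plane of fixed points")
  if (all(c(a, d) < 0)) return("stable fixed point at the origin")
  if (all(c(a, d) > 0)) return("unstable fixed point at the origin")
  "saddle point at the origin"
}

classify_uncoupled(-1, -2)  # both feelings decay
classify_uncoupled(-1, 1)   # Romeo decays, Juliet grows
classify_uncoupled(0, -1)   # Romeo never changes</code></pre></figure>
<p>For example, <code>classify_uncoupled(-1, 1)</code> returns the saddle point label, matching the asymmetric case in which Romeo’s feelings decay while Juliet’s grow.</p>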
<p>Note that in all the love affairs described above, the feelings of Romeo and Juliet are actually <em>independent of each other</em>. They <em>do not communicate</em> with each other, and <em>we all know that communication is key</em>! In the next section, Romeo and Juliet’s relationship matures and they start taking each other seriously. Formally, we <em>couple</em> the two love birds and analyze what types of love this can set free.</p>
<h1 id="coupled-differential-equations">Coupled differential equations</h1>
<p>In the previous section, we saw that the behaviour of the system was determined entirely by the values of $a$ and $d$ — depending on whether $a$ or $d$ were positive, negative, or zero, the system would either be stable or unstable along the $R$ or $J$ dimension. Incorporating communication complicates the system, but is ultimately for the better. To model the fact that Romeo and Juliet now respond to each other’s feelings, we simply write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= aR + bJ\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= cR + dJ \enspace ,
\end{aligned} %]]></script>
<p>or in matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix}
\frac{\mathrm{d}R}{\mathrm{d}t} \\
\frac{\mathrm{d}J}{\mathrm{d}t}
\end{pmatrix} &= \begin{pmatrix} a & b \\ c & d \end{pmatrix} \begin{pmatrix} R \\ J\end{pmatrix} \\[.5em]
\dot{\mathbf{x}} &= A \mathbf{x} \enspace .
\end{aligned} %]]></script>
<p>The classification of such a system is more difficult. In the next section, we introduce one type of relationship between a matured Romeo and Juliet that will motivate a general solution to coupled differential equations.</p>
<h2 id="the-saddle-of-love">The Saddle of love</h2>
<blockquote>
I might not be the right one <br />
It might not be the right time <br />
But there's something about us I've got to do <br />
Some kind of secret I will share with you
</blockquote>
<p>In a previous life, Juliet and Romeo did not communicate ($b = c = 0$) but listened to their own feelings in opposite ways ($a = -1$ and $d = 1$). Here we include communication:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -2R + J\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= -R + 2J \enspace .
\end{aligned} %]]></script>
<p>Specifically, Romeo dampens his feelings the more strongly he feels ($a = -2$) and listens to Juliet such that whichever way her feelings go, Romeo’s follow suit ($b = 1$). Juliet does the opposite: she increases her feelings of love or hate the more strongly she feels ($d = 2$), and responds to Romeo such that whichever way his feelings go, Juliet’s feelings move the other way ($c = -1$). In a sense, Romeo and Juliet are opposites — can any good come from this?</p>
<p>Before answering this question, we first find a general solution to systems of linear differential equations. This gives us a way to formally classify any (linear) relationships between Romeo and Juliet. The solution will involve eigenvectors and eigenvalues, so let’s roll up our sleeves and get to work!</p>
<h2 id="solving-coupled-differential-equations">Solving coupled differential equations</h2>
<p>In contrast to the first system of linear equations above where Romeo and Juliet did not communicate with each other, the system now is <em>coupled</em>: Romeo’s feelings influence Juliet’s and vice versa. Now, if their feelings were instead independent, then the solution to the differential equations would be easy: just as above, their respective feelings would either grow or decay exponentially. The dependence between their feelings is encoded in the matrix $A$. If $A$ were diagonal, then the equations would be independent.</p>
<p>The solution to our problem thus presents itself: somehow, we must manage to make the matrix $A$ diagonal. We can do this by changing basis, a trick we have also used in deriving a <a href="https://fabiandablander.com/r/Fibonacci.html">closed-form expression of the Fibonacci numbers</a> in a previous blog post. If you are unfamiliar with these ideas, it might pay to read the previous blog post before proceeding.</p>
<p>Assuming that $A$ is <em>diagonalizable</em> (more on that later), we can write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A &= E \Lambda E^{-1} \\[.5em]
\begin{pmatrix} a & b \\ c & d \end{pmatrix} &= \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2\end{pmatrix}\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2\end{pmatrix}^{-1} \enspace ,
\end{aligned} %]]></script>
<p>where $(\lambda_1, \lambda_2)$ are the eigenvalues of $A$ and $\mathbf{v}_1$ and $\mathbf{v}_2$ are the respective eigenvectors. Conceptually, multiplying a vector with $E^{-1}$ changes its basis from the standard basis to the basis of eigenvectors. In this space, the matrix encoding the dependence between our two differential equations is the diagonal matrix of eigenvalues $\Lambda$, so the two differential equations are independent! We know that in this space the solutions to the differential equations are independent exponential functions. However, we have to change back to our standard basis, and we do so by multiplying with $E$.</p>
<p>With this insight, our system of differential equations becomes:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\dot{\mathbf{x}} &= E \Lambda E^{-1} \mathbf{x} \\
E^{-1} \dot{\mathbf{x}} &= \Lambda E^{-1} \mathbf{x} \\
\dot{\mathbf{u}} &= \Lambda \mathbf{u} \enspace ,
\end{aligned} %]]></script>
<p>where we have defined $\mathbf{u} = E^{-1}\mathbf{x}$, which is now with respect to the eigenbasis. Now since:</p>
<script type="math/tex; mode=display">% <![CDATA[
\Lambda = \begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix} \enspace , %]]></script>
<p>the solution to the two differential equations is:</p>
<script type="math/tex; mode=display">\mathbf{u} = \begin{pmatrix} C_1e^{\lambda_1 t} \\ C_2e^{\lambda_2 t} \end{pmatrix} \enspace ,</script>
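<p>where $C_1$ and $C_2$ are the constants of integration, which play the role that $R_0$ and $J_0$ played earlier: they are the initial conditions expressed in the eigenbasis, $\mathbf{u}(0) = E^{-1}\mathbf{x}(0)$. To change back to the standard basis, we multiply with $E$:</p>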
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x} &= E \mathbf{u} \\[.5em]
\mathbf{x} &= \begin{pmatrix} \mathbf{v}_1 & \mathbf{v}_2 \end{pmatrix} \begin{pmatrix} C_1e^{\lambda_1 t} \\ C_2e^{\lambda_2 t} \end{pmatrix} \\[.5em]
\mathbf{x} &= \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace ,
\end{aligned} %]]></script>
<p>where $\mathbf{v}_1$ and $\mathbf{v}_2$ are eigenvectors and $\lambda_1$ and $\lambda_2$ are the corresponding eigenvalues. Therefore, solving a system of ordinary linear differential equations reduces to finding the eigenvalues and eigenvectors of the matrix $A$.</p>
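<p>The change of basis can be checked numerically. The sketch below uses the matrix of the saddle system introduced earlier and base R’s <code>eigen</code> and <code>solve</code>; the variable names <code>A_rebuilt</code> and <code>A_eigenbasis</code> are chosen for this illustration only.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># matrix of the saddle system: dR/dt = -2R + J, dJ/dt = -R + 2J
A <- rbind(c(-2, 1), c(-1, 2))

eig <- eigen(A)
E <- eig$vectors            # eigenvectors as columns
Lambda <- diag(eig$values)  # diagonal matrix of eigenvalues

# A should equal E Lambda E^{-1} up to floating-point error
A_rebuilt <- E %*% Lambda %*% solve(E)
max(abs(A - A_rebuilt))

# in the eigenbasis the system matrix is diagonal, so the equations decouple
A_eigenbasis <- solve(E) %*% A %*% E
round(A_eigenbasis, 10)</code></pre></figure>
<p>The off-diagonal entries of <code>A_eigenbasis</code> vanish up to rounding, which is the numerical counterpart of $\dot{\mathbf{u}} = \Lambda \mathbf{u}$.</p>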
<h2 id="finding-eigenvalues-and-eigenvectors">Finding eigenvalues and eigenvectors</h2>
<p>An eigenvector of a matrix is a vector that is only stretched by the matrix by a factor of $\lambda$, such that for $\mathbf{v} \neq \mathbf{0}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A\mathbf{v} &= \lambda \mathbf{v} \\[.5em]
(A - I\lambda) \mathbf{v} &= 0 \enspace ,
\end{aligned} %]]></script>
<p>which has a nonzero solution $\mathbf{v}$ only when the determinant of $(A - I\lambda)$ is zero, that is, $\left\vert A - I\lambda\right\vert = 0$. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\left\vert\begin{pmatrix} a & b \\ c & d\end{pmatrix} - \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix}\right\vert &= 0 \\[1em]
\left\vert\begin{pmatrix} a - \lambda & b \\ c & d - \lambda \end{pmatrix}\right\vert &= 0 \\[1em]
(a - \lambda)(d - \lambda) - bc &= 0 \\[1em]
\lambda^2 - \lambda(a + d) + ad - bc &= 0 \enspace .
\end{aligned} %]]></script>
<p>We define:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\tau &\equiv \text{trace}(A) = a + d \\[.5em]
\Delta &\equiv \vert A\vert = ad - bc \enspace ,
\end{aligned} %]]></script>
<p>and recall the quadratic formula to find both eigenvalues:</p>
<script type="math/tex; mode=display">\lambda = \frac{\tau \pm \sqrt{\tau^2 - 4\Delta}}{2} \enspace .</script>
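<p>The formula translates directly into code. The following is a minimal sketch for $2 \times 2$ matrices; the function name <code>eigenvalues_2x2</code> is hypothetical, and the discriminant is converted to a complex number so that a negative value under the square root does not produce <code>NaN</code>.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># eigenvalues of a 2 x 2 matrix from its trace and determinant
eigenvalues_2x2 <- function(A) {
  tau <- sum(diag(A))                     # trace
  Delta <- det(A)                         # determinant
  disc <- as.complex(tau^2 - 4 * Delta)   # may be negative, hence complex
  (tau + c(1, -1) * sqrt(disc)) / 2
}

# matrix of the saddle system introduced earlier
A <- rbind(c(-2, 1), c(-1, 2))
lambdas <- eigenvalues_2x2(A)

# compare with R's built-in eigendecomposition
sort(Re(lambdas))
sort(eigen(A)$values)</code></pre></figure>
<p>Both approaches should return the same two values up to floating-point error.</p>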
<p>In the next section, we apply this to the “saddle of love” differential equation in order to better understand the trajectories Romeo and Juliet’s love could take.</p>
<h2 id="solving-the-saddle-of-love">Solving the saddle of love</h2>
<p>Recall that we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -2R + J\\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= -R + 2J \enspace ,
\end{aligned} %]]></script>
<p>such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -2 & 1 \\ -1 & 2\end{pmatrix} \enspace . %]]></script>
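<p>For this matrix, the trace and determinant defined above work out as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\tau &= a + d = -2 + 2 = 0 \\[.5em]
\Delta &= ad - bc = (-2)(2) - (1)(-1) = -3 \enspace .
\end{aligned} %]]></script>
<p>Because $\Delta$ is negative, $\tau^2 - 4\Delta$ is positive, so both eigenvalues are real and have opposite signs.</p>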
<p>For our saddle of love, the eigenvalues are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda &= \frac{0 \pm \sqrt{0 - 4\cdot(-3)}}{2} = \frac{\pm \sqrt{4 \cdot 3}}{2} = \pm \sqrt{3} \enspace .
\end{aligned} %]]></script>
<p>To find the first eigenvector, we compute for the first eigenvalue $\lambda_1 = \sqrt{3}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
(A - I\lambda_1)\mathbf{v}_1 &= 0 \\[.5em]
\begin{pmatrix} -2 - \sqrt{3} & 1 \\ -1 & 2 - \sqrt{3} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2\end{pmatrix} &= \begin{pmatrix} 0 \\ 0 \end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>which has solution $\mathbf{v}_1 = (1, 2 + \sqrt{3})^T$. For $\lambda_2 = -\sqrt{3}$, the eigenvector is $\mathbf{v}_2 = (1, 2 - \sqrt{3})^T$. We can verify this in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## eigen() decomposition
## $values
## [1] -1.732051 1.732051
##
## $vectors
## [,1] [,2]
## [1,] -0.9659258 -0.2588190
## [2,] -0.2588190 -0.9659258</code></pre></figure>
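<p>which scales each eigenvector to have unit length by dividing by its norm, and in this case also multiplies by $-1$; this does not matter, as eigenvectors are only defined up to a constant factor. Note that R lists the eigenvalue $-\sqrt{3}$ first here, so the first column of the output corresponds to $\mathbf{v}_2$ and the second column to $\mathbf{v}_1$.</p>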
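<p>The same check can also be done by hand. Multiplying $A$ with $\mathbf{v}_1$ gives:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A \mathbf{v}_1 &= \begin{pmatrix} -2 & 1 \\ -1 & 2 \end{pmatrix} \begin{pmatrix} 1 \\ 2 + \sqrt{3} \end{pmatrix} \\[.5em]
&= \begin{pmatrix} -2 + 2 + \sqrt{3} \\ -1 + 4 + 2\sqrt{3} \end{pmatrix} \\[.5em]
&= \begin{pmatrix} \sqrt{3} \\ 3 + 2\sqrt{3} \end{pmatrix} = \sqrt{3} \begin{pmatrix} 1 \\ 2 + \sqrt{3} \end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>which is exactly $\sqrt{3}$ times $\mathbf{v}_1$. The analogous computation for $\mathbf{v}_2$ yields $-\sqrt{3}$ times $\mathbf{v}_2$.</p>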
<p>Plugging the eigenvalues and eigenvectors into our general solution form yields:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\mathbf{x} &= \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \\[.5em]
\mathbf{x} &= \begin{pmatrix} 1 \\ 2 + \sqrt{3} \end{pmatrix} C_1e^{\sqrt{3} t} + \begin{pmatrix} 1 \\ 2 - \sqrt{3} \end{pmatrix} C_2e^{-\sqrt{3}t} \enspace .
\end{aligned} %]]></script>
<p>We still need to solve for the constants $C_1$ and $C_2$. Assume that at $t = 0$, the feelings of Romeo and Juliet are $\mathbf{x} = (1, 1)^T$. Then we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 \\ 1 \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ 2 + \sqrt{3} & 2 - \sqrt{3} \end{pmatrix} \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \\[.5em]
\begin{pmatrix} 1 & 1 \\ 2 + \sqrt{3} & 2 - \sqrt{3} \end{pmatrix}^{-1}\begin{pmatrix} 1 \\ 1 \end{pmatrix} &= \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \\[.5em]
\begin{pmatrix} 0.21 \\ 0.79 \end{pmatrix} &= \begin{pmatrix} C_1 \\ C_2 \end{pmatrix} \enspace ,
\end{aligned} %]]></script>
<p>which yields the following solutions for Romeo and Juliet:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
R(t) &= 0.21 \cdot e^{\sqrt{3} t} + 0.79 \cdot e^{-\sqrt{3} t} \\[.5em]
J(t) &= 0.79 \cdot e^{\sqrt{3} t} + 0.21 \cdot e^{-\sqrt{3} t} \enspace .
\end{aligned} %]]></script>
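<p>The rounded constants $0.21$ and $0.79$ have a simple exact form. Writing out the two equations for the initial condition and substituting $C_2 = 1 - C_1$ from the first into the second gives:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
C_1 + C_2 &= 1 \\[.5em]
(2 + \sqrt{3}) C_1 + (2 - \sqrt{3}) C_2 &= 1 \\[.5em]
2\sqrt{3} \, C_1 + 2 - \sqrt{3} &= 1 \\[.5em]
C_1 &= \frac{\sqrt{3} - 1}{2\sqrt{3}} = \frac{3 - \sqrt{3}}{6} \approx 0.21 \\[.5em]
C_2 &= 1 - C_1 = \frac{3 + \sqrt{3}}{6} \approx 0.79 \enspace .
\end{aligned} %]]></script>
<p>The coefficients in Juliet’s solution follow from multiplying with the second entries of the eigenvectors: $(2 + \sqrt{3}) C_1 = C_2$ and $(2 - \sqrt{3}) C_2 = C_1$, which is why the same two numbers appear in swapped order in $R(t)$ and $J(t)$.</p>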
<p>Note how this result differs from when Romeo and Juliet did not communicate: the solution is a linear combination of two exponentials — the two lovebirds are clearly coupled! Now that we have seen one worked example, the code below computes the trajectory of Romeo and Juliet for an arbitrary matrix $A$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">solve_linear</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">tmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">500</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># compute eigenvectors and eigenvalues</span><span class="w">
</span><span class="n">eig</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">vectors</span><span class="w">
</span><span class="n">lambdas</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">values</span><span class="w">
</span><span class="c1"># solve for the initial condition</span><span class="w">
</span><span class="n">C</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">inits</span><span class="w">
</span><span class="c1"># create time steps</span><span class="w">
</span><span class="n">ts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tmax</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ts</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="p">(</span><span class="n">C</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">exp</span><span class="p">(</span><span class="n">lambdas</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="n">t</span><span class="p">))</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="c1"># Re drops the imaginary part ... more on that later!</span><span class="w">
</span><span class="nf">Re</span><span class="p">(</span><span class="n">x</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The code for visualizing vector fields for two coupled linear differential equations is given below.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'fields'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_vector_field</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">.50</span><span class="p">)</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">.50</span><span class="p">)</span><span class="w">
</span><span class="n">RJ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">as.matrix</span><span class="p">(</span><span class="n">expand.grid</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w">
</span><span class="n">dRJ</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">A</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">t</span><span class="p">(</span><span class="n">RJ</span><span class="p">))</span><span class="w">
</span><span class="n">plot</span><span class="p">(</span><span class="w">
</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">,</span><span class="w"> </span><span class="n">type</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="n">axes</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="kc">FALSE</span><span class="p">,</span><span class="w">
</span><span class="n">xlab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">ylab</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">''</span><span class="p">,</span><span class="w"> </span><span class="n">main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">title</span><span class="p">,</span><span class="w"> </span><span class="n">cex.main</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">arrow.plot</span><span class="p">(</span><span class="w">
</span><span class="n">RJ</span><span class="p">,</span><span class="w"> </span><span class="n">dRJ</span><span class="p">,</span><span class="w">
</span><span class="n">arrow.ex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.1</span><span class="p">,</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.05</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'gray82'</span><span class="w">
</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">3.9</span><span class="p">,</span><span class="w"> </span><span class="m">-.2</span><span class="p">,</span><span class="w"> </span><span class="s1">'R'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">text</span><span class="p">(</span><span class="m">-.2</span><span class="p">,</span><span class="w"> </span><span class="m">3.9</span><span class="p">,</span><span class="w"> </span><span class="s1">'J'</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.25</span><span class="p">,</span><span class="w"> </span><span class="n">font</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-4</span><span class="p">,</span><span class="w"> </span><span class="m">4</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">4</span><span class="p">,</span><span class="w"> </span><span class="m">-4</span><span class="p">),</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>Before we visualize the vector field, let me again stress that the solution to a system of two coupled linear differential equations is of the form:</p>
<script type="math/tex; mode=display">\mathbf{x} = \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace .</script>
<p>The eigenvectors coincide with the standard basis vectors when the two differential equations are independent, as was the case above when Romeo and Juliet did not communicate. In such cases, exponential growth or decay is along the standard basis vectors, i.e., the x- and y-axes. For the case we are considering now, this is not true: the eigenvectors are different from the standard basis vectors. It therefore makes sense to visualize the eigenvectors, as they are in some sense more fundamental to the solution. However, we want to retain the interpretability of the standard basis, as this is our reference frame for the initial condition. In the following visualizations, therefore, we add the eigenvectors, which makes apparent exactly in which directions there is exponential growth or decay.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">plot_eigenvectors</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="o">$</span><span class="n">vectors</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="m">4</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">],</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="n">arrows</span><span class="p">(</span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="o">-</span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">E</span><span class="p">[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">],</span><span class="w"> </span><span class="n">length</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">lty</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">...</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">add_line</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">lines</span><span class="p">(</span><span class="n">solve_linear</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">inits</span><span class="p">),</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'red'</span><span class="p">,</span><span class="w"> </span><span class="n">lwd</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.5</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">-2</span><span class="p">,</span><span class="w"> </span><span class="m">-1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">))</span><span class="w">
</span><span class="n">plot_vector_field</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">title</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'The Saddle of Love'</span><span class="p">)</span><span class="w">
</span><span class="n">plot_eigenvectors</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">.75</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">add_line</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="o">-</span><span class="nf">c</span><span class="p">(</span><span class="m">3.5</span><span class="p">,</span><span class="w"> </span><span class="m">.75</span><span class="p">))</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1.7</span><span class="p">)</span><span class="w">
</span><span class="n">points</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">cex</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">2.2</span><span class="p">,</span><span class="w"> </span><span class="n">col</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'white'</span><span class="p">,</span><span class="w"> </span><span class="n">pch</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
<p>The figure above visualizes the resulting vector field, the standard basis (solid lines), the eigenvectors (dashed lines), and four example trajectories (red lines). The eigenvectors define different quadrants than the standard basis. If Romeo and Juliet start in the top right or top left eigenquadrant, then their love grows exponentially. If they start in the bottom left or bottom right eigenquadrant, their hate grows exponentially. Note that we again have a saddle point, as there is exponential decay along one eigenvector and exponential growth along the other; only if Romeo and Juliet’s initial feelings are exactly on the decaying eigenvector do we end up in a state of indifference.</p>
<!-- An interesting case is if Juliet starts out positive while Romeo has initial feelings of hate, but not too much so that they are in the bottom left eigenquadrant, their love grows eternally. This makes sense: Romeo downweights his own feelings ($b = -2$) and is positively influenced by Juliet's love ($d = 1$). -->
<!-- On the other hand, if Juliet starts out negative and Romeo starts with love, they increasingly hate each other. This is again reasonable, as Romeo downplays his own positive feelings and "takes over" Juliet's negative ones. So, too, is their fate when both start out with hate. -->
<p>In the next section, we go beyond the saddle of love and study what different matrices $A$ imply for the stability landscape of love affairs.</p>
<h1 id="a-classification-of-linear-systems">A classification of linear systems</h1>
<p>All we need to know to classify the relationship between Romeo and Juliet is the trace $\tau = a + d$ and the determinant $\Delta = ad - bc$ of the matrix $A$. We can rewrite these in terms of eigenvalues:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda_1 + \lambda_2 &= \frac{1}{2}\left(\tau + \sqrt{\tau^2 - 4\Delta}\right) + \frac{1}{2}\left(\tau - \sqrt{\tau^2 - 4\Delta}\right) = \tau \\[.5em]
\lambda_1 \lambda_2 &= \frac{1}{2}\left(\tau + \sqrt{\tau^2 - 4\Delta}\right)\frac{1}{2}\left(\tau - \sqrt{\tau^2 - 4\Delta}\right) \\[.5em]
&= \frac{1}{4} \left(\tau^2 - \tau^2 + 4\Delta\right) \\[.5em]
&= \Delta \enspace ,
\end{aligned} %]]></script>
<p>which means that we can characterize a linear system solely by its eigenvalues. If $\lambda_1 < 0$ we have exponential decay and if $\lambda_1 > 0$ we have exponential growth in the direction of the first eigenvector, $\mathbf{v}_1$. The same holds for $\lambda_2$ and $\mathbf{v}_2$.</p>
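<p>As a quick numerical sanity check of these two identities, the following minimal sketch compares the sum and product of the eigenvalues with the trace and determinant. The example matrix is an arbitrary choice made for illustration; any $2 \times 2$ matrix should behave the same way.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Arbitrary example matrix (an assumption made for illustration)
A <- rbind(c(-1, 0.5), c(1, -1))

lambda <- eigen(A)$values   # eigenvalues of A
tau <- sum(diag(A))         # trace: a + d
Delta <- det(A)             # determinant: ad - bc

# Each pair should agree up to floating point error
c(sum(lambda), tau)
c(prod(lambda), Delta)</code></pre></figure>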
<h2 id="keepin-it-real-attracting-and-repelling-nodes">Keepin’ it real: Attracting and repelling nodes</h2>
<blockquote>
No it ain't no use in callin' out my name, gal <br />
Like you never done before <br />
And it ain't no use in callin' out my name, gal <br />
I can't hear ya any more.
</blockquote>
<p>If $\tau^2 - 4\Delta > 0$, both eigenvalues are real. If both are negative, then the origin is an attracting fixed point; if they are positive, the origin is a repelling fixed point. As an example, take this matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
A_{1} = \begin{pmatrix} -1 & 0.50 \\ 1 & -1\end{pmatrix} \enspace , %]]></script>
<p>which means that Romeo downplays his feelings as strongly as Juliet ($a = d = -1$), but is influenced only half as strongly by Juliet’s feelings ($b = 0.50$) as Juliet is by his ($c = 1$).</p>
<p>The matrix:</p>
<script type="math/tex; mode=display">% <![CDATA[
A_{2} = \begin{pmatrix} 1 & 0.50 \\ 0.25 & 0.50 \end{pmatrix} \enspace , %]]></script>
<p>shows that both Romeo and Juliet reinforce each other’s feelings ($b = 0.50$ and $c = 0.25$) as well as their own ($a = 1$ and $d = 0.50$). We know from above that this cannot be mathematically stable!</p>
<p>The figure on the left below shows that indifference is the result of the relationship governed by $A_1$, regardless of the starting point. Nodes generally have a slow and a fast eigendirection; the larger the eigenvalue in absolute value, the stronger the pull in the direction of the corresponding eigenvector. For the stable node on the left, the fast eigendirection is clearly given by the negative eigenvector: all trajectories are strongly pulled in its direction; only gradually are they pulled in the other eigendirection until they end up at the origin.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-12-1.png" title="plot of chunk unnamed-chunk-12" alt="plot of chunk unnamed-chunk-12" style="display: block; margin: auto;" /></p>
<p>The figure on the right shows the relationship governed by $A_2$, which yields a more tumultuous love affair. In particular, Romeo and Juliet always have opposite feelings toward each other that also grow exponentially: Romeo becomes madder and madder in love with Juliet while Juliet becomes more and more hateful towards him, or the reverse — it doesn’t matter how loud one of them calls the other, there will be no positive response. The fast eigendirection is now given by the positive eigenvector; all trajectories initially go up (or down) a bit, before they get pulled heavily in the eigenvector’s direction, moving almost parallel to it.</p>
<p>In both the above cases, the eigenvalues are distinct. This allows one eigendirection to be slow and the other fast. In the next section, we look at what happens when both eigenvalues are equal.</p>
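<p>To see where these claims come from, we can simply ask R for the eigenvalues of both matrices. This is a small sketch; the names <code>A1</code> and <code>A2</code> are introduced here only for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">A1 <- rbind(c(-1, 0.50), c(1, -1))
A2 <- rbind(c(1, 0.50), c(0.25, 0.50))

eigen(A1)$values   # both negative: the origin attracts
eigen(A2)$values   # both positive: the origin repels

# The fast eigendirection belongs to the eigenvalue that is largest in absolute value
fast_direction <- function(A) {
  eig <- eigen(A)
  eig$vectors[, which.max(abs(eig$values))]
}
fast_direction(A1)
fast_direction(A2)</code></pre></figure>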
<h2 id="one-dimensional-love">One-dimensional love</h2>
<blockquote>
Ah, now I don't hardly know her <br />
But I think I could love her <br />
Crimson and clover
</blockquote>
<p>If $\tau^2 - 4\Delta = 0$, the matrix $A$ does not have distinct eigenvalues. We can distinguish two cases. First, as in our very first example, we could have:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix} \enspace , %]]></script>
<p>which yields a <em>star node</em>: all directions point either to the origin ($\lambda < 0$) or away from it ($\lambda > 0$). We have visualized this vector field for $\lambda = -1$ and $\lambda = 1$ when Romeo met Juliet, so we do not visualize it here. In this case, $A$ is <em>diagonalizable</em>, that is, we can find matrices $\Lambda$ and $E$ such that:</p>
<script type="math/tex; mode=display">A = E \Lambda E^{-1} \enspace .</script>
<p>To see this in R, assume that $\lambda = -1$. The following code should give us $A$ back.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">check_decomposition</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">eig</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">eig</span><span class="o">$</span><span class="n">vectors</span><span class="w">
</span><span class="n">Lambda</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="n">eig</span><span class="o">$</span><span class="n">values</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">Lambda</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">A</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="o">-</span><span class="n">diag</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">check_decomposition</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1] [,2]
## [1,] -1 0
## [2,] 0 -1</code></pre></figure>
<p>For $A$ to be diagonalizable, we require that $E$, the matrix of eigenvectors, is invertible. A matrix is invertible if it is <em>full rank</em>, which requires that the eigenvectors be independent, that is, they must span the plane. This brings us to the second case. Assume that:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -1 & -1 \\ 0 & -1 \end{pmatrix} \enspace . %]]></script>
<p>Then the two eigenvalues are again equal, but <em>the eigenvectors are not independent</em>. We can still compute the eigendecomposition in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">A</span><span class="p">[</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">-1</span><span class="w">
</span><span class="n">eigen</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## eigen() decomposition
## $values
## [1] -1 -1
##
## $vectors
## [,1] [,2]
## [1,] 1 1.000000e+00
## [2,] 0 2.220446e-16</code></pre></figure>
<p>The only eigenvector is $\mathbf{v}_1 = (1, 0)^T$, even though R tells us that there are two distinct ones due to numerical imprecision. If we were to diagonalize the matrix, we would get an error, since $E$ is <em>singular</em>, that is, not invertible:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">check_decomposition</span><span class="p">(</span><span class="n">A</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## Error in solve.default(E): system is computationally singular: reciprocal condition number = 1.11022e-16</code></pre></figure>
<p>We can, however, still visualize the vector field. We now have a <em>degenerate node</em> in which all trajectories are parallel to the eigenvector (which in this case is the x-axis, since $\mathbf{v}_1 = (1, 0)^T$).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-16-1.png" title="plot of chunk unnamed-chunk-16" alt="plot of chunk unnamed-chunk-16" style="display: block; margin: auto;" /></p>
<p>While we can plot the vector field, we cannot use our diagonalization trick to compute a closed-form solution, since we cannot invert $E$. We could use numerical methods to compute trajectories; I will discuss this in more detail in a follow-up post on nonlinear differential equations for which we generally cannot get a closed-form expression. However, we can get such an expression for linear systems even if $A$ is not diagonalizable by using <em>matrix exponentials</em>. Since this would take us a little too far here, I defer this treatment to the <em>Post Scriptum</em>.</p>
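<p>Since the error only appears once we try to invert $E$, it can be useful to test for this situation beforehand. The sketch below is one possible way of doing so, not the only one: it uses the reciprocal condition number of $E$, computed with base R's <code>rcond</code>, and returns <code>NULL</code> instead of failing. The function name <code>safe_decomposition</code> and the tolerance <code>tol</code> are assumptions made for illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">safe_decomposition <- function(A, tol = 1e-10) {
  eig <- eigen(A)
  E <- eig$vectors

  # A reciprocal condition number near zero means E is numerically singular,
  # that is, the eigenvectors are not independent
  if (rcond(E) < tol) {
    return(NULL)
  }

  E %*% diag(eig$values) %*% solve(E)
}

safe_decomposition(-diag(2))                        # star node: returns A
safe_decomposition(rbind(c(-1, -1), c(0, -1)))      # degenerate node: returns NULL</code></pre></figure>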
<p>In the next two sections, we complete our classification of linear systems by allowing Romeo and Juliet’s love to oscillate.</p>
<h2 id="spiralling-love">Spiralling love</h2>
<blockquote>
Sometimes I feel so happy <br />
Sometimes I feel so sad <br />
Sometimes I feel so happy <br />
But mostly you just make me mad <br />
Baby, you just make me mad
</blockquote>
<p>Observe that:</p>
<script type="math/tex; mode=display">\lambda = \frac{\tau}{2} \pm \frac{\sqrt{\tau^2 - 4\Delta}}{2} \enspace ,</script>
<p>which will be complex if $\tau^2 - 4\Delta < 0$. We rewrite the eigenvalues slightly:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda &= \frac{\tau}{2} \pm \frac{\sqrt{\tau^2 - 4\Delta}}{2} \\[.5em]
&= \frac{\tau}{2} \pm \frac{\sqrt{-1}\sqrt{4\Delta - \tau^2}}{2} \\[.5em]
&= \alpha \pm i\omega \enspace ,
\end{aligned} %]]></script>
<p>where $\alpha = \tau / 2$ and $\omega = \sqrt{4\Delta - \tau^2} / 2$. The solution to the system of differential equations is still of the form:</p>
<script type="math/tex; mode=display">\mathbf{x} = \mathbf{v}_1 C_1e^{\lambda_1 t} + \mathbf{v}_2 C_2e^{\lambda_2 t} \enspace .</script>
<p>However, the $\lambda$’s are now complex which results in:</p>
<script type="math/tex; mode=display">e^{\lambda t} = e^{(\alpha \pm i \omega)t} = e^{\alpha t} e^{\pm i\omega t} = e^{\alpha t} \left[\cos(\omega t) \pm i \sin(\omega t) \right] \enspace .</script>
<p>For $\alpha < 0$ and $\omega \neq 0$ we have <em>dampened oscillations</em>: they decay exponentially. For $\alpha > 0$ and $\omega \neq 0$ we have <em>amplifying oscillations</em>: they grow exponentially. To see this visually, let’s take the matrix</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -0.20 & -1 \\ 1 & 0\end{pmatrix} \enspace . %]]></script>
<p>This implies that Romeo dampens his own feelings slightly ($a = -0.20$) and feels more love when Juliet hates him and more hate if Juliet loves him ($b = -1$). On the other hand, Juliet does not listen to her own feelings ($d = 0$) and mimics Romeo’s feelings ($c = 1$). Where does this lead the two love birds?</p>
<p>The figure below on the left visualizes the vector field and one trajectory of love. The figure on the right visualizes Romeo and Juliet’s trajectory separately.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-17-1.png" title="plot of chunk unnamed-chunk-17" alt="plot of chunk unnamed-chunk-17" style="display: block; margin: auto;" /></p>
<p>Although both lovers start at mutual affection, over the course of their relationship, they feel happy, then sad, then happy, then sad, until they don’t feel anymore. If, on the other hand, we change $A$ to</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} 0.10 & -1 \\ 1 & 0\end{pmatrix} \enspace , %]]></script>
<p>we have $\alpha = 0.05$, which is positive. This implies slower growth than we had decay before ($\alpha = -0.10$). If we allow both lovers only an ounce of mutual affection $(0.1, 0.1)$, they will spiral forever, their feelings always growing, always changing.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-18-1.png" title="plot of chunk unnamed-chunk-18" alt="plot of chunk unnamed-chunk-18" style="display: block; margin: auto;" /></p>
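<p>The values of $\alpha$ quoted for the two spirals follow directly from $\tau$ and $\Delta$. The sketch below computes $\alpha$ and $\omega$ from these two quantities and compares them with the eigenvalues that R reports; the helper <code>spiral_parts</code> and the names <code>A_decay</code> and <code>A_grow</code> are hypothetical and only serve this illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">spiral_parts <- function(A) {
  tau <- sum(diag(A))
  Delta <- det(A)
  # Only meaningful when tau^2 - 4 * Delta is negative (complex eigenvalues)
  c(alpha = tau / 2, omega = sqrt(4 * Delta - tau^2) / 2)
}

A_decay <- rbind(c(-0.20, -1), c(1, 0))   # dampened oscillations
A_grow <- rbind(c(0.10, -1), c(1, 0))     # amplifying oscillations

spiral_parts(A_decay)   # alpha is -0.10
spiral_parts(A_grow)    # alpha is 0.05

# The eigenvalues form the complex conjugate pair alpha +/- i * omega
eigen(A_decay)$values</code></pre></figure>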
<p>I encourage you to play around with the code a bit to get an intuition for these things. In the next section, we look at a special case of this linear system before we wrap up.</p>
<h2 id="the-circle-of-love">The circle of love</h2>
<blockquote>
Oh, so long, Marianne <br />
It's time that we began to laugh <br />
And cry and cry and laugh about it all again. <br />
</blockquote>
<p>An interesting special case of the spiral of love occurs when $\alpha = 0$ such that all eigenvalues are imaginary. As an example, let $a = 0$, $b = -1$, $c = 1$, and $d = 0$ such that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}R}{\mathrm{d}t} &= -J \\[.5em]
\frac{\mathrm{d}J}{\mathrm{d}t} &= R \enspace .
\end{aligned} %]]></script>
<p>Romeo and Juliet do not listen to their own feelings anymore, but only to their partner’s feelings. However, they do so in opposite ways. For Romeo, this model implies that when Juliet’s feelings for him are high ($J > 0$), Romeo’s feelings for Juliet <em>decrease</em>. If they are low ($J < 0$), then his feelings <em>increase</em>. For Juliet, it is exactly the opposite: when Romeo’s feelings are strong ($R > 0$), her feelings <em>increase</em>, while when his feelings wane ($R < 0$), her feelings <em>decrease</em>. Is this a (mathematically) stable relationship? To find out, we visualize the vector field below as well as three love trajectories.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-19-1.png" title="plot of chunk unnamed-chunk-19" alt="plot of chunk unnamed-chunk-19" style="display: block; margin: auto;" /></p>
<p>Romeo and Juliet are stuck in a never-ending circle! Regardless of the starting point, they will be prisoners to the Sisyphean circle of love which will make them laugh and cry and cry and laugh about it all again. Except, of course, when they start at the origin $(0, 0)$: if they start with indifference, they will forever stay indifferent. Note that the fixed point is now called a <em>center</em>, which is <em>neutrally stable</em>, since nearby trajectories are neither attracted to nor repelled from the fixed point.</p>
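<p>For this particular system, the trajectories can also be written down with sines and cosines: $R(t) = R_0 \cos t - J_0 \sin t$ and $J(t) = R_0 \sin t + J_0 \cos t$, which one can verify by differentiating. The following sketch, with the hypothetical helper <code>circle_of_love</code>, uses this to check that the distance from the origin never changes.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">circle_of_love <- function(inits, times) {
  R0 <- inits[1]
  J0 <- inits[2]
  # Closed-form solution of dR/dt = -J, dJ/dt = R
  cbind(
    R = R0 * cos(times) - J0 * sin(times),
    J = R0 * sin(times) + J0 * cos(times)
  )
}

times <- seq(0, 20, length.out = 500)
x <- circle_of_love(c(1, 1), times)

# The radius sqrt(R^2 + J^2) stays at its initial value, sqrt(2), throughout
range(sqrt(rowSums(x^2)))</code></pre></figure>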
<p>We have started and ended our journey of relationships with two extremes: ignoring the other’s feelings and ignoring one’s own. Both are unhealthy. <em>Communication is key</em>. In the next section, we recap the types of linear systems we have seen in this blog post.</p>
<h1 id="classification-recap">Classification recap</h1>
<blockquote>
She took off a silver locket <br />
She said remember me by this <br />
She put her hand in my pocket <br />
I got a keepsake and a kiss
</blockquote>
<p>The figure below summarizes the classification of linear systems we have, step by step, developed in this blog post (see also Strogatz, 2015, p. 140).</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-20-1.png" title="plot of chunk unnamed-chunk-20" alt="plot of chunk unnamed-chunk-20" style="display: block; margin: auto auto auto 0;" /></p>
<p>If $A$ is the zero matrix, so that both $\tau = 0$ and $\Delta = 0$, the eigenvalues are zero and the solution is a constant: Romeo and Juliet’s feelings will forever stay wherever they started; we have a plane of fixed points. If $\Delta = 0$ but $\tau \neq 0$, one eigenvalue is zero and the other equals $\tau$: feelings stay constant along one eigendirection and either exponentially grow or decay along the other, so that we have a line of fixed points.</p>
<p>Saddle points occur when $\Delta < 0$, whatever the value of $\tau$, which implies that one eigenvalue is positive and the other is negative, that is, we have exponential growth in one eigendirection and exponential decay in the other; the fixed point $(0, 0)$ is generally unstable, except when the initial condition is exactly on the vector along which there is exponential decay.</p>
<p>If $\tau = 0$ and $\Delta > 0$, all eigenvalues are imaginary, resulting in a <em>center</em>: the circle of love. These become <em>spirals</em> if $\tau \neq 0$ (as long as $\tau^2 - 4\Delta < 0$), since the eigenvalues now have a real part, which results in amplifying oscillations ($\tau > 0$) or dampened oscillations ($\tau < 0$).</p>
<p>On the parabola described by $\tau^2 - 4\Delta = 0$ we have repeated eigenvalues. If the resulting eigenvectors are independent, we have a <em>star node</em> in which all directions either point towards the origin ($\lambda < 0$) or away from it ($\lambda > 0$).</p>
<p>If the resulting eigenvectors are not independent, we have a <em>degenerate node</em>; we cannot invert the matrix of eigenvectors anymore and thus need to use other methods. One such method is provided by matrix exponentials — see the <em>Post Scriptum</em>.</p>
<p>Above the parabola, we have either <em>stable nodes</em> for $\tau < 0$ or <em>unstable nodes</em> for $\tau > 0$.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
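<p>The recap can be condensed into a small function that takes a matrix $A$ and returns the type of fixed point based only on $\tau$ and $\Delta$. This is a minimal sketch: the function name <code>classify_linear</code>, the labels it returns, and the tolerance <code>tol</code> used to decide when a quantity counts as zero are all assumptions made for illustration. It cannot tell star nodes from degenerate nodes, since that depends on the eigenvectors and not on $\tau$ and $\Delta$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">classify_linear <- function(A, tol = 1e-10) {
  tau <- sum(diag(A))
  Delta <- det(A)
  disc <- tau^2 - 4 * Delta

  # Negative determinant: one positive and one negative eigenvalue
  if (Delta < -tol) return('saddle')

  # Zero determinant: at least one eigenvalue is zero
  if (abs(Delta) <= tol) return('line or plane of fixed points')

  # From here on the determinant is positive
  if (abs(tau) <= tol) return('center')

  stability <- if (tau < 0) 'stable' else 'unstable'
  if (disc < -tol) return(paste(stability, 'spiral'))
  if (abs(disc) <= tol) return(paste(stability, 'star or degenerate node'))
  paste(stability, 'node')
}

classify_linear(cbind(c(-2, -1), c(1, 2)))      # the saddle of love
classify_linear(rbind(c(-0.20, -1), c(1, 0)))   # spiralling love
classify_linear(rbind(c(0, -1), c(1, 0)))       # the circle of love</code></pre></figure>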
<h1 id="conclusion">Conclusion</h1>
<blockquote>
When you can fall for chains of silver, you can fall for chains of gold <br />
You can fall for pretty strangers and the promises they hold <br />
You promised me everything, you promised me thick and thin, yeah <br />
Now you just say "Oh, Romeo, yeah, you know I used to have a scene with him"
</blockquote>
<p>In this blog post, we have seen that linear differential equations are a powerful tool to model how systems change over time in general, and how the love affair between two lovebirds can evolve in particular. We have started out with an isolated Romeo whose feelings either exponentially grow or decay. Romeo then met Juliet, and we have extended the single differential equation to a system of two equations to accommodate this life event.</p>
<p>Love affairs can take many shapes and forms. We have classified those depending on their stability landscape, and seen that linear differential equations can be solved in closed-form by using eigenvectors and eigenvalues or matrix exponentials. In a follow-up blog post, Romeo and Juliet’s love will overcome the shackles of linearity, and we end up with nonlinear differential equations. This will make for more intriguing relationships. We will also add a third lover and study how the dynamics change — it might get chaotic!</p>
<hr />
<p>I would like to thank <a href="https://ryanoisin.github.io/">Oisín Ryan</a> for discussion as well as extensive and very helpful comments on this blog post.</p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<h3 id="solving-differential-equations-using-matrix-exponentials">Solving differential equations using matrix exponentials</h3>
<!-- <blockquote> -->
<!-- There must be some kind of way outta here <br> -->
<!-- Said the joker to the thief <br> -->
<!-- There's too much confusion <br> -->
<!-- I can't get no relief -->
<!-- </blockquote> -->
<p>Recall that the solution to the single linear differential equation $\frac{\mathrm{d}x}{\mathrm{d}t} = ax$ is:</p>
<script type="math/tex; mode=display">x(t) = x_0 e^{at} \enspace .</script>
<p>The series expansion of $e^{at}$ is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{at} &= 1 + at + \frac{(at)^2}{2!} + \frac{(at)^3}{3!} + \ldots \\[.5em]
&= \sum_{k=0}^{\infty} \frac{(at)^k}{k!} \enspace .
\end{aligned} %]]></script>
<p>The idea is to generalize this to allow for a matrix in the exponent. In particular, analogously to the one-dimensional case, we want the system</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = A\mathbf{x} \enspace ,</script>
<p>to have solutions of the form:</p>
<script type="math/tex; mode=display">\mathbf{x}(t) = \mathbf{x}_0e^{At} \enspace .</script>
<p>First, we generalize the series expansion of $e^{at}$ to matrices:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{At} &= I + At + \frac{(At)^2}{2!} + \frac{(At)^3}{3!} + \ldots \\[.5em]
&= \sum_{k=0}^{\infty} \frac{t^k}{k!} A^k \enspace ,
\end{aligned} %]]></script>
<p>where</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
A^0 &= I \\[.5em]
A^k &= \underbrace{A \cdot A \cdot \ldots \cdot A}_{\text{k times}} \enspace .
\end{aligned} %]]></script>
<p>With this definition, we assume that $\mathbf{x} = \mathbf{x}_0 e^{At}$ and check whether it is true that:</p>
<script type="math/tex; mode=display">\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t} = \frac{\mathrm{d}\mathbf{x}_0e^{At}}{\mathrm{d}t} = A \mathbf{x} = A \mathbf{x}_0 e^{At} \enspace .</script>
<p>Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{\mathrm{d}\mathbf{x}_0e^{At}}{\mathrm{d}t} &= \mathbf{x}_0 \left(0 + A + \frac{2A^2 t}{2!} + \frac{3A^3t^2}{3!} + \ldots\right) \\[.5em]
&= \mathbf{x}_0 \left(A + A^2t + \frac{A^3t^2}{2!} + \ldots\right) \\[.5em]
&= \mathbf{x}_0 A\left(I + At + \frac{A^2t^2}{2!} + \ldots \right) \\[.5em]
&= \mathbf{x}_0 A e^{At} \\[.5em]
&= A \mathbf{x}_0 e^{At} \\[.5em]
&= A \mathbf{x} \enspace ,
\end{aligned} %]]></script>
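<p>which shows that, indeed, the matrix exponential of $A$ gives a solution to a system of linear differential equations! Note that, since $\mathbf{x}_0$ is a column vector, the product is to be read as $e^{At}\mathbf{x}_0$; the steps above go through in that order because $A$ commutes with $e^{At}$.</p>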
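<p>The series definition translates into a few lines of R. The sketch below simply truncates the sum after <code>K</code> terms; the function name <code>expm_series</code> and the default <code>K = 30</code> are assumptions made for illustration, and a dedicated routine is preferable in practice. As a check, we use the non-diagonalizable matrix from the degenerate node: writing $A = -I + N$ with $N^2 = 0$, the series collapses to $e^{At} = e^{-t}(I + tN)$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">expm_series <- function(A, t = 1, K = 30) {
  result <- diag(nrow(A))   # the k = 0 term, the identity matrix
  term <- diag(nrow(A))

  for (k in seq_len(K)) {
    # Turn (At)^(k-1) / (k-1)! into (At)^k / k!
    term <- term %*% (A * t) / k
    result <- result + term
  }

  result
}

A <- rbind(c(-1, -1), c(0, -1))
N <- rbind(c(0, -1), c(0, 0))

expm_series(A, t = 2)
exp(-2) * (diag(2) + 2 * N)   # closed form for this particular A</code></pre></figure>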
<!-- Why do we care? Our motivating example was that we cannot use the eigen decomposition to solve a system of linear differential equations when the eigenvectors are not independent, since the resulting matrix is not invertible. Using the matrix exponential, however, there is no mention of eigenvectors. -->
<p>The matrix exponential solution <em>generalizes</em> the solution using eigendecomposition to non-diagonalizable matrices $A$. For a diagonalizable matrix $A$, we can connect the approach of using the <a href="https://en.wikipedia.org/wiki/Matrix_exponential">matrix exponential</a> to solve a system of linear differential equations to the eigendecomposition approach we have discussed above. Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
e^{A} &= E e^{\Lambda} E^{-1} \\[.5em]
&= E \begin{pmatrix} e^{\lambda_1} & 0 \\ 0 & e^{\lambda_2}\end{pmatrix} E^{-1} \enspace ,
\end{aligned} %]]></script>
<p>which follows by noting that the matrix exponential of a diagonal matrix is given by simply exponentiating each diagonal element. This is then the solution in the eigenbasis, which we transform back by multiplying with $E$, as we have done earlier. For diagonalizable matrices, this is a very convenient way of computing the matrix exponential. For matrices that are not diagonalizable, this is not possible and one needs to rely on other ways of computing the matrix exponential (see Moler & Van Loan, 2003).</p>
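<p>As a rough numerical illustration of this identity, the sketch below compares the eigendecomposition route with a truncated version of the power series that defines the matrix exponential. The helper names <em>expm_eigen</em> and <em>expm_series</em> and the example matrix are assumptions made for this illustration; they do not appear elsewhere in this post, and only base R is used.</p>

```r
# Matrix exponential of a diagonalizable matrix via its eigendecomposition:
# e^A = E e^Lambda E^{-1}
expm_eigen <- function(A) {
  eig <- eigen(A)
  E <- eig$vectors
  E %*% diag(exp(eig$values)) %*% solve(E)
}

# Truncated power series I + A + A^2/2! + ... (illustrative, not production code)
expm_series <- function(A, terms = 30) {
  result <- diag(nrow(A))    # A^0 = I
  term <- diag(nrow(A))
  for (k in seq_len(terms)) {
    term <- term %*% A / k   # now holds A^k / k!
    result <- result + term
  }
  result
}

# Hypothetical diagonalizable example with eigenvalues -1 and -3
A <- matrix(c(-2, 1, 1, -2), nrow = 2, byrow = TRUE)

# Largest absolute difference between the two approaches
max(abs(expm_eigen(A) - expm_series(A)))
```

<p>The two results should agree up to floating-point error. For a matrix with repeated eigenvalues and too few eigenvectors, <em>solve(E)</em> would fail or be numerically unstable, which is where the series (or a dedicated routine) is needed.</p>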
<p>To return to our initial problem: we want an expression for the solution of the system described by:</p>
<script type="math/tex; mode=display">% <![CDATA[
A = \begin{pmatrix} -1 & -1 \\ 0 & -1 \end{pmatrix} \enspace \enspace , %]]></script>
<p>in order to easily compute the trajectory of Romeo and Juliet’s feelings. Assuming that $\mathbf{x}_0 = (1, 1)$, the solution to the system is:</p>
<script type="math/tex; mode=display">\mathbf{x}(t) = \begin{pmatrix} 1 \\ 1\end{pmatrix} e^{At} \enspace ,</script>
<p>which we can implement straightforwardly in R.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'expm'</span><span class="p">)</span><span class="w">
</span><span class="n">solve_linear2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">A</span><span class="p">,</span><span class="w"> </span><span class="n">inits</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">tmax</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">20</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1000</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="c1"># create time steps</span><span class="w">
</span><span class="n">ts</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">tmax</span><span class="p">,</span><span class="w"> </span><span class="n">length.out</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncol</span><span class="p">(</span><span class="n">A</span><span class="p">))</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">t</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ts</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w">
</span><span class="n">x</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expm</span><span class="p">(</span><span class="n">A</span><span class="o">*</span><span class="n">t</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">inits</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">x</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The figure below visualizes a few trajectories of this system that were hitherto uncomputable using the eigendecomposition.</p>
<p><img src="/assets/img/2019-08-29-Linear-Love.Rmd/unnamed-chunk-22-1.png" title="plot of chunk unnamed-chunk-22" alt="plot of chunk unnamed-chunk-22" style="display: block; margin: auto;" /></p>
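<p>For this particular $A$, the matrix exponential can also be worked out by hand, which gives a useful cross-check. Writing $A = -I + N$ with $N = \begin{pmatrix} 0 & -1 \\ 0 & 0 \end{pmatrix}$ and noting that $N^2 = 0$, the series terminates and $e^{At} = e^{-t}(I + Nt)$. Applied to the initial condition $(1, 1)$, this gives $\mathbf{x}(t) = e^{-t}(1 - t, 1)$. The sketch below checks this against a truncated power series; <em>expm_series</em> and <em>closed_form</em> are illustrative names introduced here, and the series is only a base R stand-in for a proper matrix exponential routine.</p>

```r
# Truncated power series for the matrix exponential (illustrative stand-in)
expm_series <- function(A, terms = 60) {
  result <- diag(nrow(A))
  term <- diag(nrow(A))
  for (k in seq_len(terms)) {
    term <- term %*% A / k   # A^k / k!
    result <- result + term
  }
  result
}

A <- matrix(c(-1, -1, 0, -1), nrow = 2, byrow = TRUE)

# Hand-derived solution for the initial condition (1, 1)
closed_form <- function(t) exp(-t) * c(1 - t, 1)

t <- 2
numeric_solution <- as.vector(expm_series(A * t) %*% c(1, 1))

# Largest absolute difference between series and closed form
max(abs(numeric_solution - closed_form(t)))
```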
<!-- [Strogatz mentions](https://youtu.be/QrHRaA93Nrg?list=PLbN57C5Zdl6j_qJA-pARJnKsmROzPnO9V&t=4404) that such degenerate nodes are rather unlikely in the real world, and that fits with our story since $d = 0$ implies that Juliet does not listen to her heart, which contradicts our assumption that she has matured as a lover. -->
<hr />
<h2 id="references">References</h2>
<ul>
<li>Strogatz, S. H. (<a href="https://www.tandfonline.com/doi/abs/10.1080/0025570X.1988.11977342">1988</a>). Love affairs and differential equations. <em>Mathematics Magazine, 61</em>(1), 35.</li>
<li>Strogatz, S. H. (<a href="http://www.stevenstrogatz.com/books/nonlinear-dynamics-and-chaos-with-applications-to-physics-biology-chemistry-and-engineering">2015</a>). Nonlinear Dynamics and Chaos: With applications to Physics, Biology, Chemistry, and Engineering. Colorado, US: Westview Press.</li>
<li>Nonlinear Dynamics and Chaos Lectures by Steven Strogatz, especially <a href="https://www.youtube.com/watch?v=QrHRaA93Nrg&list=PLbN57C5Zdl6j_qJA-pARJnKsmROzPnO9V&index=5">Lecture 5</a>.</li>
<li>Ryan, O., Kuiper, R. M., & Hamaker, E. L. (<a href="https://link.springer.com/chapter/10.1007/978-3-319-77219-6_2">2018</a>). A continuous time approach to intensive longitudinal data: What, Why and How? In K. v. Montfort, J. H. L. Oud, & M. C. Voelkle (Eds.), <em>Continuous time modeling in the behavioral and related sciences</em>. New York: Springer.</li>
<li>Moler, C., & Van Loan, C. (<a href="https://epubs.siam.org/doi/abs/10.1137/S00361445024180?casa_token=ROT7WzzdP14AAAAA:qedJ1cEiWWcPbjq42eSdeKk7LhoAcJYx4eahw3txUDckZS0QCOJhCXaH2nSsuBViH_i8YwBwxQ">2003</a>). Nineteen dubious ways to compute the exponential of a matrix, twenty-five years later. <em>SIAM review, 45</em>(1), 3-49.</li>
</ul>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>We have used the love affair between Romeo and Juliet to motivate the classification of a system of two linear differential equations. This was the main goal of the blog post. With this classification in mind, however, one could now study love affairs from a more “substantive” point of view; see Strogatz (1988) and Strogatz (2015, p. 143). <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
The Fibonacci sequence and linear algebra 2019-07-28T13:30:00+00:00 https://fabiandablander.com/r/Fibonacci
<p>Leonardo Bonacci, better known as Fibonacci, has influenced our lives profoundly. At the beginning of the $13^{th}$ century, he introduced the Hindu-Arabic numeral system to Europe. Instead of the Roman numbers, where <strong>I</strong> stands for one, <strong>V</strong> for five, <strong>X</strong> for ten, and so on, the Hindu-Arabic numeral system uses position to index magnitude. This leads to much shorter expressions for large numbers.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p>
<p>While the history of the <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">numerical system</a> is fascinating, this blog post will look at what Fibonacci is arguably most well known for: the <em>Fibonacci sequence</em>. In particular, we will use ideas from linear algebra to come up with a closed-form expression of the $n^{th}$ Fibonacci number<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. On our journey to get there, we will also gain some insights about recursion in R.<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup></p>
<h1 id="the-rabbit-puzzle">The rabbit puzzle</h1>
<p>In Liber Abaci, Fibonacci poses the following question (paraphrasing):</p>
<blockquote>
<p>Suppose we have two newly-born rabbits, one female and one male. Suppose these rabbits produce another pair of female and male rabbits after one month. These newly-born rabbits will, in turn, also mate after one month, producing another pair, and so on. Rabbits never die. How many pairs of rabbits exist after one year?</p>
</blockquote>
<p>The Figure below illustrates this process. Every point denotes one rabbit pair over time. To indicate that every newborn rabbit pair needs to wait one month before producing new rabbits, rabbits that are not fertile yet are coloured in grey, while rabbits ready to procreate are coloured in red.</p>
<div style="text-align:center;">
<img src="../assets/img/Fibonacci-Rabbits.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="620" height="720" />
</div>
<p>We can derive a linear recurrence relation that describes the Fibonacci sequence. In particular, note that rabbits never die. Thus, at time point $n$, all rabbits from time point $n - 1$ carry over. Additionally, we know that every fertile rabbit pair will produce a new rabbit pair. However, they have to wait one month, so that the number of fertile rabbit pairs equals the number of rabbit pairs at time point $n - 2$. As a result, the Fibonacci sequence $\{F_n\}_{n=1}^{\infty}$ satisfies:</p>
<script type="math/tex; mode=display">F_n = F_{n-1} + F_{n-2} \enspace ,</script>
<p>for $n \geq 3$ and $F_1 = F_2 = 1$. Before we derive a closed-form expression that computes the $n^{th}$ Fibonacci number directly, in the next section, we play around with alternative, more straightforward solutions in R.</p>
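<p>The counting argument above can also be simulated directly by tracking young and fertile pairs month by month. This is only a sketch of the bookkeeping; the function name <em>simulate_rabbits</em> and the variable names are made up for this illustration.</p>

```r
simulate_rabbits <- function(months) {
  young <- 1     # pairs that are not yet fertile
  fertile <- 0   # pairs that can procreate
  total <- numeric(months)
  for (m in seq_len(months)) {
    total[m] <- young + fertile
    newborn <- fertile           # every fertile pair produces one new pair
    fertile <- fertile + young   # young pairs become fertile after one month
    young <- newborn
  }
  total
}

# Number of rabbit pairs in each of the first twelve months
simulate_rabbits(12)
```

<p>The totals follow the recurrence because the fertile pairs at time point $n$ are exactly the pairs that existed at time point $n - 2$.</p>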
<h1 id="implementation-in-r">Implementation in R</h1>
<p>We can write a wholly inefficient, but beautiful program to compute the $n^{th}$ Fibonacci number:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="c1"># base cases fib(0) = 0 and fib(1) = 1, so that fib(n) matches F_n as defined above</span><span class="w">
</span><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">ifelse</span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">))</span></code></pre></figure>
<p>R takes roughly 5 seconds to compute the $30^{\text{th}}$ Fibonacci number; computing the $40^{\text{th}}$ number exhausts my patience. This recursive solution is not particularly efficient because R executes the function an unnecessary number of times. For example, the call tree for <em>fib(5)</em> is:</p>
<ul>
<li><em>fib(5)</em></li>
<li><em>fib(4)</em> + <em>fib(3)</em></li>
<li>(<em>fib(3)</em> + <em>fib(2)</em>) + (<em>fib(2)</em> + <em>fib(1)</em>)</li>
<li>((<em>fib(2)</em> + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
<li>(((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>) + (<em>fib(1)</em> + <em>fib(0)</em>)) + ((<em>fib(1)</em> + <em>fib(0)</em>) + <em>fib(1)</em>)</li>
</ul>
<p>which shows that <em>fib(2)</em> was called three times. This is not necessary, as we can store the outcome of this function call instead of recomputing it every time. This technique is called <a href="https://en.wikipedia.org/wiki/Memoization">memoization</a> (see also the R package <a href="https://github.com/r-lib/memoise">memoise</a>). Implementing this leads to:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_mem</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">list</span><span class="p">()</span><span class="w">
</span><span class="n">inside</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">%in%</span><span class="w"> </span><span class="nf">as.numeric</span><span class="p">(</span><span class="nf">names</span><span class="p">(</span><span class="n">cache</span><span class="p">))</span><span class="w">
</span><span class="n">fib</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">2</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="nf">return</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w"> </span><span class="k">else</span><span class="w"> </span><span class="k">if</span><span class="w"> </span><span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">inside</span><span class="p">(</span><span class="n">n</span><span class="p">)))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w"> </span><span class="o"><<-</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-1</span><span class="p">)</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="m">-2</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">cache</span><span class="p">[[</span><span class="nf">as.character</span><span class="p">(</span><span class="n">n</span><span class="p">)]]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">fib</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>This computes the $1000^{th}$ Fibonacci number in a tenth of a second. We can, of course, write this sequentially, and also store all intermediate Fibonacci numbers. This also avoids memory issues brought about by the recursive implementation. Interestingly, although this algorithm seems like it should be $O(n)$, it is actually $O(n^2)$ when the numbers are represented exactly, since we are adding increasingly large numbers (for more on this, see <a href="https://catonmat.net/linear-time-fibonacci">here</a>).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_seq</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">num</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="c1"># skip the first two entries; the loop is empty when n is 1 or 2</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq_len</span><span class="p">(</span><span class="n">n</span><span class="p">)[</span><span class="o">-</span><span class="p">(</span><span class="m">1</span><span class="o">:</span><span class="m">2</span><span class="p">)])</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">num</span><span class="p">[</span><span class="n">i</span><span class="m">-2</span><span class="p">]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">num</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The first 30 Fibonacci numbers are: 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657, 46368, 75025, 121393, 196418, 317811, 514229, 832040.</p>
<p>This is a rapid increase, as made apparent by the left Figure below. The Figure on the right shows that there is structure in how the sequence grows.</p>
<p><img src="/assets/img/2019-07-28-Fibonacci.Rmd/unnamed-chunk-8-1.png" title="plot of chunk unnamed-chunk-8" alt="plot of chunk unnamed-chunk-8" style="display: block; margin: auto;" /></p>
<p>We will return to the structure in growth at the end of the blog post. First, we need to derive a closed-form expression of the $n^{th}$ Fibonacci number. In the next section, we take a step towards that by realizing that diagonal matrices make for easier computations.</p>
<h1 id="diagonal-matrices-are-good">Diagonal matrices are good</h1>
<p>Our goal is to get a closed form expression of the $n^{th}$ Fibonacci number. The first thing to note is that, due to linear recursion, we can view the Fibonacci numbers as applying a linear map. In particular, define $T \in \mathcal{L}(\mathbb{R}^2)$ by:</p>
<script type="math/tex; mode=display">T(x, y) = (y, x + y) \enspace .</script>
<p>We note that:</p>
<script type="math/tex; mode=display">T^n(0, 1) = (F_n, F_{n+1}) \enspace ,</script>
<p>which we will prove by induction. In particular, note that the base case $n = 1$:</p>
<script type="math/tex; mode=display">T^1(0, 1) = (1, 0 + 1) = (1, 1) = (F_1, F_2) \enspace ,</script>
<p>does in fact give the first two Fibonacci numbers. Now for the induction step: we assume that this holds for an arbitrary $n$, and we show that it holds for $n + 1$ using the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T^n(0, 1) &= (F_n, F_{n+1}) \\[1em]
T(T^n(0, 1)) &= T(F_n, F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_n + F_{n+1}) \\[1em]
T^{n+1}(0, 1) &= (F_{n+1}, F_{n+2}) \enspace .
\end{aligned} %]]></script>
<p>The last equality follows from the definition of the Fibonacci sequence, i.e., the fact that any number is equal to the sum of the previous two numbers. The matrix of this linear map with respect to the standard basis is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
A \equiv \mathcal{M}(T) = \begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \enspace , %]]></script>
<p>since $T(1, 0) = (0, 1)$ and $T(0, 1) = (1, 1)$. Observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} y \\ x + y \end{pmatrix} \enspace . %]]></script>
<p>In the sequential R code for computing the Fibonacci numbers, we have applied the linear map $n$ times, which gave us the Fibonacci number we were interested in. We can write this in matrix form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>If you were to compute, say, the $3^{rd}$ Fibonacci number using this equation, you would have to compute $A^3$, that is, multiply three copies of $A$ together. Now assume you had something like:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Using the above equation, the matrix powers would become trivial:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{pmatrix}^n = \begin{pmatrix} \lambda_1^n & 0 \\ 0 & \lambda_2^n \end{pmatrix} \enspace . %]]></script>
<p>There would be no need to repeatedly engage in matrix multiplication; instead, we would arrive at the $n^{th}$ Fibonacci number using only scalar multiplication! Our task is thus as follows: find a new matrix for the linear map which is diagonal. To solve this, we will need eigenvalues and eigenvectors.</p>
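<p>The matrix form above can be checked numerically with a few lines of R. The sketch below applies the matrix of $T$ repeatedly to $(0, 1)$ and reads off the first coordinate; <em>fib_matrix</em> is an illustrative name, and the diagonal matrix at the end is a made-up example that shows why diagonal matrices make powers cheap.</p>

```r
A <- matrix(c(0, 1, 1, 1), nrow = 2, byrow = TRUE)

# Apply the linear map n times to (0, 1); the first coordinate is F_n
fib_matrix <- function(n) {
  v <- c(0, 1)
  for (i in seq_len(n)) v <- as.vector(A %*% v)
  v[1]
}

sapply(1:10, fib_matrix)

# Powers of a diagonal matrix only require powers of its diagonal entries
D <- diag(c(2, 3))   # hypothetical diagonal matrix
D %*% D %*% D        # equals diag(c(2^3, 3^3))
```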
<h1 id="finding-eigenvalues-and-eigenvectors">Finding eigenvalues and eigenvectors</h1>
<p>An eigenvector-eigenvalue pair $(v, \lambda)$ satisfies for $v \neq 0$ that:</p>
<script type="math/tex; mode=display">Tv = \lambda v \enspace ,</script>
<p>which means that for a particular vector $v$, the linear map only stretches the vector by a constant $\lambda$. Here’s the key: using the eigenvectors as basis, the matrix of the linear map is diagonal. This is because the entries of the matrix of a linear map with respect to a basis $(v_1, v_2)$, here denoted $A$, are defined by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= A_{11} v_1 + A_{21} v_2 \\
Tv_2 &= A_{12} v_1 + A_{22} v_2 \enspace .
\end{aligned} %]]></script>
<p>Now since the basis consists only of eigenvectors, we know that $Tv_1 = \lambda_1 v_1$ and $Tv_2 = \lambda_2 v_2$, which implies that $A_{11} = \lambda_1$ and $A_{21} = 0$, as well as $A_{12} = 0$ and $A_{22} = \lambda_2$. For a wonderful explanation of eigenvalues and eigenvectors, see <a href="https://www.youtube.com/watch?v=PFDu9oVAE-g">this video</a> by 3Blue1Brown.<sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup></p>
<p>In order to find the eigenvalues and eigenvectors, note that the linear map satisfies the following two equations:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
T(x, y) &= \lambda (x, y) \\[1em]
T(x, y) &= (y, x + y) \enspace .
\end{aligned} %]]></script>
<p>This leads to:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= y \\[1em]
\lambda y &= x + y \enspace .
\end{aligned} %]]></script>
<p>We substitute the first expression into the second one, yielding:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 x &= x + y \\[1em]
(\lambda^2 - 1)x &= y \enspace ,
\end{aligned} %]]></script>
<p>which we now substitute into the first equation, which results in:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda x &= (\lambda^2 - 1)x\\[1em]
0 &= \lambda^2 - \lambda - 1\enspace .
\end{aligned} %]]></script>
<p>We can now apply the <em>quadratic formula</em> or “Mitternachtsformel”, as it is called in parts of Germany because students should know the formula when they are roused from sleep at midnight. We are neither in Germany, nor is it midnight, nor can I actually remember the formula, so let’s quickly derive it for our problem:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lambda^2 - \lambda - 1 &= 0 \\[1em]
\lambda^2 - \lambda &= 1 \\[1em]
4\lambda^2 - 4\lambda &= 4 \\[1em]
4\lambda^2 - 4\lambda + 1&= 4 + 1 \\[1em]
(2\lambda - 1)^2&= 4 + 1 \\[1em]
2\lambda - 1 &= \pm \sqrt{4 + 1} \\[1em]
\lambda &= \frac{1 \pm \sqrt{5}}{2} \enspace .
\end{aligned} %]]></script>
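<p>As a quick cross-check of this derivation, the roots of $\lambda^2 - \lambda - 1$ and the eigenvalues of the matrix of $T$ can be computed numerically with base R. The variable names below are chosen for this illustration only.</p>

```r
A <- matrix(c(0, 1, 1, 1), nrow = 2, byrow = TRUE)
closed_form <- c((1 + sqrt(5)) / 2, (1 - sqrt(5)) / 2)

# Roots of lambda^2 - lambda - 1; polyroot takes coefficients in increasing order of power
roots <- sort(Re(polyroot(c(-1, -1, 1))), decreasing = TRUE)

# Eigenvalues of the matrix of T with respect to the standard basis
eigenvalues <- eigen(A)$values

all.equal(roots, closed_form)
all.equal(eigenvalues, closed_form)
```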
<p>Now that we have found both eigenvalues, we go hunting for the eigenvectors! We substitute the eigenvalues into the equations from above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\frac{1 \pm \sqrt{5}}{2} x &= y \\[1em]
\frac{1 \pm \sqrt{5}}{2} y &= x + y \enspace .
\end{aligned} %]]></script>
<p>If we set $x = 1$, then $y = \frac{1 \pm \sqrt{5}}{2}$. Thus, two eigenvectors are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
v_1 &= \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
v_2 &= \left(1, \frac{1 - \sqrt{5}}{2}\right) \enspace .
\end{aligned} %]]></script>
<p>As a sanity check to see whether this is indeed true, we check whether $Tv_1 = \lambda_1 v_1$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
Tv_1 &= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2}\right) \\[1em]
\lambda_1 v_1 &= \frac{1 + \sqrt{5}}{2} \left(1, \frac{1 + \sqrt{5}}{2}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \left(\frac{1 + \sqrt{5}}{2}\right)^2\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{1 + 2\sqrt{5} + 5}{4}\right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, \frac{3}{2} + \frac{\sqrt{5}}{2} \right) \\[1em]
&= \left(\frac{1 + \sqrt{5}}{2}, 1 + \frac{1 + \sqrt{5}}{2} \right) \enspace ,
\end{aligned} %]]></script>
<p>which shows that the two expressions are equal. Moreover, the dot product of the two eigenvectors is zero, which means that they are orthogonal and hence linearly independent (as they should be). In the next section, we will find that <a href="https://en.wikipedia.org/wiki/Map%E2%80%93territory_relation">the same territory can be described by different maps</a>.</p>
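<p>The same sanity check can be run numerically for both eigenvectors. In the sketch below, <em>apply_T</em> is an illustrative helper that implements $T(x, y) = (y, x + y)$; it is named this way because <em>T</em> is already shorthand for <em>TRUE</em> in R.</p>

```r
apply_T <- function(v) c(v[2], v[1] + v[2])   # T(x, y) = (y, x + y)

lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
v1 <- c(1, lambda1)
v2 <- c(1, lambda2)

# T v = lambda v for both eigenpairs
all.equal(apply_T(v1), lambda1 * v1)
all.equal(apply_T(v2), lambda2 * v2)

# Dot product of the two eigenvectors; zero up to floating-point error
sum(v1 * v2)
```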
<h1 id="change-of-basis">Change of basis</h1>
<p>Now that we have found the eigenvalues and eigenvectors, we can create the matrix $D$ of the linear map $T$ which is diagonal with respect to the basis of eigenvectors:</p>
<script type="math/tex; mode=display">% <![CDATA[
D = \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace . %]]></script>
<p>We are not done yet, however. Note that $D$ is the matrix of the linear map $T$ with respect to the basis that consists of both eigenvectors $v_1$ and $v_2$, <em>not</em> with respect to the standard basis. We have changed our coordinate system — our map — as indicated by the Figure below; the black coloured vectors are the standard basis vectors while the vectors coloured in red are our new basis vectors.</p>
<!-- <div style = "float: left; padding: 10px 10px 10px 0px;"> -->
<!-- ![](/assets/img/change-of-basis.png) -->
<div style="text-align:center;">
<img src="../assets/img/change-of-basis.png" align="center" style="padding-top: 10px; padding-bottom: 10px;" />
</div>
<!-- </div> -->
<p>To build some intuition, let’s play around with representing $\omega$ in both the standard basis and our new eigenbasis. Any vector is a linear combination of the basis vectors. Let $a_1$ and $a_2$ be the coefficients for the standard basis such that:</p>
<script type="math/tex; mode=display">\omega = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>Now because I have drawn it earlier, I know that $a_1 = -1$ and $a_2 = 0.3$. This is the representation of $\omega$ in the standard basis. How do we represent it in our eigenbasis? Well, using the eigenbasis the vector $\omega$ is still a linear combination of the basis vectors, but with different coefficients; denote them as $b_1$ and $b_2$. We thus have:</p>
<script type="math/tex; mode=display">\omega = b_1 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} + b_2 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} = a_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + a_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \enspace .</script>
<p>If we write this in matrix form, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix}\\[1em]
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_1 \\ a_2 \end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>Thus, we can represent a vector $a$ with basis $S$ in our new basis $E$ by computing:</p>
<script type="math/tex; mode=display">b = E^{-1} S \, a \enspace .</script>
<p>In our eigenbasis, the vector $\omega$ has the coordinates:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">lambda1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">lambda2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="p">(</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="m">2</span><span class="w">
</span><span class="n">S</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="m">2</span><span class="p">)</span><span class="w">
</span><span class="n">a</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">-1</span><span class="p">,</span><span class="w"> </span><span class="m">.3</span><span class="p">)</span><span class="w">
</span><span class="n">E</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cbind</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda1</span><span class="p">),</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">S</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">a</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] -0.1422291
## [2,] -0.8577709</code></pre></figure>
<p>This means we have the representation:</p>
<script type="math/tex; mode=display">\omega = -0.14 \begin{pmatrix} 1 \\ \frac{1 + \sqrt{5}}{2} \end{pmatrix} - 0.86 \begin{pmatrix} 1 \\ \frac{1 - \sqrt{5}}{2} \end{pmatrix} \enspace ,</script>
<p>which makes intuitive sense when you look at the Figure above. For another beautiful linear algebra video by 3Blue1Brown, this time about changing bases, see <a href="https://www.youtube.com/watch?v=P2LTAUO1TdA&t=598s">here</a>. In the next section, we will use what we have learned above to express the $n^{th}$ Fibonacci number in closed-form.</p>
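<p>As a sanity check on the change of basis, the coordinates can be mapped back: multiplying the eigenbasis coordinates $b$ by $E$ should recover the standard-basis coordinates $a$. The sketch below redefines the objects from the snippet above so that it runs on its own. Using <em>solve(E, a)</em> rather than <em>solve(E) %*% a</em> is a choice made for this sketch, and $S$ is dropped because it is the identity matrix.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r">lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
a <- c(-1, .3)                             # omega in the standard basis
E <- cbind(c(1, lambda1), c(1, lambda2))   # eigenvectors as columns
b <- solve(E, a)                           # solves E b = a without forming the inverse
a_back <- as.vector(E %*% b)               # map the coordinates back to the standard basis
all.equal(a_back, a)                       # comparison up to floating-point tolerance</code></pre></figure>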
<h1 id="closed-form-fibonacci">Closed-form Fibonacci</h1>
<p>Recall from above that our solution to finding the $n^{th}$ Fibonacci number in matrix form is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} 0 & 1 \\ 1 & 1\end{pmatrix}^n \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Now, we have swapped the non-diagonal matrix $A$ with the diagonal matrix $D$ by changing the basis from the standard basis to the eigenbasis. However, the vector $(0, 1)^T$ is still in the standard basis! In order to change its representation to the eigenbasis, we multiply it with $E^{-1}$, as discussed above. We write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^{-1} \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} \enspace . %]]></script>
<p>Let’s use this to compute, say, the $10^{th}$ Fibonacci number (which is 55) in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">D</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">diag</span><span class="p">(</span><span class="nf">c</span><span class="p">(</span><span class="n">lambda1</span><span class="p">,</span><span class="w"> </span><span class="n">lambda2</span><span class="p">))</span><span class="w">
</span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55.003636123
## [2,] -0.003636123</code></pre></figure>
<p>Ha! This didn’t quite work, did it? We got the answer for $F_{10}$ roughly when rounding, but $F_{11}$ is completely off. What did we miss? Well, this is in fact the correct answer — it is just in the wrong basis! We have to convert this from the eigenbasis to the standard basis. To do this, observe that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
b &= E^{-1} S \, a \\
E b &= S \, a \\
E b &= a \enspace ,
\end{aligned} %]]></script>
<p>since $S$ is the identity matrix. Thus, all we have to do is to multiply with $E$:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">E</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">D</span><span class="o">^</span><span class="m">10</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="n">solve</span><span class="p">(</span><span class="n">E</span><span class="p">)</span><span class="w"> </span><span class="o">%*%</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [,1]
## [1,] 55
## [2,] 89</code></pre></figure>
<p>which is the correct solution. To get the closed-form solution algebraically, we first invert the matrix $E$:</p>
<script type="math/tex; mode=display">% <![CDATA[
E^{-1} = -\frac{1}{\sqrt{5}} \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix} \enspace , %]]></script>
<p>and we write:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\begin{pmatrix} F_n \\ F_{n+1} \end{pmatrix} &= \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \left(-\frac{1}{\sqrt{5}}\right) \begin{pmatrix} \frac{1 - \sqrt{5}}{2} & -1 \\ - \frac{1 + \sqrt{5}}{2} & 1\end{pmatrix}\begin{pmatrix} 0 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} \frac{1 + \sqrt{5}}{2} & 0 \\ 0 & \frac{1 - \sqrt{5}}{2} \end{pmatrix}^n \begin{pmatrix} -1 \\ 1 \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} 1 & 1 \\ \frac{1 + \sqrt{5}}{2} & \frac{1 - \sqrt{5}}{2} \end{pmatrix} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n \\ \left(\frac{1 - \sqrt{5}}{2}\right)^n \end{pmatrix} \\[1em]
&= -\frac{1}{\sqrt{5}} \begin{pmatrix} -\left(\frac{1 + \sqrt{5}}{2}\right)^n + \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ -\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} + \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \\[1em]
&= \frac{1}{\sqrt{5}} \begin{pmatrix} \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \\ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \end{pmatrix} \enspace .
\end{aligned} %]]></script>
<p>The closed-form expression of the $n^{th}$ Fibonacci number is thus given by:</p>
<script type="math/tex; mode=display">F_n = \frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right] \enspace .</script>
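<p>As a quick check by hand, write $\lambda_1 = \frac{1 + \sqrt{5}}{2}$ and $\lambda_2 = \frac{1 - \sqrt{5}}{2}$, so that $\lambda_1 - \lambda_2 = \sqrt{5}$ and $\lambda_1 + \lambda_2 = 1$. The first two Fibonacci numbers then come out as expected:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
F_1 &= \frac{\lambda_1 - \lambda_2}{\sqrt{5}} = \frac{\sqrt{5}}{\sqrt{5}} = 1 \\[1em]
F_2 &= \frac{\lambda_1^2 - \lambda_2^2}{\sqrt{5}} = \frac{(\lambda_1 - \lambda_2)(\lambda_1 + \lambda_2)}{\sqrt{5}} = \frac{\sqrt{5} \cdot 1}{\sqrt{5}} = 1 \enspace .
\end{aligned} %]]></script>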
<p>We verify this in R:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_closed</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="m">1</span><span class="o">/</span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">)</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="p">((</span><span class="m">1</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">print</span><span class="p">(</span><span class="n">fib_closed</span><span class="p">(</span><span class="n">seq</span><span class="p">(</span><span class="m">30</span><span class="p">)))</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 1 1 2 3 5 8 13 21 34 55
## [11] 89 144 233 377 610 987 1597 2584 4181 6765
## [21] 10946 17711 28657 46368 75025 121393 196418 317811 514229 832040</code></pre></figure>
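<p>One caveat about the R snippets in this section: in R, <em>^</em> applied to a matrix works elementwise. The expression <em>D^10</em> therefore coincides with the tenth matrix power only because $D$ is diagonal; for the non-diagonal matrix $A$ the same shortcut would give the wrong result. The sketch below compares the eigenbasis route with plain repeated multiplication. The helper <em>mat_pow</em> is hypothetical and introduced only for this illustration.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper (not from the post): matrix power by repeated multiplication
mat_pow <- function(M, n) {
  out <- diag(nrow(M))
  for (i in seq_len(n)) out <- out %*% M
  out
}

lambda1 <- (1 + sqrt(5)) / 2
lambda2 <- (1 - sqrt(5)) / 2
A <- matrix(c(0, 1, 1, 1), nrow = 2)       # matrix of the linear map in the standard basis
E <- cbind(c(1, lambda1), c(1, lambda2))   # eigenvectors as columns
D <- diag(c(lambda1, lambda2))             # diagonal matrix of eigenvalues

direct <- mat_pow(A, 10) %*% c(0, 1)                 # repeated multiplication with A
eigen_route <- E %*% D^10 %*% solve(E) %*% c(0, 1)   # elementwise ^ is fine for diagonal D
all.equal(direct, eigen_route)                       # agreement up to floating-point tolerance</code></pre></figure>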
<h1 id="the-golden-ratio">The golden ratio</h1>
<p>In the above section, we have derived a closed-form expression of the $n^{th}$ Fibonacci number. In this section, we return to an observation we have made at the beginning: there is structure in how the Fibonacci numbers grow. Johannes Kepler, after whom the university in my home town is named, (re)discovered that:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\lim_{n \rightarrow \infty} \frac{F_{n+1}}{F_n} &= \lim_{n \rightarrow \infty} \frac{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1} \right]}{\frac{1}{\sqrt{5}} \left[ \left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n \right]} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1} - \left(\frac{1 - \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n - \left(\frac{1 - \sqrt{5}}{2}\right)^n} \\[1em]
&= \lim_{n \rightarrow \infty} \frac{\left(\frac{1 + \sqrt{5}}{2}\right)^{n+1}}{\left(\frac{1 + \sqrt{5}}{2}\right)^n} \\[1em]
&= \frac{1 + \sqrt{5}}{2} \approx 1.618 \enspace ,
\end{aligned} %]]></script>
<p>which is the <a href="https://en.wikipedia.org/wiki/Golden_ratio">golden ratio</a>. Two quantities are in the golden ratio $\phi$ if their ratio equals the ratio of their sum to the larger of the two, i.e., for $a > b > 0$:</p>
<script type="math/tex; mode=display">\phi \equiv \frac{a}{b} = \frac{a + b}{a} \enspace .</script>
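<p>To connect this definition to the limit above, divide the numerator and denominator of $\frac{a + b}{a}$ by $b$, which gives $\phi = \frac{\phi + 1}{\phi}$. Solving for $\phi$ is a short worked step:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{aligned}
\phi &= 1 + \frac{1}{\phi} \\[1em]
\phi^2 - \phi - 1 &= 0 \\[1em]
\phi &= \frac{1 \pm \sqrt{5}}{2} \enspace .
\end{aligned} %]]></script>
<p>Since $a > b > 0$ requires $\phi > 1$, only the positive root $\frac{1 + \sqrt{5}}{2}$ qualifies, so the limiting ratio of consecutive Fibonacci numbers is exactly the golden ratio.</p>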
<p>We have observed this empirically in the first Figure, which visualized the differences in the log of two consecutive Fibonacci numbers, and which yielded already for small $n$:</p>
<script type="math/tex; mode=display">\text{log} \, F_{n+1} - \text{log} \, F_n = \text{log} \, \frac{F_{n + 1}}{F_n} \approx 0.4812 \enspace ,</script>
<p>which exponentiated yields the golden ratio. Observe that $\left(\frac{1 - \sqrt{5}}{2}\right)^n$ goes to zero very quickly as $n$ grows so that we can compute the $n^{th}$ Fibonacci number by:</p>
<script type="math/tex; mode=display">F_n = \left \lfloor \frac{1}{\sqrt{5}} \phi^n \right \rceil \enspace ,</script>
<p>where we simply round to the nearest integer. To finally answer Fibonacci’s puzzle:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">fib_golden</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="nf">round</span><span class="p">(((</span><span class="m">1</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="o">/</span><span class="m">2</span><span class="p">)</span><span class="o">^</span><span class="n">n</span><span class="w"> </span><span class="o">/</span><span class="w"> </span><span class="nf">sqrt</span><span class="p">(</span><span class="m">5</span><span class="p">))</span><span class="w">
</span><span class="n">fib_golden</span><span class="p">(</span><span class="m">12</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## [1] 144</code></pre></figure>
<p>After a mere twelve months of incest, there are 144 rabbit pairs!<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup></p>
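<p>The rounding step is exact in principle for every $n \geq 0$, not only for large $n$, because the term we dropped from the closed-form expression is always smaller than one half in absolute value:</p>
<script type="math/tex; mode=display">% <![CDATA[
\left| \frac{1}{\sqrt{5}} \left(\frac{1 - \sqrt{5}}{2}\right)^n \right| \leq \frac{1}{\sqrt{5}} \approx 0.447 < \frac{1}{2} \enspace , %]]></script>
<p>which follows from $\left|\frac{1 - \sqrt{5}}{2}\right| \approx 0.618 < 1$. In floating-point arithmetic this guarantee only holds as long as $\phi^n$ is computed accurately enough, which is the limitation described in the footnote.</p>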
<p>There are various <a href="https://en.wikipedia.org/wiki/Generalizations_of_Fibonacci_numbers">generalizations</a> of the Fibonacci sequence. One such generalization is to allow higher orders $k$ in the sequence, which for $k = 3$ is known as the <a href="https://www.youtube.com/watch?v=fMJflV_GUpU">Tribonacci sequence</a>. Our approach for $k = 2$ can be straightforwardly generalized to account for any order $k$ (if you want to go down a rabbit hole, see for example <a href="https://math.stackexchange.com/questions/41667/fibonacci-tribonacci-and-other-similar-sequences">this</a>).</p>
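<p>A minimal sketch of the $k = 3$ case, under stated assumptions: the recurrence is taken to be $T_n = T_{n-1} + T_{n-2} + T_{n-3}$, and the start values $(0, 0, 1)$ are one common convention rather than something the post fixes. The matrix <em>A3</em> plays the role that $A$ played for $k = 2$, and the ratio of consecutive terms should approach the eigenvalue of largest modulus, just as the Fibonacci ratio approaches $\phi$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Companion matrix of T_n = T_{n-1} + T_{n-2} + T_{n-3}; it maps
# (T_n, T_{n+1}, T_{n+2}) to (T_{n+1}, T_{n+2}, T_{n+3})
A3 <- rbind(c(0, 1, 0),
            c(0, 0, 1),
            c(1, 1, 1))

state <- c(0, 0, 1)   # assumed start values; conventions differ
for (i in seq_len(30)) state <- as.vector(A3 %*% state)

growth_ratio <- state[3] / state[2]        # ratio of two consecutive terms
dominant <- max(Mod(eigen(A3)$values))     # eigenvalue of largest modulus
c(growth_ratio, dominant)</code></pre></figure>
<p>Two of the three eigenvalues are complex here, which is why the sketch compares moduli instead of writing out a closed form.</p>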
<h1 id="conclusion">Conclusion</h1>
<p>In this blog post, we have taken a detailed look at the Fibonacci sequence. In particular, we saw that it is the answer to a puzzle about procreating rabbits, and how to speed up a recursive algorithm for finding the $n^{th}$ Fibonacci number. We then used ideas from linear algebra to arrive at a closed-form expression of the $n^{th}$ Fibonacci number. Specifically, we have noted that the Fibonacci sequence is a linear recurrence relation — it can be viewed as repeatedly applying a linear map. With this insight, we observed that the matrix of the linear map is non-diagonal, which makes repeated execution tedious; diagonal matrices, on the other hand, are easy to multiply. We arrived at a diagonal matrix by changing the basis from the standard basis to the basis of eigenvectors, which led to a diagonal matrix of eigenvalues for the linear map. With this representation, the $n^{th}$ Fibonacci number is available in closed-form. In order to get it into the standard basis, we had to change basis back from the eigenbasis. We also saw how the Fibonacci numbers relate to the golden ratio $\phi$.</p>
<hr />
<p>I would like to thank Don van den Bergh, Jonas Haslbeck, and Sophia Crüwell for helpful comments on this blog post.</p>
<hr />
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>This is the main reason why the Hindu-Arabic numeral system took over. The belief that it is easier to multiply and divide using Hindu-Arabic numerals is <a href="https://thonyc.wordpress.com/2017/02/10/the-widespread-and-persistent-myth-that-it-is-easier-to-multiply-and-divide-with-hindu-arabic-numerals-than-with-roman-ones/">incorrect</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>This blog post is inspired by exercise 16 on p. 161 in <a href="http://linear.axler.net/">Linear Algebra Done Right</a>. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:3">
<p>I have learned that there is already (very good) ink spilled on this topic, see for example <a href="https://bosker.wordpress.com/2011/04/29/the-worst-algorithm-in-the-world/">here</a> and <a href="https://bosker.wordpress.com/2011/07/27/computing-fibonacci-numbers-using-binet%E2%80%99s-formula/">here</a>. A nice essay is also <a href="https://opinionator.blogs.nytimes.com/2012/09/24/proportion-control/?mtrref=undefined&gwh=C0500419D79A9E5B64F17ABC970C5125&gwt=pay">this</a> piece by Steve Strogatz, who, by the way, wrote a wonderful book called <a href="https://www.goodreads.com/book/show/354421.Sync">Sync</a>. He’s also been on Sean Carroll’s Mindscape podcast, listen <a href="https://www.preposterousuniverse.com/podcast/2019/04/08/episode-41-steven-strogatz-on-synchronization-networks-and-the-emergence-of-complex-behavior/">here</a>. <a href="#fnref:3" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>If you forget everything that is written in this blog post, but through it were made aware of the videos by 3Blue1Brown (or <a href="https://www.numberphile.com/podcast/3blue1brown">Grant Sanderson</a>, as he is known in the real world), then I consider this blog post a success. <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>The downside of the closed-form solution is that it is difficult to calculate the power of the square root with high accuracy. In fact, <em>fib_golden</em> is incorrect for $n > 70$. Our <em>fib_mem</em> implementation is also incorrect, but only for $n > 93$. (I’ve compared it against Fibonacci numbers calculated from <a href="https://www.miniwebtool.com/list-of-fibonacci-numbers/?number=100">here</a>). <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>Fabian Dablander
Spurious correlations and random walks2019-06-29T10:00:00+00:002019-06-29T10:00:00+00:00https://fabiandablander.com/r/Spurious-Correlation<p>The number of storks and the number of human babies delivered are positively correlated (Matthews, 2000). This is a classic example of a spurious correlation which has a causal explanation: a third variable, say economic development, is likely to cause both an increase in storks and an increase in the number of human babies, hence the correlation.<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup> In this blog post, I discuss a more subtle case of spurious correlation, one that is not of causal but of statistical nature: <em>completely independent processes can be correlated substantially</em>.</p>
<h2 id="ar1-processes-and-random-walks">AR(1) processes and random walks</h2>
<p>Moods, stock markets, the weather: everything changes, everything is in flux. The simplest model to describe change is an auto-regressive (AR) process of order one. Let $Y_t$ be a random variable where $t = 1, \ldots, T$ indexes discrete time. We write an AR(1) process as:</p>
<script type="math/tex; mode=display">Y_t = \phi \, Y_{t-1} + \epsilon_t \enspace ,</script>
<p>where $\phi$ gives the correlation with the previous observation, and where $\epsilon_t \sim \mathcal{N}(0, \sigma^2)$. For $\phi = 1$ the process is called a <em>random walk</em>. We can simulate from these using the following code:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">simulate_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">rep</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">t</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">phi</span><span class="o">*</span><span class="n">y</span><span class="p">[</span><span class="n">t</span><span class="m">-1</span><span class="p">]</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">rnorm</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">y</span><span class="w">
</span><span class="p">}</span></code></pre></figure>
<p>The following R code simulates data from three independent random walks and an AR(1) process with $\phi = 0.5$; the Figure below visualizes them.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw1</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw2</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">rw3</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.5</span><span class="p">)</span></code></pre></figure>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-3-1.png" title="plot of chunk unnamed-chunk-3" alt="plot of chunk unnamed-chunk-3" style="display: block; margin: auto;" /></p>
<p>As we can see from the plot, the AR(1) process seems pretty well-behaved. This is in contrast to the three random walks: all of them have an initial upwards trend, after which the red line keeps on growing, while the blue line makes a downward jump. In contrast to AR(1) processes with $|\phi| < 1$, random walks are <em>not stationary</em> since their variance grows over time. For some very good lecture notes on time-series analysis, see <a href="https://www.economodel.com/time-series-analysis">here</a>.</p>
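<p>The difference in variance can be made explicit with a short derivation. As an assumption for this sketch, let the process start at $Y_0 = 0$ and let the $\epsilon_t$ be independent (the simulation code above instead fixes the first observation at zero, which shifts the index by one). Unrolling the recursion gives $Y_t = \sum_{j=0}^{t-1} \phi^j \epsilon_{t-j}$, and therefore:</p>
<script type="math/tex; mode=display">% <![CDATA[
\text{Var}(Y_t) = \sigma^2 \sum_{j=0}^{t-1} \phi^{2j} =
\begin{cases}
t \, \sigma^2 & \text{if } \phi = 1 \\[0.5em]
\sigma^2 \, \frac{1 - \phi^{2t}}{1 - \phi^2} & \text{if } |\phi| < 1 \enspace .
\end{cases} %]]></script>
<p>For a random walk the variance grows linearly in $t$ without bound, whereas for $|\phi| < 1$ it converges to $\frac{\sigma^2}{1 - \phi^2}$.</p>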
<h2 id="spurious-correlations-of-random-walks">Spurious correlations of random walks</h2>
<p>If we look at the correlations of these three random walks across time points, we find that they are substantial:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="nf">round</span><span class="p">(</span><span class="n">cor</span><span class="p">(</span><span class="n">cbind</span><span class="p">(</span><span class="n">red</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw1</span><span class="p">,</span><span class="w"> </span><span class="n">green</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw2</span><span class="p">,</span><span class="w"> </span><span class="n">blue</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">rw3</span><span class="p">)),</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## red green blue
## red 1.00 -0.49 -0.29
## green -0.49 1.00 0.59
## blue -0.29 0.59 1.00</code></pre></figure>
<p>I hope that this is at least a little bit of a shock. Upon reflection, however, it is clear that we are blundering: computing the correlation across time ignores the dependency between data points that is so typical of time-series data. To get more data about what is going on, we conduct a small simulation study.</p>
<p>In particular, we want to get an intuition of how this spurious correlation behaves with increasing sample sizes. We therefore simulate two independent random walks for sample sizes $n \in \{50, 100, 200, 500, 1000, 2000\}$ and compute their Pearson correlation, the test statistic, and whether $p < \alpha$, where we set $\alpha$ to an arbitrary value, say $\alpha = 0.05$. We repeat this 100 times and report the average of these quantities.</p>
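<p>The post does not spell out the test statistic; presumably it is the usual $t$ statistic for a Pearson correlation $r$ based on $n$ observations, which is what R's <em>cor.test</em> reports by default:</p>
<script type="math/tex; mode=display">t = \frac{r \sqrt{n - 2}}{\sqrt{1 - r^2}} \enspace .</script>
<p>Under the null hypothesis of no correlation and with independent observations, this statistic is compared against a $t$ distribution with $n - 2$ degrees of freedom. Independence of observations is precisely the assumption that time-series data violate.</p>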
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">library</span><span class="p">(</span><span class="s1">'dplyr'</span><span class="p">)</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">50</span><span class="p">,</span><span class="w"> </span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">5</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">1</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">tab</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">n</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w"> </span><span class="n">data.frame</span><span class="w">
</span><span class="nf">round</span><span class="p">(</span><span class="n">tab</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## n avg_abs_corr avg_abs_tstat percent_sig
## 1 50 0.41 3.57 0.71
## 2 100 0.46 6.58 0.85
## 3 200 0.45 8.88 0.85
## 4 500 0.37 10.63 0.86
## 5 1000 0.41 17.05 0.88
## 6 2000 0.39 23.39 0.97</code></pre></figure>
<p>We observe that the average absolute correlation is very similar across $n$, but the test statistic grows with increased $n$, which naturally results in many more false rejections of the null hypothesis of no correlation between the two random walks.</p>
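<p>The link between the two columns can be made explicit. For a sample correlation $r$ computed from $n$ pairs, the test statistic of the Pearson correlation test is</p>
<script type="math/tex; mode=display">t = r \sqrt{\frac{n - 2}{1 - r^2}} \enspace ,</script>
<p>so if $\vert r \vert$ hovers around a fixed value, $\vert t \vert$ grows on the order of $\sqrt{n}$. As a purely illustrative calculation, plugging in $r = 0.4$ gives $t \approx 3.0$ for $n = 50$ and $t \approx 19.5$ for $n = 2000$. The simulated averages in the table differ somewhat from these values because $t$ is a nonlinear function of $r$.</p>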
<p>To my knowledge, Granger and Newbold (1974) were the first to point out this puzzling fact.<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup> They regress one random walk onto the other instead of computing the Pearson correlation (the test statistic for the slope is the same as the one for the correlation). In a regression setting, we write:</p>
<script type="math/tex; mode=display">Y = \beta_0 + \beta_1 X + \epsilon \enspace ,</script>
<p>where we assume that the errors are independent with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ (see also a <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">previous</a> blog post). This assumption is evidently violated when performing linear regression on two random walks, as demonstrated by the residual plot below.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-6-1.png" title="plot of chunk unnamed-chunk-6" alt="plot of chunk unnamed-chunk-6" style="display: block; margin: auto;" /></p>
<p>As above, we can specify an AR(1) process for the residuals:</p>
<script type="math/tex; mode=display">\epsilon_t = \delta \epsilon_{t-1} + \eta_t \enspace ,</script>
<p>and test whether $\delta = 0$. We can do so using the <a href="https://en.wikipedia.org/wiki/Durbin%E2%80%93Watson_statistic">Durbin-Watson test</a>, which yields:</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">car</span><span class="o">::</span><span class="n">durbinWatsonTest</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></code></pre></figure>
<figure class="highlight"><pre><code class="language-text" data-lang="text">## lag Autocorrelation D-W Statistic p-value
## 1 0.9357562 0.08623868 0
## Alternative hypothesis: rho != 0</code></pre></figure>
<p>This indicates substantial autocorrelation, violating our modeling assumption of independent residuals. In the next section, we look at the deeper mathematical reasons for why we get such spurious correlation. In the Post Scriptum, we relax the constraint that $\phi = 1$ and look at how spurious correlation behaves for AR(1) processes.</p>
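<p>The diagnostics in this section can be sketched in a few self-contained lines. The model object <em>m</em> passed to the Durbin-Watson test is not shown above, so the sketch below assumes it is a simple regression of one random walk on another. It builds the random walks with cumsum() rather than the simulate_ar() helper, and the choice of $n = 500$ is arbitrary. The Durbin-Watson statistic is computed by hand from its definition, so no p-value is produced.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Minimal sketch (assumption: m is a simple regression of one random walk on another)
set.seed(1)
n <- 500
x <- cumsum(rnorm(n))  # random walk, i.e., an AR(1) process with phi = 1
y <- cumsum(rnorm(n))  # second, independent random walk

m <- lm(y ~ x)

# The t statistic of the slope coincides with that of the Pearson correlation
t_lm <- coef(summary(m))[2, 3]
t_cor <- unname(cor.test(x, y)$statistic)
same_t <- isTRUE(all.equal(t_lm, t_cor))

# Regress the residuals on their own first lag (no intercept) to estimate delta
e <- resid(m)
delta_hat <- unname(coef(lm(e[-1] ~ 0 + e[-n])))

# Durbin-Watson statistic from its definition: values near 2 suggest no
# autocorrelation, values near 0 suggest strong positive autocorrelation
dw <- sum(diff(e)^2) / sum(e^2)

c(same_t = same_t, delta_hat = delta_hat, dw = dw)</code></pre></figure>
<p>The hand-computed statistic is approximately $2 (1 - \hat{\delta})$, which is why a residual autocorrelation close to one goes together with a Durbin-Watson statistic close to zero.</p>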
<!-- In the next section, we will look more formally into the curious fact that two independent random walks are correlated. To understand why even with large $n$ the estimation goes awry, we have to make an excursion into asymptotia. -->
<h2 id="inconsistent-estimation">Inconsistent estimation</h2>
<p>The random walk simulations showed that the average (absolute) correlation stays roughly constant, while the test statistic increases with $n$. This indicates a problem with our estimator for the correlation. Because it is slightly easier to study, we focus on the regression parameter $\beta_1$ instead of the Pearson correlation. <a href="https://fdabl.github.io/r/Curve-Fitting-Gaussian.html">Recall</a> that our regression estimate is</p>
<script type="math/tex; mode=display">\hat{\beta}_1 = \frac{\sum_{t=1}^n (x_t - \bar{x})(y_t - \bar{y})}{\sum_{t=1}^n (x_t - \bar{x})^2} \enspace ,</script>
<p>where $\bar{x}$ and $\bar{y}$ are the empirical means of the realizations $x_t$ and $y_t$ of the AR(1) processes $X_t$ and $Y_t$, respectively. The test statistic associated with the null hypothesis $\beta_1 = 0$ is</p>
<script type="math/tex; mode=display">t_{\text{statistic}} := \frac{\hat{\beta}_1 - 0}{se(\hat{\beta}_1)} = \frac{\hat{\beta}_1}{\hat{\sigma} / \sqrt{\sum_{t=1}^n (x_t - \bar{x})^2}} \enspace ,</script>
<p>where $\hat{\sigma}$ is the estimated standard deviation of the error. In simple linear regression, the test statistic follows a t-distribution with $n - 2$ degrees of freedom (it takes two parameters to fit a straight line). In the case of independent random walks, however, the test statistic does not have a limiting distribution; in fact, as $n \rightarrow \infty$, the distribution of $t_{\text{statistic}}$ diverges (Phillips, 1986).</p>
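<p>As a rough summary of the asymptotics (this is my paraphrase of the results in Phillips, 1986, for two independent random walks; the paper gives the precise statements and conditions, and the symbols $\xi$ and $\zeta$ are placeholders introduced here), the slope estimate and the rescaled test statistic behave as</p>
<script type="math/tex; mode=display">\hat{\beta}_1 \overset{d}{\rightarrow} \xi \enspace , \qquad n^{-1/2} \, t_{\text{statistic}} \overset{d}{\rightarrow} \zeta \enspace ,</script>
<p>where $\xi$ and $\zeta$ are non-degenerate random variables. In words: $\hat{\beta}_1$ does not settle down to the true value of zero but remains random even for very large $n$, and the test statistic has to be divided by $\sqrt{n}$ to have a limiting distribution at all. The Durbin-Watson statistic, in turn, converges in probability to zero.</p>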
<p>To get an intuition for this, we plot simulated sampling distributions of $\hat{\beta}_1$ and $t_{\text{statistic}}$, both for the case of regressing one independent stationary AR(1) process onto another (here with $\phi = 0.5$), and for random walk regression ($\phi = 1$).</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">regress_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">y</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">x</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">coef</span><span class="p">(</span><span class="n">summary</span><span class="p">(</span><span class="n">lm</span><span class="p">(</span><span class="n">y</span><span class="w"> </span><span class="o">~</span><span class="w"> </span><span class="n">x</span><span class="p">)))[</span><span class="m">2</span><span class="p">,</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">)]</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">bootstrap_limit</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">,</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">200</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">times</span><span class="w"> </span><span class="o">*</span><span class="w"> </span><span class="nf">length</span><span class="p">(</span><span class="n">ns</span><span class="p">),</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'b1'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">n</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">ns</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">coefs</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">regress_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">sigma</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">coefs</span><span class="p">)</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">ix</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="m">1</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">ns</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="m">100</span><span class="p">,</span><span class="w"> </span><span class="m">200</span><span class="p">,</span><span class="w"> </span><span class="m">500</span><span class="p">,</span><span class="w"> </span><span class="m">1000</span><span class="p">,</span><span class="w"> </span><span class="m">2000</span><span class="p">)</span><span class="w">
</span><span class="n">res_ar</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">.5</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span><span class="w">
</span><span class="n">res_rw</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">bootstrap_limit</span><span class="p">(</span><span class="n">ns</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.1</span><span class="p">)</span></code></pre></figure>
<p>The figure below illustrates how things go wrong when regressing one independent random walk onto the other. In contrast to the AR(1) regression, the spread of the estimate $\hat{\beta}_1$ does not decrease with $n$ in the case of a random walk regression. Instead, the estimate stays roughly within $[-0.75, 0.75]$ across all $n$. This sheds further light on the initial simulation results that the average correlation stays roughly the same. Moreover, in contrast to the AR(1) regression, for which the distribution of the test statistic does not change, the distribution of the test statistic for the random walk regression seems to diverge. This explains why the proportion of false positives increases with $n$.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-9-1.png" title="plot of chunk unnamed-chunk-9" alt="plot of chunk unnamed-chunk-9" style="display: block; margin: auto;" /></p>
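<p>The visual impression can be complemented with a numerical summary. The following is a minimal sketch rather than the code behind the figure: the helper name sim_rw_regression, the sample sizes, and the number of replications are arbitrary choices made for illustration, and the random walks are again built with cumsum() instead of simulate_ar(). For each $n$, it reports the standard deviation of the slope estimates, of the test statistics, and of the test statistics divided by $\sqrt{n}$.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"># Hypothetical helper: regress one independent random walk on another
# 'times' times and return the slope estimates (row 1) and t statistics (row 2)
sim_rw_regression <- function(n, times = 200) {
  replicate(times, {
    x <- cumsum(rnorm(n))
    y <- cumsum(rnorm(n))
    coef(summary(lm(y ~ x)))[2, c(1, 3)]
  })
}

set.seed(1)
ns <- c(100, 400, 1600)

# One column per sample size, one row per summary
spread <- sapply(ns, function(n) {
  s <- sim_rw_regression(n)
  c(
    n = n,
    sd_b1 = sd(s[1, ]),                 # spread of the slope estimates
    sd_t = sd(s[2, ]),                  # spread of the raw t statistics
    sd_t_scaled = sd(s[2, ]) / sqrt(n)  # spread after dividing by sqrt(n)
  )
})

round(t(spread), 2)</code></pre></figure>
<p>If the test statistic indeed diverges at rate $\sqrt{n}$, the spread of the raw test statistics should grow with $n$, while the rescaled column should stay of a similar magnitude.</p>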
<p>Rigorous arguments for the above statements can be found in Phillips (1986) and Hamilton (1994, pp. 577).<sup id="fnref:4"><a href="#fn:4" class="footnote">3</a></sup> The explanations feature some nice asymptotic arguments which I would love to go into in detail; however, I’m currently in Santa Fe for a summer school that has a very tightly packed programme. On that note: it is <a href="https://www.santafe.edu/engage/learn/schools/sfi-complex-systems-summer-school">very, very cool</a>. You should definitely apply next year! In addition to the stimulating lectures, wonderful people, and exciting projects, the surroundings are stunning<sup id="fnref:5"><a href="#fn:5" class="footnote">4</a></sup>.</p>
<div style="text-align:center;">
<img src="../assets/img/IAIA.jpeg" align="center" style="padding-top: 10px; padding-bottom: 10px;" width="720" height="620" />
</div>
<!-- ### Brownian Motion -->
<!-- The type of random walk we focused on in this blog post takes place in discrete, equidistant time steps.[^3] If we take the limit of $n \rightarrow \infty$, however, we move from a discrete time random walk to a continuous time Brownian motion. The gist of the argument is to make the difference $\Delta Y_t$ between time points $Y_{t+1}$ and $Y_t$ infinitesimally small. Recall that the Gaussian distribution is [closed under addition](https://fdabl.github.io/statistics/Two-Properties.html), and that -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \sum_{i=1}^t \eta_i \sim \mathcal{N}(0, t \cdot \sigma^2) \enspace \\[1em] -->
<!-- \Delta Y_t &= Y_{t+1} - Y_{t} = \sum_{i=1}^{t+1} \eta_i - \sum_{j=1}^t \eta_j = \eta_t \sim \mathcal{N}(0, \sigma^2) \enspace . -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- We may cut $\eta_t$ into $n$ pieces -->
<!-- $$ -->
<!-- \eta_t = \eta_{1t} + \eta_{2t} + \ldots + \eta_{nt} \enspace , -->
<!-- $$ -->
<!-- where $\eta_{it} \sim \mathcal{N}(0, \frac{1}{n})$. Therefore, as we increase $n$, the discrete-time process is defined at a finer and finer grid. For $n \rightarrow \infty$, this results into the continuous-time Brownian motion, which we denote as $W(t)$, where $W: t \in [0, 1] \rightarrow \mathbb{R}$. -->
<!-- ## Solutions -->
<!-- Hamilton (1994, p. 562) discusses three solutions. One of them is to *difference* the data before doing the regression, i.e., -->
<!-- $$ -->
<!-- \Delta Y_t = \beta_0 + \beta_1 \Delta X_t + \epsilon_t \enspace , -->
<!-- $$ -->
<!-- where $\Delta Y_t = Y_{t+1} - Y_t$. This does in fact work: -->
<!-- ```{r} -->
<!-- broom::tidy(lm(diff(rw1) ~ diff(rw2))) -->
<!-- ``` -->
<!-- ```{r, echo = FALSE} -->
<!-- n <- 1000 -->
<!-- dat <- matrix(0, nrow = n, ncol = 2) -->
<!-- B <- cbind( -->
<!-- c(.4, .2), -->
<!-- c(-.2, .4) -->
<!-- ) -->
<!-- for (i in seq(2, n)) { -->
<!-- z <- rnorm(1) -->
<!-- # dat[i, ] <- dat[i-1, ] %*% B + rnorm(2) -->
<!-- dat[i, ] <- c(.8, .4) * z + rnorm(2) -->
<!-- } -->
<!-- ``` -->
<!-- Why? Let $\eta_t$ and $\psi_t$ denote the errors of the two processes $Y$ and $X$, respectively, distributed according to zero-mean Gaussian with variances $\sigma_y$ and $\sigma_x$. We write -->
<!-- $$ -->
<!-- \Delta Y_t = \sum_{i=1}^{t+1} \eta_i - \sum_{i=1}^{t} \eta_i = \eta_{t+1} \sim \mathcal{N}(0, \sigma_y^2) \\[1em] -->
<!-- \Delta X_t = \sum_{i=1}^{t+1} \psi_i - \sum_{i=1}^{t} \eta_i = \psi_{t+1} \sim \mathcal{N}(0, \sigma_x^2) \enspace . -->
<!-- $$ -->
<!-- Now, since the respective differences are independent of each other, their correlation will be zero. -->
<!-- However, Hamilton notes that if the time-series are really stationary ($\vert \phi \lvert < 1$), then this can result in misspecified regression. Moreover, if $Y$ and $X$ are non-stationary but *cointegrated processes*, then this also will result in misspecification. -->
<h2 id="conclusion">Conclusion</h2>
<p>“Correlation does not imply causation” is a common response to apparently spurious correlation. The idea is that we observe spurious associations because we do not have the full causal picture, as in the example of storks and human babies. In this blog post, we have seen that spurious correlation can also arise for purely statistical reasons. In particular, we have seen that two independent random walks can be highly correlated. This can be diagnosed by looking at the residuals, which will <em>not</em> be independent and identically distributed, but will show a pronounced autocorrelation.</p>
<p>The mathematical explanation for the spurious correlation is not trivial. Using simulations, we found that the estimate of $\beta_1$ does not converge to the true value in the case of regressing one independent random walk onto another. Moreover, the test statistic diverges, meaning that with increasing sample size we are almost certain to reject the null hypothesis of no association. The spurious correlation occurs because our estimate is not consistent, which is a purely statistical explanation that does not invoke causal reasoning.</p>
<hr />
<p><em>I want to thank Toni Pichler and Andrea Bacilieri for helpful comments on this blog post.</em></p>
<hr />
<h2 id="post-scriptum">Post Scriptum</h2>
<!-- ### Mean and variance of AR(1) and random walk -->
<!-- To better understand the differences between AR(1) processes and random walks, we look at their respective first two moments. We write out the process for some window of length $j$, and then recursively substitute: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- Y_t &= \phi \, Y_{t-1} + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, Y_{t-2} + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \phi \, \left(\phi \, \left(\phi \, Y_{t-3} + \epsilon_{t-2}\right) + \epsilon_{t-1}\right) + \epsilon_t \\[.5em] -->
<!-- &= \vdots \\[.5em] -->
<!-- &= \phi^{j + 1} \, Y_{t - (j + 1)} + \sum_{i=t}^{t - (j + 1)} \phi^i \epsilon_{t-i} \\[.5em] -->
<!-- &= \sum_{i=0}^{t-1} \phi^i \epsilon_{t-i} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we assume that $Y_0 = 0$ is fixed. Let's compute the first two moments of this process. Exploiting linearity, we write: -->
<!-- $$ -->
<!-- \mathbb{E}[Y_t] = \mathbb{E}\left[\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \mathbb{E}\left[\phi^i \epsilon_{t-i}\right] = \sum_{i=0}^{t-1} \phi^i \mathbb{E}\left[\epsilon_{t-i}\right] = 0 \enspace . -->
<!-- $$ -->
<!-- This is also true for $\phi = 1$, i.e., a random walk. For the variance, we write: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\left(Y_t - \mathbb{E}[Y_t]\right)^2\right] -->
<!-- = \mathbb{E}\left[Y_t^2\right] -->
<!-- = \mathbb{E}\left[\left(\sum_{i=0}^{t-1} \phi^i \epsilon_{t-i}\right)^2\right] \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where we split the quadratic into ["diagonal"](https://math.stackexchange.com/questions/125435/what-is-the-opposite-of-a-cross-term) terms and cross-terms, the latter of which have expectation zero by our assumption that the residuals are independent: -->
<!-- $$ -->
<!-- \begin{aligned} -->
<!-- \text{Var}\left[Y_t\right] &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2 + \sum_{i=0}^{t - 1} \sum_{j\neq i}^{t - 1} \left(\phi^i \epsilon_{t-i}\right) \left(\phi^j \epsilon_{t-j}\right)\right] \\[.5em] -->
<!-- &= \mathbb{E}\left[\sum_{i=0}^{t - 1} \left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \mathbb{E}\left[\left(\phi^i \epsilon_{t-i}\right)^2\right] \\[.5em] -->
<!-- &= \sum_{i=0}^{t - 1} \left(\phi^i\right)^2 \mathbb{E}\left[\epsilon_{t-i}^2\right] \\[.5em] -->
<!-- &= \sigma^2\sum_{i=0}^{t - 1} \left(\phi^2\right)^i \\[.5em] -->
<!-- &= \sigma^2 \frac{1}{1 - \phi^2} \enspace , -->
<!-- \end{aligned} -->
<!-- $$ -->
<!-- where the last line follows when $N \rightarrow \infty$ for $\vert\phi\vert < 1$ from a geometric series. For a random walk, however, this is not a geometric series anymore; it therefore does not converge, and the variance of a random walk does not exist. -->
<h3 id="spurious-correlation-of-ar1-processes">Spurious correlation of AR(1) processes</h3>
<p>In the main text, we have looked at how the spurious correlation behaves for a random walk. Here, we study how the spurious correlation behaves as a function of $\phi \in [0, 1]$. We focus on a sample size of $n = 200$, and adapt the simulation code from above.</p>
<figure class="highlight"><pre><code class="language-r" data-lang="r"><span class="n">set.seed</span><span class="p">(</span><span class="m">1</span><span class="p">)</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">200</span><span class="w">
</span><span class="n">times</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="m">100</span><span class="w">
</span><span class="n">phis</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="m">0</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">,</span><span class="w"> </span><span class="m">.02</span><span class="p">)</span><span class="w">
</span><span class="n">comb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">expand.grid</span><span class="p">(</span><span class="n">times</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">times</span><span class="p">),</span><span class="w"> </span><span class="n">n</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phis</span><span class="p">)</span><span class="w">
</span><span class="n">ncomb</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">nrow</span><span class="p">(</span><span class="n">comb</span><span class="p">)</span><span class="w">
</span><span class="n">res</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">matrix</span><span class="p">(</span><span class="kc">NA</span><span class="p">,</span><span class="w"> </span><span class="n">nrow</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">ncomb</span><span class="p">,</span><span class="w"> </span><span class="n">ncol</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">6</span><span class="p">)</span><span class="w">
</span><span class="n">colnames</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'ix'</span><span class="p">,</span><span class="w"> </span><span class="s1">'n'</span><span class="p">,</span><span class="w"> </span><span class="s1">'phi'</span><span class="p">,</span><span class="w"> </span><span class="s1">'cor'</span><span class="p">,</span><span class="w"> </span><span class="s1">'tstat'</span><span class="p">,</span><span class="w"> </span><span class="s1">'pval'</span><span class="p">)</span><span class="w">
</span><span class="k">for</span><span class="w"> </span><span class="p">(</span><span class="n">i</span><span class="w"> </span><span class="k">in</span><span class="w"> </span><span class="n">seq</span><span class="p">(</span><span class="n">ncomb</span><span class="p">))</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">ix</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">1</span><span class="p">]</span><span class="w">
</span><span class="n">n</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">2</span><span class="p">]</span><span class="w">
</span><span class="n">phi</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">comb</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="m">3</span><span class="p">]</span><span class="w">
</span><span class="n">test</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">cor.test</span><span class="p">(</span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">),</span><span class="w"> </span><span class="n">simulate_ar</span><span class="p">(</span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">phi</span><span class="p">))</span><span class="w">
</span><span class="n">res</span><span class="p">[</span><span class="n">i</span><span class="p">,</span><span class="w"> </span><span class="p">]</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="n">ix</span><span class="p">,</span><span class="w"> </span><span class="n">n</span><span class="p">,</span><span class="w"> </span><span class="n">phi</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">estimate</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">statistic</span><span class="p">,</span><span class="w"> </span><span class="n">test</span><span class="o">$</span><span class="n">p.value</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span><span class="n">dat</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">res</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">group_by</span><span class="p">(</span><span class="n">phi</span><span class="p">)</span><span class="w"> </span><span class="o">%>%</span><span class="w">
</span><span class="n">summarize</span><span class="p">(</span><span class="w">
</span><span class="n">avg_abs_corr</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">cor</span><span class="p">)),</span><span class="w">
</span><span class="n">avg_abs_tstat</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="nf">abs</span><span class="p">(</span><span class="n">tstat</span><span class="p">)),</span><span class="w">
</span><span class="n">percent_sig</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">mean</span><span class="p">(</span><span class="n">pval</span><span class="w"> </span><span class="o"><</span><span class="w"> </span><span class="m">.05</span><span class="p">)</span><span class="w">
</span><span class="p">)</span></code></pre></figure>
<p>The figure below shows that the issue of spurious correlation gets progressively worse as the AR(1) process approaches a random walk (i.e., $\phi = 1$). Even so, as long as the process is stationary (i.e., $\lvert \phi \rvert &lt; 1$), the regression estimate remains consistent.</p>
<p><img src="/assets/img/2019-06-23-Spurious-Correlation.Rmd/unnamed-chunk-11-1.png" title="plot of chunk unnamed-chunk-11" alt="plot of chunk unnamed-chunk-11" style="display: block; margin: auto;" /></p>
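<p>As the footnotes mention, one way to avoid the spurious correlation is to difference the series. The following is a minimal sketch of that idea, not part of the original analysis: <code>rejection_rate()</code> is a hypothetical helper, the random walks are generated with <code>cumsum(rnorm(n))</code> rather than <code>simulate_ar</code>, and the values of <code>n</code> and <code>times</code> are arbitrary. It compares how often <code>cor.test</code> rejects at the 5% level for two independent random walks, before and after first-differencing.</p>

```r
# Illustrative sketch only; rejection_rate() is a hypothetical helper
# and the settings (n = 200, times = 500) are arbitrary assumptions.
set.seed(1)

rejection_rate <- function(n = 200, times = 500, difference = FALSE) {
  pvals <- replicate(times, {
    # Two independent random walks, i.e., AR(1) processes with phi = 1
    x <- cumsum(rnorm(n))
    y <- cumsum(rnorm(n))

    if (difference) {
      # First differences recover the independent innovations
      x <- diff(x)
      y <- diff(y)
    }

    cor.test(x, y)$p.value
  })

  # Proportion of tests that are significant at the 5% level
  mean(pvals < .05)
}

rate_levels <- rejection_rate(difference = FALSE)
rate_diffs  <- rejection_rate(difference = TRUE)
```

<p>Under these assumptions, <code>rate_levels</code> should lie far above the nominal 5% level, while <code>rate_diffs</code> should lie close to it, because the differenced series are just the independent noise terms.</p>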
<h2 id="references">References</h2>
<ul>
<li>Granger, C. W., & Newbold, P. (<a href="http://wolfweb.unr.edu/~zal/STAT758/Granger_Newbold_1974.pdf">1974</a>). Spurious regressions in econometrics. <em>Journal of Econometrics, 2</em>(2), 111-120.</li>
<li>Hamilton, J. D. (<a href="https://press.princeton.edu/titles/5386.html">1994</a>). <em>Time Series Analysis</em>. Princeton, US: Princeton University Press.</li>
<li>Kuiper, R. M., & Ryan, O. (<a href="https://www.tandfonline.com/doi/full/10.1080/10705511.2018.1431046">2018</a>). Drawing conclusions from cross-lagged relationships: Re-considering the role of the time-interval. <em>Structural Equation Modeling: A Multidisciplinary Journal, 25</em>(5), 809-823.</li>
<li>Matthews, R. (<a href="https://onlinelibrary.wiley.com/doi/abs/10.1111/1467-9639.00013?casa_token=cWUllTD9P14AAAAA:PRERZz-uS2z9xX3DGt0-Qize94FuZuw-35s-2ECfUDY9Oi3J1m83cZh8EBHGlGh7fwQ2WHShOQuwB-YO">2000</a>). Storks deliver babies (p = 0.008). <em>Teaching Statistics, 22</em>(2), 36–38.</li>
<li>Phillips, P. C. (<a href="http://dido.econ.yale.edu/korora/phillips/pubs/art/a044.pdf">1986</a>). Understanding spurious regressions in econometrics. <em>Journal of Econometrics, 33</em>(3), 311-340.</li>
</ul>
<h2 id="footnotes">Footnotes</h2>
<div class="footnotes">
<ol>
<li id="fn:1">
<p>There are, of course, many <a href="https://www.tylervigen.com/spurious-correlations">more</a>. <a href="#fnref:1" class="reversefootnote">↩</a></p>
</li>
<li id="fn:2">
<p>Thanks to Toni Pichler for drawing my attention to the fact that independent random walks are correlated, and Andrea Bacilieri for providing me with the classic references. <a href="#fnref:2" class="reversefootnote">↩</a></p>
</li>
<li id="fn:4">
<p>Moreover, one way to avoid the spurious correlation is to <em>difference</em> the time series. For other approaches, see Hamilton (1994, p. 561). <a href="#fnref:4" class="reversefootnote">↩</a></p>
</li>
<li id="fn:5">
<p>This awesome picture was made by Luther Seet. <a href="#fnref:5" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>