12th January, 2021

Causal inference

  • In all previous examples, we can achieve good to excellent predictions
  • But we do not merely want to predict systems, but also change them!

  • Causal inference goes beyond prediction by modeling the outcome of interventions
    • Ask not what \(Y\) is likely to be if \(X\) happened to be \(x\)
    • Ask what \(Y\) is likely to be if \(X\) were set to \(x\)

  • It provides tools that allow us to draw causal conclusions from observational data
    • Because randomized experiments are often infeasible, unethical, or impossible

Pearl, Glymour, & Jewell (2016)

Outline

  • 1) Core concepts
    • Causality versus causal inference
    • Relating causal to probabilistic statements (DAGs)
    • Formalizing interventions, causal effects, and confounding
    • Simpson’s paradox
    • Structural Causal Models

  • 2) Exercises I

  • 3) Causal discovery
    • PC Algorithm
    • Invariant causal prediction
    • Restricted SCMs
    • Synthetic control
    • Interrupted Time-series

  • 4) Exercises II

Core concepts

Causality versus Causal inference

  • David Hume defined a cause “[…] to be an object, followed by another, and where all the objects similar to the first, are followed by objects similar to the second.” (Hume, 1748, section VII)

  • A key problem is that most causes are not invariably followed by their effects
    • Not everybody who smokes gets lung cancer
    • Instead, smoking increases the probability of getting lung cancer

  • Causality is a big topic in philosophy
  • We are interested here in causal inference
    • Identifying and estimating causal effects in the population
    • Causal effects = numerical quantities measuring changes in the distribution of an outcome under different interventions (e.g., Hernán & Robins, 2020)

Linking association to causation

  • Reichenbach’s common cause principle: “If two random variables \(X\) and \(Y\) are statistically dependent (\(X \not \perp Y\)), then either (a) \(X\) causes \(Y\), (b) \(Y\) causes \(X\), or (c) there exists a third variable \(Z\) that causes both \(X\) and \(Y\). Further, \(X\) and \(Y\) become independent given \(Z\), i.e., \(X \perp Y \mid Z\).”

  • Probabilistic independence of \(X\) and \(Y\) means that
    • \(P(Y, X) = P(Y \mid X) \, P(X) = P(Y) \, P(X)\)
  • Probabilistic independence of \(X\) and \(Y\) given \(Z\) means that
    • \(P(Y, X, Z) = P(Y \mid Z, X) \, P(Z, X) = P(Y \mid Z) \, P(Z, X)\)

  • The language of probability is limiting
  • We cannot, for example, state that rain causes the streets to be wet
  • We use Directed Acyclic Graphs (DAGs) to express such causal statements

Pearl (2000), Spirtes et al. (1993), Peters et al. (2017)

Three fundamental structures

  • DAGs depict causal relations and imply certain (conditional) independencies
  • The three fundamental structures are the chain \(X \rightarrow Z \rightarrow Y\), the fork \(X \leftarrow Z \rightarrow Y\), and the collider \(X \rightarrow Z \leftarrow Y\)

Large DAGs and d-separation

  • d-separation allows us to read off (conditional) independencies from any DAG
  • We need to define a few concepts first:
    • A path from \(X\) to \(Y\) is a sequence of nodes & edges such that the start & end nodes are \(X\) and \(Y\)
    • A conditioning set \(\mathcal{L}\) is the set of nodes we condition on (it can be empty)
    • Conditioning on a non-collider along a path blocks that path
    • A collider along a path blocks that path
    • Conditioning on a collider or any of its descendants unblocks a path (e.g., \(X \rightarrow W \leftarrow Y\))

  • Two nodes \(X\) and \(Y\) are d-separated by \(\mathcal{L}\) if and only if members of \(\mathcal{L}\) block all paths between \(X\) and \(Y\)
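  • A minimal programmatic check of these rules (a sketch, assuming the dagitty R package; the DAG below is hypothetical):

library('dagitty')

# Hypothetical DAG containing a chain (via Z), a fork (via W), and a collider (via C)
g <- dagitty('dag {
  X -> Z
  Z -> Y
  W -> X
  W -> Y
  X -> C
  Y -> C
}')

dseparated(g, 'X', 'Y', list())            # FALSE: the paths via Z and W are open
dseparated(g, 'X', 'Y', c('Z', 'W'))       # TRUE: both non-collider paths are blocked
dseparated(g, 'X', 'Y', c('Z', 'W', 'C'))  # FALSE: conditioning on the collider C unblocks a path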

Linking independence models

  • d-separation gives us an independence model \(\perp_{\mathcal{G}}\) that is defined on graphs
  • Probability theory gives us an independence model \(\perp_{\mathcal{P}}\) that is defined on random variables

  • The causal Markov condition relates the two:

\[ X \perp_{\mathcal{G}} Y \mid Z \Rightarrow X \perp_{\mathcal{P}} Y \mid Z \enspace . \]

  • If nodes \(X\) and \(Y\) are \(d\)-separated by \(Z\), then they are probabilistically independent given \(Z\)
  • This condition implies (and is implied by) the following factorization:

\[ p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace , \]

  • where \(n\) is the number of nodes and \(\text{pa}^{\mathcal{G}}(X_i)\) are the parents of node \(X_i\)
  • In other words, a node is independent of its non-descendants given its parents
Peters, Janzing, & Schölkopf (2017, p. 101)

Formalizing interventions

  • \(p(Y \mid X = x)\) describes what values of \(Y\) are likely if \(X\) happened to be \(x\)
    • We call this the observational distribution
  • \(p(Y \mid do(X = x))\) describes what values of \(Y\) are likely if \(X\) is set to \(x\)
    • We call this the interventional distribution

Formalizing interventions

  • An intervention \(do(X = x)\) implies cutting all incoming edges to \(X\)

Formalizing interventions

  • \(p(Y \mid do(X = x))\) is the observational distribution \(p_m(Y \mid X = x)\) in the manipulated DAG
  • For the left-most and right-most DAG, we have that \(p(Y \mid do(X = x)) = p(Y \mid X = x)\)
  • For the two middle DAGs, we have to do some work:

\[ \begin{aligned} p(Y = y \mid do(X = x)) &= p_m(Y = y \mid X = x) \\[0.50em] &= \sum_{z} p_m(Y = y, Z = z \mid X = x) \hspace{5.25em} \text{... Sum rule} \\[0.50em] &= \sum_{z} p_m(Y = y \mid X = x, Z = z) \, p_m(Z = z) \hspace{1.45em} \text{... Product rule} \\[0.50em] &= \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \hspace{2.5em} \text{... Assumptions} \end{aligned} \]

  • Assumption I: Mechanism is independent of whether \(X = x\) or \(do(X = x)\) (invariance)
  • Assumption II: Interventions only change a single node (no ‘fat hand’)
  • Assumption III: There is no unobserved confounding (causal sufficiency)
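  • A minimal simulation sketch of why adjustment works (hypothetical linear Gaussian SCM with confounder \(Z \rightarrow X\), \(Z \rightarrow Y\), and causal effect \(X \rightarrow Y\) equal to 1; in the linear case, adjusting for \(Z\) in a regression plays the role of the sum over \(z\) above):

set.seed(1)
n <- 1e5

# Hypothetical confounded SCM: Z -> X, Z -> Y, and X -> Y with causal effect 1
Z <- rnorm(n)
X <- Z + rnorm(n)
Y <- X + 2 * Z + rnorm(n)

coef(lm(Y ~ X))['X']      # naive conditioning: biased (approx. 2)
coef(lm(Y ~ X + Z))['X']  # adjusting for Z recovers the causal effect (approx. 1)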

Confounding and valid adjustment

  • The causal effect of \(X\) on \(Y\) is confounded if \(p(Y \mid do(X = x)) \neq p(Y \mid X = x)\)
  • In observational data, there will always be confounding factors
    • What variables should we adjust for?
    • This requires knowledge about the underlying DAG
    • (Don’t just adjust for all variables — can induce bias!)

  • One valid adjustment set is given by the parents of \(X\)
  • Another one is given by the backdoor criterion (e.g., Pearl, Glymour, & Jewell, 2016, p. 61)
  • Rationale:
    • We block all spurious paths between \(X\) and \(Y\)
    • We leave all directed paths from \(X\) to \(Y\) unperturbed
    • We create no new spurious paths
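  • A minimal sketch of finding such sets programmatically (again assuming the dagitty R package; the DAG is hypothetical, with confounder \(Z\) and mediator \(M\)):

library('dagitty')

# Hypothetical DAG: Z confounds X and Y; M mediates the effect of X on Y
g <- dagitty('dag {
  Z -> X
  Z -> Y
  X -> M
  M -> Y
}')

# Returns { Z }: blocks the backdoor path X <- Z -> Y and leaves X -> M -> Y untouched
adjustmentSets(g, exposure = 'X', outcome = 'Y')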

Recap

  • Causal inference is distinct from causality and requires a language that goes beyond association
  • Directed Acyclic Graphs (DAGs) provide us with a way to state causal relations

  • Reichenbach’s common cause principle links associations to causation
  • It is implied by the causal Markov condition (e.g., Peters et al., 2017, p. 104), which
    • implies the factorization \(p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i))\)
    • relates causal relationships to probabilistic relationships \(X \perp_{\mathcal{G}} Y \mid Z \Rightarrow X \perp_{\mathcal{P}} Y \mid Z\)

  • Three fundamental DAG structures: chain, fork, collider
  • For larger DAGs, \(d\)-separation helps us to find conditional independencies

  • The \(do\)-calculus formalizes interventions
  • Confounding occurs if \(p(Y \mid do(X = x)) \neq p(Y \mid X = x)\)
  • DAGs help us find adjustment sets that unconfound the causal effect, if such sets exist

\[ p(Y = y \mid do(X = x)) = \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \]

Simpson’s paradox

  • 350 patients chose to take a drug and 350 chose not to
  • Given these data, is the drug helpful or harmful?
  • Should a doctor prescribe the drug to a patient?

Simpson’s paradox

  • The data are exactly the same in both cases
  • Statistics alone cannot provide an answer
  • We need to have an understanding of the causal story behind these data
  • We can visualize the causal relations using DAGs

Simpson’s paradox

  • Suppose we know that estrogen has a negative effect on recovery
  • Note also that more women choose the drug than men
  • Therefore, being a woman has an effect on drug taking as well as recovery
    • Should condition on gender!
    • This blocks the backdoor path \(D \leftarrow G \rightarrow R\)
    • And therefore unconfounds the effect \(D \rightarrow R\)

Simpson’s paradox

  • Blood pressure is measured after taking the drug
    • It cannot cause choosing the drug
    • Instead, it is a mechanism of how the drug works
    • Should not condition on blood pressure!
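  • A minimal simulation sketch of both stories (hypothetical linear versions, not the numbers from the patient example): adjusting for the confounder (gender) recovers the true effect, while adjusting for the mediator (blood pressure) would wrongly remove it

set.seed(1)
n <- 1e5

# Scenario 1: gender G confounds drug choice D and recovery R
G <- rbinom(n, 1, 0.5)
D <- rbinom(n, 1, 0.2 + 0.6 * G)    # women choose the drug more often
R <- 0.1 * D - 0.5 * G + rnorm(n)   # the drug helps, but G lowers recovery

coef(lm(R ~ D))['D']       # unadjusted: the drug looks harmful
coef(lm(R ~ D + G))['D']   # adjusted for G: approx. the true effect 0.1

# Scenario 2: blood pressure B mediates the effect of the drug
D2 <- rbinom(n, 1, 0.5)
B  <- -0.8 * D2 + rnorm(n)          # the drug lowers blood pressure
R2 <- -0.5 * B + rnorm(n)           # lower blood pressure aids recovery

coef(lm(R2 ~ D2))['D2']      # total effect via the mechanism (approx. 0.4)
coef(lm(R2 ~ D2 + B))['D2']  # conditioning on the mediator removes it (approx. 0)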

Structural Causal Models

Peters, Janzing, & Schölkopf (2017, p. 85)

Structural Causal Models

  • Structural Causal Models (SCMs) are the fundamental building blocks of causal inference
    • We understand the relations between variables in an SCM to be causal
    • Here, we will assume acyclic, linear SCMs with Gaussian error terms
    • The relations in such SCMs can be visualized in DAGs

  • Example:

\[ \begin{aligned} X &:= \epsilon_X \\[.5em] Y &:= X + \epsilon_Y \\[.5em] Z &:= Y + \epsilon_Z \enspace , \end{aligned} \]

  • with \(\epsilon_X, \epsilon_Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1)\), \(\epsilon_Z \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 0.1)\), and \(\epsilon_X \perp \epsilon_Y \perp \epsilon_Z\)

  • Note the factorization of the joint distribution:

\[ p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \hspace{6em} p(X, Y, Z) = p(Z \mid Y) \, p(Y \mid X) \, p(X) \]

Structural Causal Models

set.seed(1)

# Simulate n observations from the SCM above (X -> Y -> Z)
n <- 100
x <- rnorm(n, 0, 1)
y <- x + rnorm(n, 0, 1)
z <- y + rnorm(n, 0, 0.1)

Average causal effect

  • We define the average causal effect (ACE) as

\[ ACE(Z \rightarrow Y) = \mathbb{E}\left[Y \mid do(Z = z + 1) \right] - \mathbb{E}\left[Y \mid do(Z = z) \right] \enspace . \]

  • For linear Gaussian SCMs, the expectations are easy to evaluate
  • In our example, the average causal effect \(Z \rightarrow Y\) is given by:

\[ ACE(Z \rightarrow Y) = \mathbb{E}[X + \epsilon_Y] - \mathbb{E}[X + \epsilon_Y] = 0 \enspace . \]

  • On the other hand, the average causal effect \(X \rightarrow Y\) is given by:

\[ \begin{aligned} ACE(X \rightarrow Y) &= \mathbb{E}[X + 1 + \epsilon_Y] - \mathbb{E}[X + \epsilon_Y] \\[0.50em] &= 1 + \mathbb{E}[X + \epsilon_Y] - \mathbb{E}[X + \epsilon_Y] \\[0.50em] &= 1 \enspace . \end{aligned} \]

  • We can extend this beyond the first moment (e.g., Gische & Völkle, 2020)
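  • A minimal simulation sketch of the two contrasts above under the same SCM (the helper functions are hypothetical; intervening means replacing a structural equation by the fixed value):

set.seed(1)
n <- 1e5

# Y under do(X = x): Y := X + eps_Y with X fixed to x
y_do_x <- function(x) x + rnorm(n)

# Y under do(Z = z): intervening on Z leaves the equations for X and Y untouched
y_do_z <- function(z) rnorm(n) + rnorm(n)   # X := eps_X, Y := X + eps_Y

mean(y_do_x(1)) - mean(y_do_x(0))   # ACE(X -> Y), approx. 1
mean(y_do_z(1)) - mean(y_do_z(0))   # ACE(Z -> Y), approx. 0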

Recap and resources

  • Causal DAGs allow us to be explicit about our causal assumptions
    • Assuming invariance, causal sufficiency, and no ‘fat hand’, we can derive valid adjustment sets
    • This provides a path toward drawing causal conclusions from observational data

  • Structural Causal Models (SCMs) are the soul of (this approach to) causal inference
    • Parameterize DAGs and allow us to make quantitative statements
    • They encode observational as well as interventional distributions

  • Causal inference books freely available
    • Hernán & Robins (2020)
    • Peters, Janzing, & Schölkopf (2017)
    • Pearl, Glymour, & Jewell (2016)
    • Pearl & Mackenzie (2018)

Causal Diagrams course by Miguel Hernán

Exercises I

Causal discovery

Causal discovery

  • We focus on purely observational data for now
  • Given observations of \(\mathbf{X} = (X_1, X_2, \ldots, X_n)\), can we learn the underlying causal graph?

  • Broadly speaking, there are three approaches to learning DAGs
    • Constraint-based methods (use independence relations)
    • Score-based methods (minimize some quantity over all graphs)
    • Hybrid methods (combination of both)

  • We focus on the most prominent constraint-based method, the PC-Algorithm
    • Named after Peter Spirtes & Clark Glymour (Spirtes et al., 2000)

DAG learning: First problem

  • Given observations of \(\mathbf{X} = (X_1, X_2, \ldots, X_n)\), can we learn the underlying causal graph?

DAG learning: Second problem

PC Algorithm

  • Uses independencies to learn about the DAG (Kalisch & Bühlmann, 2007; Kalisch et al., 2012)
  • Recall the causal Markov condition: for any nodes \(X\), \(Y\), and \(Z\) we have that

\[ X \perp_{\mathcal{G}} Y \mid Z \Rightarrow X \perp_{\mathcal{P}} Y \mid Z \]

  • Many causal discovery methods (including the PC-Algorithm) further assume faithfulness:

\[ X \perp_{\mathcal{P}} Y \mid Z \Rightarrow X \perp_{\mathcal{G}} Y \mid Z \]

  • Faithfulness allows inferring causal relations (as depicted in \(\mathcal{G}\)) from probabilistic associations in the data
  • The PC-Algorithm further assumes no hidden or selection variables

PC Algorithm

  • We learn a whole class of Markov equivalent DAGs that represent the same dependencies
  • Two DAGs encode the same conditional independencies if and only if (e.g., Verma & Pearl, 1991):
    • They have the same skeleton
    • They have the same v-structures

  • So-called Completed Partially Directed Acyclic Graphs (CPDAGs) encode the equivalence class
  • The classic example is the set of four DAGs over \(X\), \(Z\), \(Y\): the two chains, the fork, and the collider
  • All four share the same skeleton (simply remove the arrowheads), but only the collider contains a v-structure
  • The two chains and the fork therefore form one equivalence class (one CPDAG); the collider forms its own
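  • A quick programmatic check (a sketch, assuming the dagitty R package): enumerating the equivalence class of the chain

library('dagitty')

# The Markov equivalence class of the chain X -> Z -> Y contains three DAGs
# (both chains and the fork); the collider X -> Z <- Y is not among them
g <- dagitty('dag {
  X -> Z
  Z -> Y
}')
equivalentDAGs(g)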

PC Algorithm

  • The PC-Algorithm proceeds in two steps:
    • First, we estimate the skeleton of the DAG
    • Second, we orient edges if possible

  • Operates according to two principles:
    • There is an edge \(X_i - X_j\) if and only if \(X_i \not \perp X_j \mid S\) for all \(S \subseteq V\setminus \{X_i, X_j\}\)
    • If \(X_i - X_j - X_k\) (with \(X_i\) and \(X_k\) not adjacent), orient edges \(X_i \rightarrow X_j \leftarrow X_k\) iff \(X_i \not \perp X_k \mid S\) for all \(S\) with \(X_j \in S\)
      • In other words, we only orient edges if we can identify a collider
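  • A minimal sketch of the conditional independence tests these decisions are based on (assuming the pcalg package used in the examples below; the chain data are simulated, hypothetical):

library('pcalg')
set.seed(1)

# Simulate the chain X -> Z -> Y; columns are ordered X, Z, Y
n <- 1000
X <- rnorm(n)
Z <- X + rnorm(n)
Y <- Z + rnorm(n)
suffStat <- list('C' = cor(cbind(X, Z, Y)), 'n' = n)

gaussCItest(1, 3, NULL, suffStat)  # p-value for X independent of Y: small, keep the edge
gaussCItest(1, 3, 2, suffStat)     # p-value for X independent of Y given Z: large, remove the edge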

PC Algorithm

  • Algorithm works as follows:

    1. Create a fully connected undirected network \(\mathcal{G}\)
    2. For every pair of vertices \((X_i, X_j)\), test whether \(X_i \perp X_j\)
      • If independence holds, remove edge \(X_i - X_j\) from \(\mathcal{G}\)
    3. For every remaining edge \(X_i - X_j\) and every \(X_k\) adjacent to \(X_i\) or \(X_j\)
      • Test whether \(X_i \perp X_j \mid X_k\)
      • If independence holds, remove edge \(X_i - X_j\) from \(\mathcal{G}\)
    4. Do this until all neighbours of both \(X_i\) and \(X_j\) are exhausted
      • For example, test \(X_i \perp X_j \mid X_k, X_l\) etc. and remove edge if independence holds
    5. Orient edges when a collider is identified
    6. If applicable, orient more edges that are logically implied using Meek’s rules (Meek, 1995)

PC Algorithm: Example 1

  • Suppose the true DAG is \(X \rightarrow Z \rightarrow Y\)
  • The PC-Algorithm would proceed as follows

PC Algorithm: Example 1

library('pcalg')
set.seed(1)

# Simulate data from the chain X -> Z -> Y
n <- 1000
X <- rnorm(n)
Z <- X + rnorm(n)
Y <- Z + rnorm(n)

# Sufficient statistics for the Gaussian conditional independence test
suffStat <- list('C' = cor(cbind(X, Y, Z)), 'n' = n)
fit <- pc(suffStat = suffStat, indepTest = gaussCItest, p = 3, alpha = 0.01)

plot(fit, main = '')

PC Algorithm: Example 2

  • Suppose the true DAG is \(X \rightarrow Z \leftarrow Y\)
  • The PC-Algorithm would proceed as follows

PC Algorithm: Example 2

library('pcalg')
set.seed(1)

# Simulate data from the collider X -> Z <- Y
n <- 1000
X <- rnorm(n)
Y <- rnorm(n)
Z <- X + Y + rnorm(n)

# Sufficient statistics for the Gaussian conditional independence test
suffStat <- list('C' = cor(cbind(X, Y, Z)), 'n' = n)
fit <- pc(suffStat = suffStat, indepTest = gaussCItest, p = 3, alpha = 0.01)

plot(fit, main = '')

PC Algorithm: Issues

  • Assumes an oracle that can correctly tell us all possible conditional independencies
  • This is usually not the case, and we rely on standard conditional independence tests from statistics
  • No uncertainty quantification

  • Faithfulness (in the strong form needed when parameters are estimated from finite samples) can be very restrictive (Uhler et al., 2013)

  • PC algorithm cannot deal with hidden or selection variables (Kalisch et al., 2012)

  • Problems in practice (see e.g., Ramsey et al., 2011, for neuroscience context)
    • Indirect measurements and measurement error (see also Westfall & Yarkoni, 2016)
    • What is the ‘correct’ granularity for causal variables? (Eberhardt, 2016; Weichwald & Peters, 2020)
      • For example, ‘bad’ versus ‘good’ cholesterol
David Kinney on granularity, Santa Fe Podcast E19

Invariant Causal Prediction (ICP)

  • Given observations of \((Y, X_1, \ldots, X_n)\), can we learn the causal parents of \(Y\), \(\text{pa}^{\mathcal{G}}(Y)\)?
    • Yes, if we have data from different environments \(e \in \mathcal{E}\)
    • These can be observational and interventional, as long as we never intervene on \(Y\) directly

  • ICP exploits the invariance assumption that underlies causal inference (Peters et al. 2017)
  • In particular, we have that the conditional distribution \(p(Y \mid \text{pa}^{\mathcal{G}}(Y))\) is invariant across \(\mathcal{E}\)

  • For linear Gaussian SCMs

\[ Y^e := \mu + \sum_{i \, \in \, \text{pa}^{\mathcal{G}}(Y)} \beta_i^e \cdot X_i^e + \varepsilon_Y^e \hspace{3em} \varepsilon_Y^e \perp (X_i^e)_{i \, \in \, \text{pa}^{\mathcal{G}}(Y)} \enspace , \]

  • this means that \(\beta_i^e\) and \(\sigma\left(\varepsilon^e_Y\right)\) are the same across all \(e \in \mathcal{E}\)

Invariant Causal Prediction

  • Suppose we have \(n\) observations of \((X, Y, Z)\) under two different environments \(e \in \mathcal{E} = \{1, 2\}\)

\[ \begin{aligned} X^1 &:= \varepsilon_{X^1} \\[0.50em] Y^1 &:= X^1 + \varepsilon_{Y^1} \\[0.50em] Z^1 &:= Y^1 + \varepsilon_{Z^1} \\[0.50em] \end{aligned} \]

\[ \begin{aligned} X^2 &:= 2 + \varepsilon_{X^2}\\[0.50em] Y^2 &:= X^2 + \varepsilon_{Y^2} \\[0.50em] Z^2 &:= Y^2 + \varepsilon_{Z^2} \\[0.50em] \end{aligned} \]

Invariant Causal Prediction

library('InvariantCausalPrediction')
set.seed(1)
n <- 500

# Data from two environments: the mean of X is shifted in the second one
X <- c(rnorm(n), 2 + rnorm(n))
Y <- X + rnorm(n)
Z <- Y + rnorm(n)

# Predictor matrix and environment indicator
X <- cbind(X, Z)
colnames(X) <- c('X', 'Z')
indicator <- rep(1:2, each = n)

ICP(X, Y, indicator, test = 'exact', showCompletion = FALSE)
## 
##  accepted set of variables 1
##  accepted set of variables 1,2
## 
##  Invariant Linear Causal Regression at level 0.01 (including multiplicity correction for the number of variables)
##  Variable X shows a significant causal effect
##  
##     LOWER BOUND  UPPER BOUND  MAXIMIN EFFECT  P-VALUE    
## X         0.44         1.04            0.44  0.00048 ***
## Z         0.00         0.52            0.00  1.00000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Invariant Causal Prediction

  • Does not learn the whole DAG (which is too ambitious anyway)
  • Allows us to learn the causal parents of an outcome using data from different environments
  • Does not require faithfulness, and comes with uncertainty quantification and error control

  • Extensions include
    • Accounting for hidden variables
    • Nonlinear SCMs (Heinze-Deml, Peters, & Meinshausen 2018)
    • Time-series data (Pfister, Bühlmann, & Peters, 2018)

  • For an application to neuroscience, see Weichwald & Peters (2020)

Learning Cause-Effect Pairs

  • Methods based on conditional independence tests require at least three variables to be applicable
  • This means they cannot tell us whether \(X \rightarrow Y\) or \(Y \rightarrow X\)

The problem

  • Suppose we have the following SCM

\[ \begin{aligned} X &:= \varepsilon_X \\ Y &:= f(X, \varepsilon_Y) \\ \end{aligned} \]

  • with \(X \perp \varepsilon_Y\) and \(\varepsilon_X \perp \varepsilon_Y\)

  • The problem is that there is an equivalent SCM of the form

\[ \begin{aligned} Y &:= \varepsilon_Y \\ X &:= g(Y, \varepsilon_X) \\ \end{aligned} \]

  • with \(Y \perp \varepsilon_X\) and \(\varepsilon_X \perp \varepsilon_Y\) (e.g., Spirtes & Zhang, 2016)

  • Thus we cannot distinguish \(X \rightarrow Y\) from \(Y \rightarrow X\) when allowing these flexible SCMs
  • Assumptions about \(f\) and the distribution of \(\varepsilon\) can help us uncover the direction!
  • In addition to assuming no confounding and no selection bias (Mooij et al., 2016)

Linear Non-Gaussian Acyclic Models (LiNGAMs)

  • If \(f\) is a linear function and \(\varepsilon\) is non-Gaussian, the causal direction is identified (Shimizu et al., 2006)

Spirtes & Zhang (2016)
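  • A minimal sketch with simulated data (assuming the pcalg package, also used in the examples below; variable names and coefficients are hypothetical):

library('pcalg')
set.seed(1)

# Linear non-Gaussian pair: X -> Y with uniform noise
n <- 1000
X <- runif(n, -1, 1)
Y <- 2 * X + runif(n, -1, 1)

# Expect a nonzero entry only in row 2, column 1, i.e., X -> Y
lingam(cbind(X, Y))$Bpruned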

Nonlinear Gaussian Additive Noise Model

  • If \(f\) is a nonlinear function and \(\varepsilon\) is Gaussian, the causal direction is identified (Hoyer et al., 2009; Peters et al., 2014)

Cause-Effect pairs database

Mooij et al. (2016)

Cause-Effect pairs: Example I

library('pcalg')
options(scipen = 99, digits = 4)

# Pair 64 from the cause-effect pairs database: drinking water access and infant mortality
url <- 'https://webdav.tuebingen.mpg.de/cause-effect/'
dat <- read.table(paste0(url, 'pair0064.txt'), col.names = c('drinking_water_access', 'infant_mortality'))

# Nonzero entry in row 2, column 1: drinking water access -> infant mortality
lingam(dat)$Bpruned
##        [,1] [,2]
## [1,]  0.000    0
## [2,] -1.817    0

Cause-Effect pairs: Example II

library('mgcv')
library('dHSIC')
options(scipen = 99, digits = 4)

# Pair 38 from the cause-effect pairs database: age and BMI
url <- 'https://webdav.tuebingen.mpg.de/cause-effect/'
dat <- read.table(paste0(url, 'pair0038.txt'), col.names = c('age', 'bmi'))

# Fit nonlinear additive noise models in both directions
m1 <- gam(bmi ~ s(age), data = dat)
m2 <- gam(age ~ s(bmi), data = dat)

# Test whether the residuals are independent of the putative cause (HSIC)
c(
  dhsic.test(resid(m1), dat$age, method = 'gamma')$p.value, # Age \perp error: not rejected
  dhsic.test(resid(m2), dat$bmi, method = 'gamma')$p.value  # BMI \perp error: rejected
)
## [1] 0.12313742 0.00002698

Synthetic control: Motivation

  • From these data alone, we cannot conclude that face masks curb infections
  • The problem is that there are likely many confounding variables
  • We could try to find a suitable control group. Maybe compare Japan to Iran?
  • But this choice is very subjective (and obviously incorrect)

  • The synthetic control method provides a data-driven solution (Abadie et al., 2010; Abadie et al., 2011)

  • Compare a treated unit \(Y\) against a synthetic unit \(Y_s\)
    • \(Y_s\) is a weighted average of control units
    • It best approximates the most relevant characteristics of \(Y\) prior to treatment

Synthetic control: Setup

  • Observe \(j = 1, \ldots, J + 1\) units for time \(t = 1, \ldots, T\)
  • Suppose that \(j = 1\) is the treated unit and \(j = 2, \ldots, J + 1\) are the control units
  • Some policy takes place at time point \(T_0\), so that \(t = 1, \ldots, T_0\) is pre-intervention
  • Define two potential outcomes
    • \(Y_{1t}^N\) is the value of the treated unit at \(t\) had no intervention taken place
    • \(Y_{1t}^I\) is the value of the treated unit at \(t\) given that the intervention took place
    • The treatment effect is given by \(\alpha_t = Y_{1t}^I - Y_{1t}^N\)

  • The issue is, of course, that we don’t observe \(Y_{1t}^N\) for \(t > T_0\)
  • The synthetic control method aims to best approximate \(Y_{1t}^N\) for \(t > T_0\)

Synthetic control: Face masks

  • Mitze et al. (2020) apply the synthetic control method to face masks
  • Used a weighted combination of covariates from 409 regions to build the synthetic control

Synthetic control: CausalImpact

Synthetic control: Tesla

  • Use the S&P500 as a synthetic control
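  • A minimal sketch of such an analysis with simulated data (assuming the CausalImpact R package mentioned above; the series and intervention time are made up, not the Tesla / S&P 500 data):

library('CausalImpact')
set.seed(1)

# A control series and a treated series whose level jumps after the intervention at t = 71
control <- 100 + arima.sim(model = list(ar = 0.8), n = 100)
treated <- 1.2 * control + rnorm(100)
treated[71:100] <- treated[71:100] + 10

# First column is the response, remaining columns are the controls
dat <- cbind(y = as.numeric(treated), x = as.numeric(control))
impact <- CausalImpact(dat, pre.period = c(1, 70), post.period = c(71, 100))

summary(impact)
plot(impact)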

Interrupted Time-series Analysis (ITS)

  • Similar to synthetic control, except that we do not use a control unit (Bernal et al., 2017, 2019)
  • Leefting & Hinne (2020) apply it to Kundalini Yoga
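  • A minimal sketch of one common ITS approach, segmented regression (in the spirit of Bernal et al., 2017; the data are simulated, and this is not the method or data of the yoga application):

set.seed(1)

# Simulated series with a level change and a slope change after the interruption at t0
t    <- 1:100
t0   <- 60
post <- as.numeric(t > t0)
y    <- 5 + 0.05 * t + 2 * post + 0.1 * post * (t - t0) + rnorm(100, sd = 0.5)

# 'post' estimates the level change at t0; the interaction estimates the change in slope
fit <- lm(y ~ t + post + I(post * (t - t0)))
summary(fit)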

Happiness / Sadness across the world

Recap

  • Causal discovery is very ambitious
  • PC algorithm learns Markov equivalent DAGs from observational data
    • Assumes faithfulness and no hidden or selection variables
    • Faithfulness is a strong assumption, conditional independence testing is hard

  • ICP uses data from different environments to learn causal parents
    • Assumes invariant mechanisms (standard in causal inference)
    • Does not require faithfulness, can be extended to nonlinear case, hidden variables, time-series

  • Restricted SCMs learn the causal direction (\(X \rightarrow Y\) or \(Y \rightarrow X\))
    • LiNGAMs assume a linear function with non-Gaussian errors
    • Nonlinear Gaussian additive noise models assume a nonlinear function with Gaussian errors

  • Synthetic control methods construct a control unit to estimate the causal effect of an intervention
  • Interrupted time-series analysis is similar in spirit, but uses no (synthetic) control unit

Exercises II