12th January, 2021

Causal inference

  • In all previous examples, we can achieve good to excellent predictions
  • But we do not merely want to predict systems, but also change them!

  • Causal inference goes beyond prediction by modeling the outcome of interventions
    • Ask not what \(Y\) is likely to be if \(X\) happened to be \(x\)
    • Ask what \(Y\) is likely to be if \(X\) were set to \(x\)

  • It provides tools that allow us to draw causal conclusions from observational data
    • Because randomized experiments are often infeasible, unethical, or impossible

Pearl, Glymour, & Jewell (2016)

Outline

  • 1) Core concepts
    • Causality versus causal inference
    • Relating causal to probabilistic statements (DAGs)
    • Formalizing interventions, causal effects, and confounding
    • Simpson’s paradox
    • Structural Causal Models

  • 2) Exercises I

  • 3) Causal discovery
    • PC Algorithm
    • Invariant causal prediction
    • Restricted SCMs
    • Synthetic control
    • Interrupted Time-series

  • 4) Exercises II

Core concepts

Causality versus Causal inference

  • David Hume defined a cause “[…] to be an object, followed by another, and where all the objects similar to the first, are followed by objects similar to the second.” (Hume, 1748, section VII)

  • A key problem is that most causes are not invariably followed by their effects
    • Not everybody who smokes gets lung cancer
    • Instead, smoking increases the probability of getting lung cancer

  • Causality is a big topic in philosophy
  • We are interested here in causal inference
    • Identifying and estimating causal effects in the population
    • Causal effects = numerical quantities measuring changes in the distribution of an outcome under different interventions (e.g., Hernán & Robins, 2020)

Linking association to causation

  • Reichenbach’s common cause principle: “If two random variables \(X\) and \(Y\) are statistically dependent (\(X \not \perp Y\)), then either (a) \(X\) causes \(Y\), (b) \(Y\) causes \(X\), or (c) there exists a third variable \(Z\) that causes both \(X\) and \(Y\). Further, \(X\) and \(Y\) become independent given \(Z\), i.e., \(X \perp Y \mid Z\).”

  • Probabilistic independence of \(X\) and \(Y\) means that
    • \(P(Y, X) = P(Y \mid X) \, P(X) = P(Y) \, P(X)\)
  • Probabilistic independence of \(X\) and \(Y\) given \(Z\) means that
    • \(P(Y, X, Z) = P(Y \mid Z, X) \, P(Z, X) = P(Y \mid Z) \, P(Z, X)\)

  • The language of probability is limiting
  • We cannot, for example, state that rain causes the streets to be wet
  • We use Directed Acyclic Graphs (DAGs) to express such causal statements

Pearl (2000), Spirtes et al. (1993), Peters et al. (2017)

Three fundamental structures

  • DAGs depict causal relations and imply certain (conditional) independencies
  • The three fundamental structures are the chain \(X \rightarrow Z \rightarrow Y\), the fork \(X \leftarrow Z \rightarrow Y\), and the collider \(X \rightarrow Z \leftarrow Y\)

Large DAGs and d-separation

  • d-separation allows us to read off (conditional) independencies from any DAG
  • We need to define a few concepts first:
    • A path from \(X\) to \(Y\) is a sequence of nodes & edges such that the start & end nodes are \(X\) and \(Y\)
    • A conditioning set \(\mathcal{L}\) is the set of nodes we condition on (it can be empty)
    • Conditioning on a non-collider along a path blocks that path
    • A collider along a path blocks that path
    • Conditioning on a collider or any of its descendants unblocks a path (e.g., \(X \rightarrow W \leftarrow Y\))

  • Two nodes \(X\) and \(Y\) are d-separated by \(\mathcal{L}\) if and only if members of \(\mathcal{L}\) block all paths between \(X\) and \(Y\)
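  • A minimal programmatic check of these rules (a sketch, assuming the dagitty R package; the DAG below is hypothetical):

library('dagitty')

# Hypothetical DAG containing a chain (via Z), a fork (via W), and a collider (via C)
g <- dagitty('dag {
  X -> Z
  Z -> Y
  W -> X
  W -> Y
  X -> C
  Y -> C
}')

dseparated(g, 'X', 'Y', list())            # FALSE: the paths via Z and W are open
dseparated(g, 'X', 'Y', c('Z', 'W'))       # TRUE: both non-collider paths are blocked
dseparated(g, 'X', 'Y', c('Z', 'W', 'C'))  # FALSE: conditioning on the collider C unblocks a path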

Linking independence models

  • d-separation gives us an independence model \(\perp_{\mathcal{G}}\) that is defined on graphs
  • Probability theory gives us an independence model \(\perp_{\mathcal{P}}\) that is defined on random variables

  • The causal Markov condition relates the two:

\[ X \perp_{\mathcal{G}} Y \mid Z \Rightarrow X \perp_{\mathcal{P}} Y \mid Z \enspace . \]

  • If nodes \(X\) and \(Y\) are \(d\)-separated by \(Z\), then they are probabilistically independent given \(Z\)
  • This condition implies (and is implied by) the following factorization:

\[ p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \enspace , \]

  • where \(n\) is the number of nodes and \(\text{pa}^{\mathcal{G}}(X_i)\) are the parents of node \(X_i\)
  • In other words, a node is independent of its non-descendants given its parents
Peters, Janzing, & Schölkopf (2017, p. 101)

Formalizing interventions

  • \(p(Y \mid X = x)\) describes what values of \(Y\) are likely if \(X\) happened to be \(x\)
    • We call this the observational distribution
  • \(p(Y \mid do(X = x))\) describes what values of \(Y\) are likely if \(X\) is set to \(x\)
    • We call this the interventional distribution

Formalizing interventions

  • An intervention \(do(X = x)\) implies cutting all incoming edges to \(X\)

Formalizing interventions

  • \(p(Y \mid do(X = x))\) is the observational distribution \(p_m(Y \mid X = x)\) in the manipulated DAG
  • For the left-most and right-most DAG, we have that \(p(Y \mid do(X = x)) = p(Y \mid X = x)\)
  • For the two middle DAGs, we have to do some work:

\[ \begin{aligned} p(Y = y \mid do(X = x)) &= p_m(Y = y \mid X = x) \\[0.50em] &= \sum_{z} p_m(Y = y, Z = z \mid X = x) \hspace{5.25em} \text{... Sum rule} \\[0.50em] &= \sum_{z} p_m(Y = y \mid X = x, Z = z) \, p_m(Z = z) \hspace{1.45em} \text{... Product rule} \\[0.50em] &= \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \hspace{2.5em} \text{... Assumptions} \end{aligned} \]

  • Assumption I: Mechanism is independent of whether \(X = x\) or \(do(X = x)\) (invariance)
  • Assumption II: Interventions only change a single node (no ‘fat hand’)
  • Assumption III: There is no unobserved confounding (causal sufficiency)
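  • A minimal simulation sketch of why adjustment works (hypothetical linear Gaussian SCM with confounder \(Z \rightarrow X\), \(Z \rightarrow Y\), and causal effect \(X \rightarrow Y\) equal to 1; in the linear case, adjusting for \(Z\) in a regression plays the role of the sum over \(z\) above):

set.seed(1)
n <- 1e5

# Hypothetical confounded SCM: Z -> X, Z -> Y, and X -> Y with causal effect 1
Z <- rnorm(n)
X <- Z + rnorm(n)
Y <- X + 2 * Z + rnorm(n)

coef(lm(Y ~ X))['X']      # naive conditioning: biased (approx. 2)
coef(lm(Y ~ X + Z))['X']  # adjusting for Z recovers the causal effect (approx. 1)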

Confounding and valid adjustment

  • The causal effect of \(X\) on \(Y\) is confounded if \(p(Y \mid do(X = x)) \neq p(Y \mid X = x)\)
  • In observational data, there will always be confounding factors
    • What variables should we adjust for?
    • This requires knowledge about the underlying DAG
    • (Don’t just adjust for all variables — can induce bias!)

  • One valid adjustment set is given by the parents of \(X\)
  • Another one is given by the backdoor criterion (e.g., Pearl, Glymour, & Jewell, 2016, p. 61)
  • Rationale:
    • We block all spurious paths between \(X\) and \(Y\)
    • We leave all directed paths from \(X\) to \(Y\) unperturbed
    • We create no new spurious paths
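  • A minimal sketch of finding such sets programmatically (again assuming the dagitty R package; the DAG is hypothetical, with confounder \(Z\) and mediator \(M\)):

library('dagitty')

# Hypothetical DAG: Z confounds X and Y; M mediates the effect of X on Y
g <- dagitty('dag {
  Z -> X
  Z -> Y
  X -> M
  M -> Y
}')

# Returns { Z }: blocks the backdoor path X <- Z -> Y and leaves X -> M -> Y untouched
adjustmentSets(g, exposure = 'X', outcome = 'Y')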

Recap

  • Causal inference is distinct from causality and requires a language that goes beyond association
  • Directed Acyclic Graphs (DAGs) provide us with a way to state causal relations

  • Reichenbach’s common cause principle links associations to causation
  • It is implied by the causal Markov condition (e.g., Peters et al., 2017, p. 104), which
    • implies the factorization \(p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i))\)
    • relates causal relationships to probabilistic relationships \(X \perp_{\mathcal{G}} Y \mid Z \Rightarrow X \perp_{\mathcal{P}} Y \mid Z\)

  • Three fundamental DAG structures: chain, fork, collider
  • For larger DAGs, \(d\)-separation helps us to find conditional independencies

  • The \(do\)-calculus formalizes interventions
  • Confounding occurs if \(p(Y \mid do(X = x)) \neq p(Y \mid X = x)\)
  • DAGs help us find adjustment sets that unconfound the causal effect, if such sets exist

\[ p(Y = y \mid do(X = x)) = \sum_{z} p(Y = y \mid X = x, Z = z) \, p(Z = z) \]

Simpson’s paradox

  • 350 patients chose to take a drug and 350 chose not to
  • Given these data, is the drug helpful or harmful?
  • Should a doctor prescribe the drug to a patient?

Simpson’s paradox

  • The data are exactly the same in both cases
  • Statistics alone cannot provide an answer
  • We need to have an understanding of the causal story behind these data
  • We can visualize the causal relations using DAGs

Simpson’s paradox

  • Suppose we know that estrogen has a negative effect on recovery
  • Note also that more women choose the drug than men
  • Therefore, being a woman has an effect on drug taking as well as recovery
    • Should condition on gender!
    • This blocks the backdoor path \(D \leftarrow G \rightarrow R\)
    • And therefore unconfounds the effect \(D \rightarrow R\)

Simpson’s paradox

  • Blood pressure is measured after taking the drug
    • It cannot cause choosing the drug
    • Instead, it is a mechanism of how the drug works
    • Should not condition on blood pressure!
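  • A minimal simulation sketch of both stories (hypothetical linear versions, not the numbers from the patient example): adjusting for the confounder (gender) recovers the true effect, while adjusting for the mediator (blood pressure) would wrongly remove it

set.seed(1)
n <- 1e5

# Scenario 1: gender G confounds drug choice D and recovery R
G <- rbinom(n, 1, 0.5)
D <- rbinom(n, 1, 0.2 + 0.6 * G)    # women choose the drug more often
R <- 0.1 * D - 0.5 * G + rnorm(n)   # the drug helps, but G lowers recovery

coef(lm(R ~ D))['D']       # unadjusted: the drug looks harmful
coef(lm(R ~ D + G))['D']   # adjusted for G: approx. the true effect 0.1

# Scenario 2: blood pressure B mediates the effect of the drug
D2 <- rbinom(n, 1, 0.5)
B  <- -0.8 * D2 + rnorm(n)          # the drug lowers blood pressure
R2 <- -0.5 * B + rnorm(n)           # lower blood pressure aids recovery

coef(lm(R2 ~ D2))['D2']      # total effect via the mechanism (approx. 0.4)
coef(lm(R2 ~ D2 + B))['D2']  # conditioning on the mediator removes it (approx. 0)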

Structural Causal Models

Peters, Janzing, & Schölkopf (2017, p. 85)

Structural Causal Models

  • Structural Causal Models (SCMs) are the fundamental building blocks of causal inference
    • We understand the relations between variables in an SCM to be causal
    • Here, we will assume acyclic, linear SCMs with Gaussian error terms
    • The relations in such SCMs can be visualized in DAGs

  • Example:

\[ \begin{aligned} X &:= \epsilon_X \\[.5em] Y &:= X + \epsilon_Y \\[.5em] Z &:= Y + \epsilon_Z \enspace , \end{aligned} \]

  • with \(\epsilon_X, \epsilon_Y \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 1)\), \(\epsilon_Z \stackrel{\text{iid}}{\sim} \mathcal{N}(0, 0.1)\), and \(\epsilon_X \perp \epsilon_Y \perp \epsilon_Z\)

  • Note the factorization of the joint distribution:

\[ p(X_1, \ldots, X_n) = \prod_{i=1}^n p(X_i \mid \text{pa}^{\mathcal{G}}(X_i)) \hspace{6em} p(X, Y, Z) = p(Z \mid Y) \, p(Y \mid X) \, p(X) \]

Structural Causal Models

set.seed(1)

# Simulate n observations from the SCM above (X -> Y -> Z)
n <- 100
x <- rnorm(n, 0, 1)
y <- x + rnorm(n, 0, 1)
z <- y + rnorm(n, 0, 0.1)

Average causal effect

  • We define the average causal effect (ACE) as

\[ ACE(Z \rightarrow Y) = \mathbb{E}\left[Y \mid do(Z = z + 1) \right] - \mathbb{E}\left[Y \mid do(Z = z) \right] \enspace . \]

  • For linear Gaussian SCMs, the expectations are easy to evaluate
  • In our example, the average causal effect \(Z \rightarrow Y\) is given by:

\[ ACE(Z \rightarrow Y) = \mathbb{E}[X + \epsilon_Y] - \mathbb{E}[X + \epsilon_Y] = 0 \enspace . \]

  • On the other hand, the average causal effect \(X \rightarrow Y\) is given by:

\[ \begin{aligned} ACE(X \rightarrow Y) &= \mathbb{E}[X + 1 + \epsilon_Y] - \mathbb{E}[X + \epsilon_Y] \\[0.50em] &= 1 + \mathbb{E}[X + \epsilon_Y] - \mathbb{E}[X + \epsilon_Y] \\[0.50em] &= 1 \enspace . \end{aligned} \]

  • We can extend this beyond the first moment (e.g., Gische & Völkle, 2020)
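  • A minimal simulation sketch of the two contrasts above under the same SCM (the helper functions are hypothetical; intervening means replacing a structural equation by the fixed value):

set.seed(1)
n <- 1e5

# Y under do(X = x): Y := X + eps_Y with X fixed to x
y_do_x <- function(x) x + rnorm(n)

# Y under do(Z = z): intervening on Z leaves the equations for X and Y untouched
y_do_z <- function(z) rnorm(n) + rnorm(n)   # X := eps_X, Y := X + eps_Y

mean(y_do_x(1)) - mean(y_do_x(0))   # ACE(X -> Y), approx. 1
mean(y_do_z(1)) - mean(y_do_z(0))   # ACE(Z -> Y), approx. 0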

Recap and resources

  • Causal DAGs allow us to be explicit about our causal assumptions
    • Assuming invariance, causal sufficiency, and no ‘fat hand’, we can derive valid adjustment sets
    • This provides a path toward drawing causal conclusions from observational data

  • Structural Causal Models (SCMs) are the soul of (this approach to) causal inference
    • Parameterize DAGs and allow us to make quantitative statements
    • They encode observational as well as interventional distributions

  • Causal inference books freely available
    • Hernán & Robins (2020)
    • Peters, Janzing, & Schölkopf (2017)
    • Pearl, Glymour, & Jewell (2016)
    • Pearl & Mackenzie (2018)

Causal Diagrams course by Miguel Hernán

Exercises I

Causal discovery

Causal discovery

  • We focus on purely observational data for now
  • Given observations of \(\mathbf{X} = (X_1, X_2, \ldots, X_n)\), can we learn the underlying causal graph?

  • Broadly speaking, there are three approaches to learning DAGs
    • Constraint-based methods (use independence relations)
    • Score-based methods (minimize some quantity over all graphs)
    • Hybrid methods (combination of both)

  • We focus on the most prominent constraint-based method, the PC-Algorithm
    • Named after Peter Spirtes & Clark Glymour (Spirtes et al., 2000)

DAG learning: First problem

  • Given observations of \(\mathbf{X} = (X_1, X_2, \ldots, X_n)\), can we learn the underlying causal graph?

DAG learning: Second problem

PC Algorithm

  • Uses independencies to learn about the DAG (Kalisch & Bühlmann, 2007; Kalisch et al., 2012)
  • Recall the causal Markov condition: for any nodes \(X\), \(Y\), and \(Z\) we have that

\[ X \perp_{\mathcal{G}} Y \mid Z \Rightarrow X \perp_{\mathcal{P}} Y \mid Z \]

  • Many causal discovery methods (including the PC-Algorithm) further assume faithfulness:

\[ X \perp_{\mathcal{P}} Y \mid Z \Rightarrow X \perp_{\mathcal{G}} Y \mid Z \]

  • Faithfulness allows inferring causal relations (as depicted in \(\mathcal{G}\)) from probabilistic associations in the data
  • The PC-Algorithm further assumes no hidden or selection variables

PC Algorithm

  • We learn a whole class of Markov equivalent DAGs that represent the same dependencies
  • Two DAGs encode the same conditional independencies if and only if (e.g., Verma & Pearl, 1991):
    • They have the same skeleton
    • They have the same v-structures

  • So-called Completed Partially Directed Acyclic Graphs (CPDAGs) encode the equivalence class
  • The classic example is the set of four DAGs over \(X\), \(Z\), \(Y\): the two chains, the fork, and the collider
  • All four share the same skeleton (simply remove the arrowheads), but only the collider contains a v-structure
  • The two chains and the fork therefore form one equivalence class (one CPDAG); the collider forms its own
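  • A quick programmatic check (a sketch, assuming the dagitty R package): enumerating the equivalence class of the chain

library('dagitty')

# The Markov equivalence class of the chain X -> Z -> Y contains three DAGs
# (both chains and the fork); the collider X -> Z <- Y is not among them
g <- dagitty('dag {
  X -> Z
  Z -> Y
}')
equivalentDAGs(g)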

PC Algorithm

  • The PC-Algorithm proceeds in two steps:
    • First, we estimate the skeleton of the DAG
    • Second, we orient edges if possible

  • Operates according to two principles:
    • There is an edge \(X_i - X_j\) if and only if \(X_i \not \perp X_j \mid S\) for all \(S \subseteq V\setminus \{X_i, X_j\}\)
    • If \(X_i - X_j - X_k\) (with \(X_i\) and \(X_k\) not adjacent), orient edges \(X_i \rightarrow X_j \leftarrow X_k\) iff \(X_i \not \perp X_k \mid S\) for all \(S\) with \(X_j \in S\)
      • In other words, we only orient edges if we can identify a collider
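  • A minimal sketch of the conditional independence tests these decisions are based on (assuming the pcalg package used in the examples below; the chain data are simulated, hypothetical):

library('pcalg')
set.seed(1)

# Simulate the chain X -> Z -> Y; columns are ordered X, Z, Y
n <- 1000
X <- rnorm(n)
Z <- X + rnorm(n)
Y <- Z + rnorm(n)
suffStat <- list('C' = cor(cbind(X, Z, Y)), 'n' = n)

gaussCItest(1, 3, NULL, suffStat)  # p-value for X independent of Y: small, keep the edge
gaussCItest(1, 3, 2, suffStat)     # p-value for X independent of Y given Z: large, remove the edge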

PC Algorithm

  • Algorithm works as follows:

    1. Create a fully connected undirected network \(\mathcal{G}\)
    2. For every pair of vertices \((X_i, X_j)\), test whether \(X_i \perp X_j\)
      • If independence holds, remove edge \(X_i - X_j\) from \(\mathcal{G}\)
    3. For every remaining edge \(X_i - X_j\) and every \(X_k\) adjacent to \(X_i\) or \(X_j\)
      • Test whether \(X_i \perp X_j \mid X_k\)
      • If independence holds, remove edge \(X_i - X_j\) from \(\mathcal{G}\)
    4. Do this until all neighbours of both \(X_i\) and \(X_j\) are exhausted
      • For example, test \(X_i \perp X_j \mid X_k, X_l\) etc. and remove edge if independence holds
    5. Orient edges when a collider is identified
    6. If applicable, orient more edges that are logically implied using Meek’s rules (Meek, 1995)

PC Algorithm: Example 1

  • Suppose the true DAG is \(X \rightarrow Z \rightarrow Y\)
  • The PC-Algorithm would proceed as follows

PC Algorithm: Example 1

library('pcalg')
set.seed(1)

# Simulate data from the chain X -> Z -> Y
n <- 1000
X <- rnorm(n)
Z <- X + rnorm(n)
Y <- Z + rnorm(n)

# Sufficient statistics for the Gaussian conditional independence test
suffStat <- list('C' = cor(cbind(X, Y, Z)), 'n' = n)
fit <- pc(suffStat = suffStat, indepTest = gaussCItest, p = 3, alpha = 0.01)

plot(fit, main = '')

PC Algorithm: Example 2

  • Suppose the true DAG is \(X \rightarrow Z \leftarrow Y\)
  • The PC-Algorithm would proceed as follows

PC Algorithm: Example 2

library('pcalg')
set.seed(1)

# Simulate data from the collider X -> Z <- Y
n <- 1000
X <- rnorm(n)
Y <- rnorm(n)
Z <- X + Y + rnorm(n)

# Sufficient statistics for the Gaussian conditional independence test
suffStat <- list('C' = cor(cbind(X, Y, Z)), 'n' = n)
fit <- pc(suffStat = suffStat, indepTest = gaussCItest, p = 3, alpha = 0.01)

plot(fit, main = '')

PC Algorithm: Issues

  • Assumes an oracle that can correctly tell us all possible conditional independencies
  • This is usually not the case, and we rely on standard conditional independence tests from statistics
  • No uncertainty quantification

  • Faithfulness (in the strong form needed when parameters are estimated from finite samples) can be very restrictive (Uhler et al., 2013)

  • PC algorithm cannot deal with hidden or selection variables (Kalisch et al., 2012)

  • Problems in practice (see e.g., Ramsey et al., 2011, for neuroscience context)
    • Indirect measurements and measurement error (see also Westfall & Yarkoni, 2016)
    • What is the ‘correct’ granularity for causal variables? (Eberhardt, 2016; Weichwald & Peters, 2020)
      • For example, ‘bad’ versus ‘good’ cholesterol
David Kinney on granularity, Santa Fe Podcast E19

Invariant Causal Prediction (ICP)

  • Given observations of \((Y, X_1, \ldots, X_n)\), can we learn the causal parents of \(Y\), \(\text{pa}^{\mathcal{G}}(Y)\)?
    • Yes, if we have data from different environments \(e \in \mathcal{E}\)
    • These can be observational and interventional, as long as we never intervene on \(Y\) directly

  • ICP exploits the invariance assumption that underlies causal inference (Peters et al. 2017)
  • In particular, we have that the conditional distribution \(p(Y \mid \text{pa}^{\mathcal{G}}(Y))\) is invariant across \(\mathcal{E}\)

  • For linear Gaussian SCMs

\[ Y^e := \mu + \sum_{i \, \in \, \text{pa}^{\mathcal{G}}(Y)} \beta_i^e \cdot X_i^e + \varepsilon_Y^e \hspace{3em} \varepsilon_Y^e \perp (X_i^e)_{i \, \in \, \text{pa}^{\mathcal{G}}(Y)} \enspace , \]

  • this means that \(\beta_i^e\) and \(\sigma\left(\varepsilon^e_Y\right)\) are the same across all \(e \in \mathcal{E}\)

Invariant Causal Prediction

  • Suppose we have \(n\) observations of \((X, Y, Z)\) under two different environments \(e \in \mathcal{E} = \{1, 2\}\)

\[ \begin{aligned} X^1 &:= \varepsilon_{X^1} \\[0.50em] Y^1 &:= X^1 + \varepsilon_{Y^1} \\[0.50em] Z^1 &:= Y^1 + \varepsilon_{Z^1} \\[0.50em] \end{aligned} \]

\[ \begin{aligned} X^2 &:= 2 + \varepsilon_{X^2}\\[0.50em] Y^2 &:= X^2 + \varepsilon_{Y^2} \\[0.50em] Z^2 &:= Y^2 + \varepsilon_{Z^2} \\[0.50em] \end{aligned} \]

Invariant Causal Prediction

library('InvariantCausalPrediction')
set.seed(1)
n <- 500

# Data from two environments: the mean of X is shifted in the second one
X <- c(rnorm(n), 2 + rnorm(n))
Y <- X + rnorm(n)
Z <- Y + rnorm(n)

# Predictor matrix and environment indicator
X <- cbind(X, Z)
colnames(X) <- c('X', 'Z')
indicator <- rep(1:2, each = n)

ICP(X, Y, indicator, test = 'exact', showCompletion = FALSE)
## 
##  accepted set of variables 1
##  accepted set of variables 1,2
## 
##  Invariant Linear Causal Regression at level 0.01 (including multiplicity correction for the number of variables)
##  Variable X shows a significant causal effect
##  
##     LOWER BOUND  UPPER BOUND  MAXIMIN EFFECT  P-VALUE    
## X         0.44         1.04            0.44  0.00048 ***
## Z         0.00         0.52            0.00  1.00000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Invariant Causal Prediction

  • Does not learn the whole DAG (which is too ambitious anyway)
  • Allows us to learn the causal parents of an outcome using data from different environments
  • Does not require faithfulness, and comes with uncertainty quantification and error control

  • Extensions include
    • Accounting for hidden variables
    • Nonlinear SCMs (Heinze-Deml, Peters, & Meinshausen 2018)
    • Time-series data (Pfister, Bühlmann, & Peters, 2018)

  • For an application to neuroscience, see Weichwald & Peters (2020)

Learning Cause-Effect Pairs

  • Methods based on conditional independence tests require at least three variables to be applicable
  • This means they cannot tell us whether \(X \rightarrow Y\) or \(Y \rightarrow X\)

The problem

  • Suppose we have the following SCM

\[ \begin{aligned} X &:= \varepsilon_X \\ Y &:= f(X, \varepsilon_Y) \\ \end{aligned} \]

  • with \(X \perp \varepsilon_Y\) and \(\varepsilon_X \perp \varepsilon_Y\)

  • The problem is that there is an equivalent SCM of the form

\[ \begin{aligned} Y &:= \varepsilon_Y \\ X &:= g(Y, \varepsilon_X) \\ \end{aligned} \]

  • with \(Y \perp \varepsilon_X\) and \(\varepsilon_X \perp \varepsilon_Y\) (e.g., Spirtes & Zhang, 2016)

  • Thus we cannot distinguish \(X \rightarrow Y\) from \(Y \rightarrow X\) when allowing these flexible SCMs
  • Assumptions about \(f\) and the distribution of \(\varepsilon\) can help us uncover the direction!
  • In addition to assuming no confounding and no selection bias (Mooij et al., 2016)

Linear Non-Gaussian Acyclic Models (LiNGAMs)

  • If \(f\) is a linear function and \(\varepsilon\) is non-Gaussian, the causal direction is identified (Shimizu et al., 2006)

Spirtes & Zhang (2016)
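  • A minimal sketch with simulated data (assuming the pcalg package, also used in the examples below; variable names and coefficients are hypothetical):

library('pcalg')
set.seed(1)

# Linear non-Gaussian pair: X -> Y with uniform noise
n <- 1000
X <- runif(n, -1, 1)
Y <- 2 * X + runif(n, -1, 1)

# Expect a nonzero entry only in row 2, column 1, i.e., X -> Y
lingam(cbind(X, Y))$Bpruned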

Nonlinear Gaussian Additive Noise Model

  • If \(f\) is a nonlinear function and \(\varepsilon\) is Gaussian, the causal direction is identified (Hoyer et al., 2009; Peters et al., 2014)

Cause-Effect pairs database

Mooij et al. (2016)

Cause-Effect pairs: Example I

library('pcalg')
options(scipen = 99, digits = 4)

# Pair 64 from the cause-effect pairs database: drinking water access and infant mortality
url <- 'https://webdav.tuebingen.mpg.de/cause-effect/'
dat <- read.table(paste0(url, 'pair0064.txt'), col.names = c('drinking_water_access', 'infant_mortality'))

# Nonzero entry in row 2, column 1: drinking water access -> infant mortality
lingam(dat)$Bpruned
##        [,1] [,2]
## [1,]  0.000    0
## [2,] -1.817    0

Cause-Effect pairs: Example II

library('mgcv')
library('dHSIC')
options(scipen = 99, digits = 4)

# Pair 38 from the cause-effect pairs database: age and BMI
url <- 'https://webdav.tuebingen.mpg.de/cause-effect/'
dat <- read.table(paste0(url, 'pair0038.txt'), col.names = c('age', 'bmi'))

# Fit nonlinear additive noise models in both directions
m1 <- gam(bmi ~ s(age), data = dat)
m2 <- gam(age ~ s(bmi), data = dat)

# Test whether the residuals are independent of the putative cause (HSIC)
c(
  dhsic.test(resid(m1), dat$age, method = 'gamma')$p.value, # Age \perp error: not rejected
  dhsic.test(resid(m2), dat$bmi, method = 'gamma')$p.value  # BMI \perp error: rejected
)
## [1] 0.12313742 0.00002698

Synthetic control: Motivation

  • From these data alone, we cannot conclude that face masks curb infections
  • The problem is that there are likely many confounding variables
  • We could try to find a suitable control group. Maybe compare Japan to Iran?
  • But this choice is very subjective (and obviously incorrect)

  • The synthetic control method provides a data-driven solution (Abadie et al., 2010; Abadie et al., 2011)

  • Compare a treated unit \(Y\) against a synthetic unit \(Y_s\)
    • \(Y_s\) is a weighted average of control units
    • It best approximates the most relevant characteristics of \(Y\) prior to treatment

Synthetic control: Setup

  • Observe \(j = 1, \ldots, J + 1\) units for time \(t = 1, \ldots, T\)
  • Suppose that \(j = 1\) is the treated unit and \(j = 2, \ldots, J + 1\) are the control units
  • Some policy takes place at time point \(T_0\), so that \(t = 1, \ldots, T_0\) is pre-intervention
  • Define two potential outcomes
    • \(Y_{1t}^N\) is the value of the treated unit at \(t\) had no intervention taken place
    • \(Y_{1t}^I\) is the value of the treated unit at \(t\) given that the intervention took place
    • The treatment effect is given by \(\alpha_t = Y_{1t}^I - Y_{1t}^N\)

  • The issue is, of course, that we don’t observe \(Y_{1t}^N\) for \(t > T_0\)
  • The synthetic control method aims to best approximate \(Y_{1t}^N\) for \(t > T_0\)

Synthetic control: Face masks

  • Mitze et al. (2020) apply the synthetic control method to face masks
  • Used a weighted combination of covariates from 409 regions to build the synthetic control

Synthetic control: CausalImpact

Synthetic control: Tesla

  • Use the S&P500 as a synthetic control
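  • A minimal sketch of such an analysis with simulated data (assuming the CausalImpact R package mentioned above; the series and intervention time are made up, not the Tesla / S&P 500 data):

library('CausalImpact')
set.seed(1)

# A control series and a treated series whose level jumps after the intervention at t = 71
control <- 100 + arima.sim(model = list(ar = 0.8), n = 100)
treated <- 1.2 * control + rnorm(100)
treated[71:100] <- treated[71:100] + 10

# First column is the response, remaining columns are the controls
dat <- cbind(y = as.numeric(treated), x = as.numeric(control))
impact <- CausalImpact(dat, pre.period = c(1, 70), post.period = c(71, 100))

summary(impact)
plot(impact)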

Interrupted Time-series Analysis (ITS)

  • Similar to synthetic control, except that we do not use a control unit (Bernal et al., 2017, 2019)
  • Leefting & Hinne (2020) apply it to Kundalini Yoga
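  • A minimal sketch of one common ITS approach, segmented regression (in the spirit of Bernal et al., 2017; the data are simulated, and this is not the method or data of the yoga application):

set.seed(1)

# Simulated series with a level change and a slope change after the interruption at t0
t    <- 1:100
t0   <- 60
post <- as.numeric(t > t0)
y    <- 5 + 0.05 * t + 2 * post + 0.1 * post * (t - t0) + rnorm(100, sd = 0.5)

# 'post' estimates the level change at t0; the interaction estimates the change in slope
fit <- lm(y ~ t + post + I(post * (t - t0)))
summary(fit)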

Happiness / Sadness across the world

Recap

  • Causal discovery is very ambitious
  • PC algorithm learns Markov equivalent DAGs from observational data
    • Assumes faithfulness and no hidden or selection variables
    • Faithfulness is a strong assumption, conditional independence testing is hard

  • ICP uses data from different environments to learn causal parents
    • Assumes invariant mechanisms (standard in causal inference)
    • Does not require faithfulness, can be extended to nonlinear case, hidden variables, time-series

  • Restricted SCMs learn the causal direction (\(X \rightarrow Y\) or \(Y \rightarrow X\))
    • LiNGAMs assume a linear function with non-Gaussian errors
    • Nonlinear Gaussian additive noise models assume a nonlinear function with Gaussian errors

  • Synthetic control methods construct a control unit to estimate the causal effect of an intervention
  • Interrupted time-series analysis is similar in spirit, but uses no (synthetic) control unit

Exercises II