22nd February, 2019

Context

  • Most psychological theories are within-subject, but measurements are between-subject (Luce, 1997)
  • There is no simple mapping between the two (Molenaar, 2004)

  • Ubiquity of smartphones enables collecting intensive longitudinal data (Miller, 2012)
    • Understanding humans as dynamical systems
    • Goes hand in hand with the network paradigm (Borsboom, 2017)

Vector Autoregression

VAR Model

  • We observe some realization of the (discrete-time) \(p\)-dimensional stochastic process

\[ \{Y(t): t \in T\} \]

  • with state space \(Y(t) \in \mathbb{R}^p\)
  • Vector autoregressive (VAR) model assumes

\[ \mathbf{y}_t = \mathbf{\nu} + \mathbf{A}_1 \mathbf{y}_{t-1} + \ldots + \mathbf{A}_l \mathbf{y}_{t-l} + \mathbf{\epsilon}_t \]

  • \(\nu \in \mathbb{R}^p\) is the intercept
  • \(A_l \in \mathbb{R}^{p \times p}\) describes the coefficients at lag \(l\)
  • \(\mathbf{\epsilon}_t\) are the stochastic innovations at time \(t\), for which

\[ \begin{align} \mathbf{\epsilon}_t &\sim \mathcal{N}(0, \Sigma) \\[0.5em] \mathbb{E}[\mathbf{\epsilon}_t\mathbf{\epsilon}_{t-s}^T] &= 0 \quad \text{for } s \neq 0 \end{align} \]

  • VAR process is covariance-stationary, i.e., its first and second moments are time-invariant (see the simulation sketch below)
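As a concrete illustration, here is a minimal numpy sketch of simulating such a process for \(l = 1\); all parameter values are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(2019)
p, n = 3, 500                                   # variables, time points

nu = np.zeros(p)                                # intercept
A = 0.4 * np.eye(p) + 0.05 * rng.standard_normal((p, p))  # lag-1 coefficients
Sigma = np.eye(p)                               # innovation covariance

# Covariance-stationarity: all eigenvalues of A must lie inside the unit circle
assert np.all(np.abs(np.linalg.eigvals(A)) < 1)

y = np.zeros((n, p))
for t in range(1, n):
    eps = rng.multivariate_normal(np.zeros(p), Sigma)   # epsilon_t ~ N(0, Sigma)
    y[t] = nu + A @ y[t - 1] + eps
```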

AR Model

  • Lag \(l = 1\) VAR model requires estimating the full matrix \(A \in \mathbb{R}^{p \times p}\)
  • This may lead to high variance of the estimates when the data are limited
  • Reduce variance by introducing bias: set all off-diagonal elements to zero
  • This saves estimating \(p \times (p - 1)\) parameters (see the least-squares sketch after the matrices below)

\[ A_{\text{VAR}} = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \alpha_{13} & \alpha_{14} & \alpha_{15} & \alpha_{16} \\ \alpha_{21} & \alpha_{22} & \alpha_{23} & \alpha_{24} & \alpha_{25} & \alpha_{26} \\ \alpha_{31} & \alpha_{32} & \alpha_{33} & \alpha_{34} & \alpha_{35} & \alpha_{36} \\ \alpha_{41} & \alpha_{42} & \alpha_{43} & \alpha_{44} & \alpha_{45} & \alpha_{46} \\ \alpha_{51} & \alpha_{52} & \alpha_{53} & \alpha_{54} & \alpha_{55} & \alpha_{56} \\ \alpha_{61} & \alpha_{62} & \alpha_{63} & \alpha_{64} & \alpha_{65} & \alpha_{66} \end{pmatrix} \]

\[ A_{\text{AR}} = \begin{pmatrix} \alpha_{11} & 0 & 0 & 0 & 0 & 0 \\ 0 & \alpha_{22} & 0 & 0 & 0 & 0 \\ 0 & 0 & \alpha_{33} & 0 & 0 & 0 \\ 0 & 0 & 0 & \alpha_{44} & 0 & 0 \\ 0 & 0 & 0 & 0 & \alpha_{55} & 0 \\ 0 & 0 & 0 & 0 & 0 & \alpha_{66} \\ \end{pmatrix} \]
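To make the trade-off concrete, the sketch below reuses `y`, `A`, `n`, and `p` from the simulation above and fits both models by ordinary least squares; it illustrates the idea only and is not any particular published procedure.

```python
# Full VAR(1): regress y_t on an intercept and all of y_{t-1}
X = np.column_stack([np.ones(n - 1), y[:-1]])
B, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
A_var = B[1:].T                                 # estimated p x p coefficient matrix

# Restricted AR(1): each variable regressed on its own past only
A_ar = np.zeros((p, p))
for j in range(p):
    Xj = np.column_stack([np.ones(n - 1), y[:-1, j]])
    bj, *_ = np.linalg.lstsq(Xj, y[1:, j], rcond=None)
    A_ar[j, j] = bj[1]

# With few observations, the biased AR fit can have lower estimation error
print(np.sqrt(np.mean((A - A_var) ** 2)), np.sqrt(np.mean((A - A_ar) ** 2)))
```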

Bias-Variance Trade-off

  • Bulteel et al. (2018) compared the performance of (mixed) AR and (mixed) VAR models
    • They find equal predictive performance on three data sets
    • "[…] it is not meaningful to analyze the presented typical applications with a VAR model" (p. 14)
  • We re-evaluate this claim (Dablander, Ryan, & Haslbeck, under review)

Bayesian VAR

Proper model selection

  • Comparing AR and VAR models as a whole is of little substantive value
    • \(\Rightarrow\) Better: Test the VAR coefficients individually
    • \(\Rightarrow\) Best: Quantify uncertainty over all \(2^{p^2}\) graph structures


  • However, Bulteel et al. (2018) have a point
    • The VAR model is fairly complex
    • The amount and quality of data in typical psychological applications are rather poor
    • This can lead to overconfident conclusions


  • Thus, we need regularization!

Reasonable priors \(\equiv\) Regularization



Regularizing regression

  • LASSO estimate \(\equiv\) posterior mode with Laplace prior (Park & Casella, 2008)
  • Ridge estimate \(\equiv\) posterior mode with Gaussian prior
  • For an overview, see Van Erp, Oberski, & Mulder (2019)


  • VAR model as a system of seemingly unrelated regressions (SUR)

\[ \begin{align} y_{1t} &= \nu_1 + \mathbf{a}_1^T \mathbf{y}_{t-1} + \epsilon_{1t} \\[.5em] y_{2t} &= \nu_2 + \mathbf{a}_2^T \mathbf{y}_{t-1} + \epsilon_{2t} \\[.5em] &\;\;\vdots \\[.5em] y_{pt} &= \nu_p + \mathbf{a}_p^T \mathbf{y}_{t-1} + \epsilon_{pt} \end{align} \]

  • where \(\mathbb{E}[\epsilon_{it}\epsilon_{jt}] \neq 0\) for \(i \neq j\), i.e., innovations are contemporaneously correlated (see the sketch below)
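Because all equations share the same regressors \(\mathbf{y}_{t-1}\), equation-by-equation least squares coincides with the joint fit (the classic result for SUR with identical regressors). A small check, continuing the sketches above:

```python
# Row i of A can be estimated from equation i alone
A_rows = np.zeros((p, p))
for i in range(p):
    bi, *_ = np.linalg.lstsq(X, y[1:, i], rcond=None)
    A_rows[i] = bi[1:]

assert np.allclose(A_rows, A_var)   # identical to the joint least-squares fit
```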

Continuous vs. Discrete Shrinkage

  • Horseshoe prior


\[ \begin{align*} \alpha_{ij} \mid \lambda_{ij}, \tau &\sim \text{Normal}(0, \tau^2 \lambda_{ij}^2) \\ \lambda_{ij} &\sim \text{Cauchy}^+(0, 1) \end{align*} \]


  • \(\tau\) shrinks globally
  • \(\lambda_{ij}\) shrinks individual coefficients continuously
  • Incredibly fast


  • Carvalho, Polson, & Scott (2009, 2010)
  • Piironen & Vehtari (2017a, 2017b)
  • Spike-and-Slab Prior


\[ \begin{align*} \alpha_{ij} \mid \lambda_{ij}, \tau &\sim \text{Normal}(0, \tau^2 \lambda_{ij}^2) \\ \lambda_{ij} &\sim \text{Bernoulli}(\theta) \end{align*} \]

  • \(\tau\) shrinks globally
  • \(\lambda_{ij}\) shrinks individual coefficients to 0 or not
  • Incredibly slow: explores all \(2^{p^2}\) models (both priors are contrasted in the sketch below)


  • George & McCulloch (1993)
  • Koop (2013)
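To see the qualitative difference between the two priors, one can simply draw coefficients from each; a self-contained sketch with arbitrary illustration values for \(\tau\) and \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, tau, theta = 100_000, 0.1, 0.5

# Horseshoe: half-Cauchy local scales shrink continuously, never exactly to zero
lam_hs = np.abs(rng.standard_cauchy(m))
alpha_hs = rng.normal(0.0, tau * lam_hs)

# Spike-and-slab: Bernoulli indicators set coefficients to exactly zero or not
lam_ss = rng.binomial(1, theta, m)
alpha_ss = lam_ss * rng.normal(0.0, tau, m)

print((alpha_hs == 0).mean(), (alpha_ss == 0).mean())   # 0.0 vs. ~(1 - theta)
```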


More Continuous Shrinkage

  • Other continuous shrinkage priors
    • Minnesota prior (Litterman, 1986)
    • Normal-Gamma prior (Huber & Feldkircher, 2017)
    • Regularized Horseshoe (Piironen & Vehtari, 2017b)


  • Very fast
  • However, do not engage in structure learning
    • No posterior over all graphs
    • Structure by (arbitrary) thresholding

Full Spike-and-Slab Model

\[ \begin{align*} \alpha_{ij} \mid \lambda_{ij}, \tau_{ij} &\sim \lambda_{ij} \, \text{Normal}(0, \tau_{ij}^2) + (1 - \lambda_{ij}) \, \delta_0 \\[0.5em] \lambda_{ij} &\sim \text{Bernoulli}(\theta) \\[0.5em] \theta &\sim \text{Beta}(1, 1) \\[0.5em] \tau_{ij}^2 &\sim \text{Inverse-Gamma}(1/2, s/2) \\[0.5em] \Sigma &\sim \text{Inverse-Wishart}(S, \nu) \end{align*} \]

  • where \(s = 1/2\), \(S = I\), and \(\nu = p\)
  • Benefits:
    • Yields posterior over all possible graphs (structure learning)
    • Yields graded evidence for or against \(\mathcal{H}_0: \alpha_{ij} = 0\) (via inclusion Bayes factors; see the sketch below)
    • Model-averages (Hoeting et al., 1999): \(p(\alpha_{ij} \mid y) = \sum_{k=1}^{2^{p^2}} p(\alpha_{ij} \mid y, \mathcal{M}_k) \cdot p(\mathcal{M}_k \mid y) \hspace{1em} \text{vs.} \hspace{1em} p(\alpha_{ij} \mid y, \mathcal{M})\)
  • Downsides:
    • Computationally very expensive
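The graded evidence can be read off the posterior inclusion probability of an edge: the inclusion Bayes factor is the ratio of posterior to prior inclusion odds (marginally, \(\theta \sim \text{Beta}(1, 1)\) implies a prior inclusion probability of 0.5). A minimal sketch with a hypothetical function name:

```python
def inclusion_bf(post_incl, prior_incl=0.5):
    """Inclusion Bayes factor for H1: alpha_ij != 0 versus H0: alpha_ij = 0,
    i.e., the ratio of posterior to prior inclusion odds."""
    post_odds = post_incl / (1.0 - post_incl)
    prior_odds = prior_incl / (1.0 - prior_incl)
    return post_odds / prior_odds

# e.g., an edge included in 80% of the posterior samples of lambda_ij
print(inclusion_bf(0.8))   # 4.0: the data favor inclusion 4 to 1
```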

Simulation setup I

  • Estimation error:

\[ \text{RMSE}_{\text{est}} := \sqrt{\frac{1}{p^2} \sum_{i=1}^p \sum_{j=1}^p \left(A_{i, j} - \hat{A}_{i, j}\right)^2} \]

  • Prediction error:

\[ \text{RMSE}_{\text{pred}} := \sqrt{\frac{1}{200} \sum_{i=1}^{200} \frac{1}{p} \sum_{j=1}^p \left(Y_{i, j} - \hat{Y}_{i, j}\right)^2} \]


  • Sensitivity and specificity for structure recovery (all three metrics are sketched below)
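These metrics translate directly into numpy; reading off edges as coefficients whose absolute value exceeds a small tolerance is an implementation choice of this sketch:

```python
import numpy as np

def rmse_est(A, A_hat):
    # Root mean squared error over all p^2 coefficients
    return np.sqrt(np.mean((A - A_hat) ** 2))

def rmse_pred(Y, Y_hat):
    # Root mean squared one-step-ahead prediction error on held-out points
    return np.sqrt(np.mean((Y - Y_hat) ** 2))

def sensitivity_specificity(A, A_hat, tol=1e-8):
    # Sensitivity: proportion of true edges recovered;
    # specificity: proportion of absent edges correctly excluded
    truth, est = np.abs(A) > tol, np.abs(A_hat) > tol
    sens = (truth & est).sum() / max(truth.sum(), 1)
    spec = (~truth & ~est).sum() / max((~truth).sum(), 1)
    return sens, spec
```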

Simulation setup II

  • For each Bayesian VAR model, compute the metrics over 10 replications for all combinations of the following factors (see the generator sketch below):
    • number of variables: \(p = \{5, 10, 15\}\)
    • number of observations: \(n = \{50, 100, 200, 400\}\)
    • network density: \(d = \{0, 0.2, 0.4, 0.6, 0.8, 1\}\)
    • size of the average absolute off-diagonal elements: \(s = \{0.1, 0.2, 0.3\}\)
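A hypothetical generator for such transition matrices; the rescaling loop that enforces stationarity is a simple device of this sketch, not necessarily what was used in the simulations.

```python
import numpy as np

def make_transition_matrix(p, density, size, diag=0.3, rng=None):
    """Draw a VAR(1) matrix whose off-diagonal elements are nonzero with
    probability `density` and have absolute value `size`."""
    rng = rng or np.random.default_rng()
    A = np.zeros((p, p))
    off_diag = ~np.eye(p, dtype=bool)
    active = off_diag & (rng.random((p, p)) < density)
    A[active] = rng.choice([-size, size], size=active.sum())
    np.fill_diagonal(A, diag)
    while np.max(np.abs(np.linalg.eigvals(A))) >= 1:    # enforce stationarity
        A *= 0.95
    return A

A = make_transition_matrix(p=10, density=0.4, size=0.2)
```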

Simulation results I

Simulation results II

Simulation results III

Data example

  • Re-analyze the MindMaastricht data (Geschwind et al., 2011; Bringmann et al., 2013)
    • ESM study of \(N = 129\) people with residual depressive symptoms
    • 10 measurements per day; 6 days baseline and 6 days after treatment
    • Average time between measurements was 90 minutes
    • Focus on 6 items measured on a 7-point Likert scale during baseline (e.g., "I feel relaxed")


  • Estimate individual network
  • What can a fully probabilistic perspective give us?

Impulse response function

  • How is a variable at time point \(t + s\) influenced by an impulse to another variable at time point \(t\)? (see the sketch below)
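For a VAR(1) with transition matrix \(A\), the response at horizon \(s\) to an impulse \(\mathbf{e}\) at time \(t\) is \(A^s \mathbf{e}\) (orthogonalized IRFs would additionally factor \(\Sigma\), e.g., via its Cholesky decomposition). A minimal sketch:

```python
import numpy as np

def impulse_response(A, shock, horizon=10):
    """Responses of all variables at t, t + 1, ..., t + horizon to `shock` at t."""
    responses = [np.asarray(shock, dtype=float)]
    for _ in range(horizon):
        responses.append(A @ responses[-1])
    return np.array(responses)

A = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.4, 0.1],
              [0.0, 0.3, 0.4]])
irf = impulse_response(A, shock=[1.0, 0.0, 0.0])   # unit impulse to variable 1
```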

Extensions

  • Many things discussed in the economics literature
    • High-dimensional models
    • Modeling volatility

  • Extensions that are interesting to me
    • Hierarchical setting
    • Modeling non-stationarity
    • Modeling multicollinearity
    • Modeling more than one stable state
    • Modeling non-Gaussian data (e.g., using copulas)
    • Structured spike-and-slab priors to allow for clustering in graphs
    • Study the mathematical properties of the inclusion Bayes factors discussed here

Wrap-up

  • Claims I've made:
    • Comparing AR vs. VAR models is insufficiently granular
    • Probabilistic perspective on VAR modeling provides this granularity
    • In particular, I have argued for discrete over continuous shrinkage
      • Quantifies uncertainty over all \(2^{p^2}\) graph structures
      • Allows stating graded evidence for or against \(\mathcal{H}_0: \alpha_{ij} = 0\)


  • Preliminary results:
    • All Bayesian VAR models perform similarly well w.r.t. estimation / prediction error
    • Spike-and-slab might have a slight edge in structure recoverability
    • Found large participant heterogeneity while re-analyzing data

Thank you!

References

  • Luce, R. D. (1997). Several unresolved conceptual problems of mathematical psychology. Journal of Mathematical Psychology, 41(1), 79-87.
  • Molenaar, P. C. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement, 2(4), 201-218.
  • Borsboom, D. (2017). A network theory of mental disorders. World Psychiatry, 16(1), 5-13.
  • Bulteel, K., Mestdagh, M., Tuerlinckx, F., & Ceulemans, E. (2018). VAR(1) based models do not always outpredict AR(1) models in typical psychological applications. Psychological Methods, 23(4), 1–17.
  • Dablander, F., Ryan, O., & Haslbeck, J. (under review). Choosing between AR(1) and VAR(1) models in typical psychological applications.
  • Dablander, F., & Hinne, M. (in revision). Centrality measures as a proxy for causal influence? A cautionary tale.
  • Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.
  • Van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89, 31-50.

References

  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009, April). Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics (pp. 73-80).
  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465-480.
  • Piironen, J., & Vehtari, A. (2017a). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In Artificial Intelligence and Statistics (pp. 905-913).
  • Piironen, J., & Vehtari, A. (2017b). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.
  • George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881-889.
  • Koop, G. M. (2013). Forecasting with medium and large Bayesian VARs. Journal of Applied Econometrics, 28(2), 177-203.
  • Litterman, R. B. (1986). Forecasting with Bayesian vector autoregressions—five years of experience. Journal of Business & Economic Statistics, 4(1), 25-38.
  • Huber, F., & Feldkircher, M. (2017). Adaptive shrinkage in Bayesian vector autoregressive models. Journal of Business & Economic Statistics, 1-13.

References

  • Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4), 382-401.
  • Geschwind, N., Peeters, F., Drukker, M., van Os, J., & Wichers, M. (2011). Mindfulness training increases momentary positive emotions and reward experience in adults vulnerable to depression: A randomized controlled trial. Journal of Consulting and Clinical Psychology, 79(5), 618.
  • Bringmann, L. F., Vissers, N., Wichers, M., Geschwind, N., Kuppens, P., Peeters, F., … & Tuerlinckx, F. (2013). A network approach to psychopathology: New insights into clinical longitudinal data. PloS ONE, 8(4), e60188.