22nd February, 2019

Context

  • Most psychological theories are within-subject, but measurements are between-subject (Luce, 1997)
  • There is no simple mapping between the two (Molenaar, 2004)

  • Ubiquity of smartphones enables collecting intensive longitudinal data (Miller, 2012)
    • Understanding humans as dynamical systems
    • Goes hand in hand with the network paradigm (Borsboom, 2017)

Vector Autoregression

VAR Model

  • We observe some realization of the (discrete-time) \(p\)-dimensional stochastic process

\[ \{Y(t): t \in T\} \]

  • with state space \(Y(t) \in \mathbb{R}^p\)
  • Vector autoregressive (VAR) model assumes

\[ \mathbf{y}_t = \mathbf{\nu} + \mathbf{A}_1 \mathbf{y}_{t-1} + \ldots + \mathbf{A}_l \mathbf{y}_{t-l} + \mathbf{\epsilon}_t \]

  • \(\nu \in \mathbb{R}^p\) is the intercept
  • \(A_l \in \mathbb{R}^{p \times p}\) describes the coefficients at lag \(l\)
  • \(\mathbf{\epsilon}_t\) are the stochastic innovations at time \(t\), for which

\[ \begin{align} \mathbf{\epsilon}_t &\sim \mathcal{N}(0, \Sigma) \\[0.5em] \mathbb{E}[\mathbf{\epsilon}_t\mathbf{\epsilon}_{t-s}^T] &= 0 \quad \text{for } s \neq 0 \end{align} \]

  • VAR process is covariance-stationary, i.e., its first and second moments are time-invariant (see the simulation sketch below)
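As a concrete illustration, here is a minimal numpy sketch of simulating such a process for \(l = 1\); all parameter values are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(2019)
p, n = 3, 500                                   # variables, time points

nu = np.zeros(p)                                # intercept
A = 0.4 * np.eye(p) + 0.05 * rng.standard_normal((p, p))  # lag-1 coefficients
Sigma = np.eye(p)                               # innovation covariance

# Covariance-stationarity: all eigenvalues of A must lie inside the unit circle
assert np.all(np.abs(np.linalg.eigvals(A)) < 1)

y = np.zeros((n, p))
for t in range(1, n):
    eps = rng.multivariate_normal(np.zeros(p), Sigma)   # epsilon_t ~ N(0, Sigma)
    y[t] = nu + A @ y[t - 1] + eps
```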

AR Model

  • Lag \(l = 1\) VAR model requires estimating the full matrix \(A \in \mathbb{R}^{p \times p}\)
  • This may lead to high variance of the estimates when the data are limited
  • Reduce variance by introducing bias: set all off-diagonal elements to zero
  • This saves estimating \(p \times (p - 1)\) parameters (see the least-squares sketch after the matrices below)

\[ A_{\text{VAR}} = \begin{pmatrix} \alpha_{11} & \alpha_{12} & \alpha_{13} & \alpha_{14} & \alpha_{15} & \alpha_{16} \\ \alpha_{21} & \alpha_{22} & \alpha_{23} & \alpha_{24} & \alpha_{25} & \alpha_{26} \\ \alpha_{31} & \alpha_{32} & \alpha_{33} & \alpha_{34} & \alpha_{35} & \alpha_{36} \\ \alpha_{41} & \alpha_{42} & \alpha_{43} & \alpha_{44} & \alpha_{45} & \alpha_{46} \\ \alpha_{51} & \alpha_{52} & \alpha_{53} & \alpha_{54} & \alpha_{55} & \alpha_{56} \\ \alpha_{61} & \alpha_{62} & \alpha_{63} & \alpha_{64} & \alpha_{65} & \alpha_{66} \end{pmatrix} \]

\[ A_{\text{AR}} = \begin{pmatrix} \alpha_{11} & 0 & 0 & 0 & 0 & 0 \\ 0 & \alpha_{22} & 0 & 0 & 0 & 0 \\ 0 & 0 & \alpha_{33} & 0 & 0 & 0 \\ 0 & 0 & 0 & \alpha_{44} & 0 & 0 \\ 0 & 0 & 0 & 0 & \alpha_{55} & 0 \\ 0 & 0 & 0 & 0 & 0 & \alpha_{66} \\ \end{pmatrix} \]
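To make the trade-off concrete, the sketch below reuses `y`, `A`, `n`, and `p` from the simulation above and fits both models by ordinary least squares; it illustrates the idea only and is not any particular published procedure.

```python
# Full VAR(1): regress y_t on an intercept and all of y_{t-1}
X = np.column_stack([np.ones(n - 1), y[:-1]])
B, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
A_var = B[1:].T                                 # estimated p x p coefficient matrix

# Restricted AR(1): each variable regressed on its own past only
A_ar = np.zeros((p, p))
for j in range(p):
    Xj = np.column_stack([np.ones(n - 1), y[:-1, j]])
    bj, *_ = np.linalg.lstsq(Xj, y[1:, j], rcond=None)
    A_ar[j, j] = bj[1]

# With few observations, the biased AR fit can have lower estimation error
print(np.sqrt(np.mean((A - A_var) ** 2)), np.sqrt(np.mean((A - A_ar) ** 2)))
```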

Bias-Variance Trade-off

  • Bulteel et al. (2018) compared the performance of (mixed) AR and (mixed) VAR models
    • They find equal predictive performance on three data sets
    • "[…] it is not meaningful to analyze the presented typical applications with a VAR model" (p. 14)
  • We re-evaluate this claim (Dablander, Ryan, & Haslbeck, under review)

Bayesian VAR

Proper model selection

  • Comparing AR and VAR models as a whole is of little substantive value
    • \(\Rightarrow\) Better: Test the VAR coefficients individually
    • \(\Rightarrow\) Best: Quantify uncertainty over all \(2^{p^2}\) graph structures


  • However, Bulteel et al. (2018) have a point
    • The VAR model is fairly complex
    • The amount and quality of data in typical psychological applications are rather poor
    • This can lead to overconfident conclusions


  • Thus, we need regularization!

Reasonable priors \(\equiv\) Regularization



Regularizing regression

  • LASSO estimate \(\equiv\) posterior mode with Laplace prior (Park & Casella, 2008)
  • Ridge estimate \(\equiv\) posterior mode with Gaussian prior
  • For an overview, see Van Erp, Oberski, & Mulder (2019)


  • VAR model as a system of seemingly unrelated regressions (SUR)

\[ \begin{align} y_{1t} &= \nu_1 + \mathbf{a}_1^T \mathbf{y}_{t-1} + \epsilon_{1t} \\[.5em] y_{2t} &= \nu_2 + \mathbf{a}_2^T \mathbf{y}_{t-1} + \epsilon_{2t} \\[.5em] &\;\;\vdots \\[.5em] y_{pt} &= \nu_p + \mathbf{a}_p^T \mathbf{y}_{t-1} + \epsilon_{pt} \end{align} \]

  • where \(\mathbb{E}[\epsilon_{it}\epsilon_{jt}] \neq 0\) for \(i \neq j\), i.e., innovations are contemporaneously correlated (see the sketch below)
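Because all equations share the same regressors \(\mathbf{y}_{t-1}\), equation-by-equation least squares coincides with the joint fit (the classic result for SUR with identical regressors). A small check, continuing the sketches above:

```python
# Row i of A can be estimated from equation i alone
A_rows = np.zeros((p, p))
for i in range(p):
    bi, *_ = np.linalg.lstsq(X, y[1:, i], rcond=None)
    A_rows[i] = bi[1:]

assert np.allclose(A_rows, A_var)   # identical to the joint least-squares fit
```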

Continuous vs. Discrete Shrinkage

  • Horseshoe prior


\[ \begin{align*} \alpha_{ij} \mid \lambda_{ij}, \tau &\sim \text{Normal}(0, \tau^2 \lambda_{ij}^2) \\ \lambda_{ij} &\sim \text{Cauchy}^+(0, 1) \end{align*} \]


  • \(\tau\) shrinks globally
  • \(\lambda_{ij}\) shrinks individual coefficients continuously
  • Incredibly fast


  • Carvalho, Polson, & Scott (2009, 2010)
  • Piironen & Vehtari (2017a, 2017b)
  • Spike-and-Slab Prior


\[ \begin{align*} \alpha_{ij} \mid \lambda_{ij}, \tau &\sim \text{Normal}(0, \tau^2 \lambda_{ij}^2) \\ \lambda_{ij} &\sim \text{Bernoulli}(\theta) \end{align*} \]

  • \(\tau\) shrinks globally
  • \(\lambda_{ij}\) shrinks individual coefficients to 0 or not
  • Incredibly slow: explores all \(2^{p^2}\) models (both priors are contrasted in the sketch below)


  • George & McCulloch (1993)
  • Koop (2013)
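To see the qualitative difference between the two priors, one can simply draw coefficients from each; a self-contained sketch with arbitrary illustration values for \(\tau\) and \(\theta\):

```python
import numpy as np

rng = np.random.default_rng(0)
m, tau, theta = 100_000, 0.1, 0.5

# Horseshoe: half-Cauchy local scales shrink continuously, never exactly to zero
lam_hs = np.abs(rng.standard_cauchy(m))
alpha_hs = rng.normal(0.0, tau * lam_hs)

# Spike-and-slab: Bernoulli indicators set coefficients to exactly zero or not
lam_ss = rng.binomial(1, theta, m)
alpha_ss = lam_ss * rng.normal(0.0, tau, m)

print((alpha_hs == 0).mean(), (alpha_ss == 0).mean())   # 0.0 vs. ~(1 - theta)
```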


More Continuous Shrinkage

  • Other continuous shrinkage priors
    • Minnesota prior (Litterman, 1986)
    • Normal-Gamma prior (Huber & Feldkircher, 2017)
    • Regularized Horseshoe (Piironen & Vehtari, 2017b)


  • Very fast
  • However, do not engage in structure learning
    • No posterior over all graphs
    • Structure by (arbitrary) thresholding

Full Spike-and-Slab Model

\[ \begin{align*} \alpha_{ij} \mid \lambda_{ij}, \tau_{ij} &\sim \lambda_{ij} \, \text{Normal}(0, \tau_{ij}^2) + (1 - \lambda_{ij}) \, \delta_0 \\[0.5em] \lambda_{ij} &\sim \text{Bernoulli}(\theta) \\[0.5em] \theta &\sim \text{Beta}(1, 1) \\[0.5em] \tau_{ij}^2 &\sim \text{Inverse-Gamma}(1/2, s/2) \\[0.5em] \Sigma &\sim \text{Inverse-Wishart}(S, \nu) \end{align*} \]

  • where \(s = 1/2\), \(S = I\), and \(\nu = p\)
  • Benefits:
    • Yields posterior over all possible graphs (structure learning)
    • Yields graded evidence for or against \(\mathcal{H}_0: \alpha_{ij} = 0\) (via inclusion Bayes factors; see the sketch below)
    • Model-averages (Hoeting et al., 1999): \(p(\alpha_{ij} \mid y) = \sum_{k=1}^{2^{p^2}} p(\alpha_{ij} \mid y, \mathcal{M}_k) \cdot p(\mathcal{M}_k \mid y) \hspace{1em} \text{vs.} \hspace{1em} p(\alpha_{ij} \mid y, \mathcal{M})\)
  • Downsides:
    • Computationally very expensive
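The graded evidence can be read off the posterior inclusion probability of an edge: the inclusion Bayes factor is the ratio of posterior to prior inclusion odds (marginally, \(\theta \sim \text{Beta}(1, 1)\) implies a prior inclusion probability of 0.5). A minimal sketch with a hypothetical function name:

```python
def inclusion_bf(post_incl, prior_incl=0.5):
    """Inclusion Bayes factor for H1: alpha_ij != 0 versus H0: alpha_ij = 0,
    i.e., the ratio of posterior to prior inclusion odds."""
    post_odds = post_incl / (1.0 - post_incl)
    prior_odds = prior_incl / (1.0 - prior_incl)
    return post_odds / prior_odds

# e.g., an edge included in 80% of the posterior samples of lambda_ij
print(inclusion_bf(0.8))   # 4.0: the data favor inclusion 4 to 1
```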

Simulation setup I

  • Estimation error:

\[ \text{RMSE}_{\text{est}} := \sqrt{\frac{1}{p^2} \sum_{i=1}^p \sum_{j=1}^p \left(A_{i, j} - \hat{A}_{i, j}\right)^2} \]

  • Prediction error:

\[ \text{RMSE}_{\text{pred}} := \sqrt{\frac{1}{200} \sum_{i=1}^{200} \frac{1}{p} \sum_{j=1}^p \left(Y_{i, j} - \hat{Y}_{i, j}\right)^2} \]


  • Sensitivity and specificity for structure recovery (all three metrics are sketched below)
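These metrics translate directly into numpy; reading off edges as coefficients whose absolute value exceeds a small tolerance is an implementation choice of this sketch:

```python
import numpy as np

def rmse_est(A, A_hat):
    # Root mean squared error over all p^2 coefficients
    return np.sqrt(np.mean((A - A_hat) ** 2))

def rmse_pred(Y, Y_hat):
    # Root mean squared one-step-ahead prediction error on held-out points
    return np.sqrt(np.mean((Y - Y_hat) ** 2))

def sensitivity_specificity(A, A_hat, tol=1e-8):
    # Sensitivity: proportion of true edges recovered;
    # specificity: proportion of absent edges correctly excluded
    truth, est = np.abs(A) > tol, np.abs(A_hat) > tol
    sens = (truth & est).sum() / max(truth.sum(), 1)
    spec = (~truth & ~est).sum() / max((~truth).sum(), 1)
    return sens, spec
```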

Simulation setup II

  • For each Bayesian VAR model, compute the metrics over 10 replications for all combinations of the following factors (see the generator sketch below):
    • number of variables: \(p = \{5, 10, 15\}\)
    • number of observations: \(n = \{50, 100, 200, 400\}\)
    • network density: \(d = \{0, 0.2, 0.4, 0.6, 0.8, 1\}\)
    • size of the average absolute off-diagonal elements: \(s = \{0.1, 0.2, 0.3\}\)
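A hypothetical generator for such transition matrices; the rescaling loop that enforces stationarity is a simple device of this sketch, not necessarily what was used in the simulations.

```python
import numpy as np

def make_transition_matrix(p, density, size, diag=0.3, rng=None):
    """Draw a VAR(1) matrix whose off-diagonal elements are nonzero with
    probability `density` and have absolute value `size`."""
    rng = rng or np.random.default_rng()
    A = np.zeros((p, p))
    off_diag = ~np.eye(p, dtype=bool)
    active = off_diag & (rng.random((p, p)) < density)
    A[active] = rng.choice([-size, size], size=active.sum())
    np.fill_diagonal(A, diag)
    while np.max(np.abs(np.linalg.eigvals(A))) >= 1:    # enforce stationarity
        A *= 0.95
    return A

A = make_transition_matrix(p=10, density=0.4, size=0.2)
```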

Simulation results I

Simulation results II

Simulation results III

Data example

  • Re-analyze the MindMaastricht data (Geschwind et al., 2011; Bringmann et al., 2013)
    • ESM study of \(N = 129\) people with residual depressive symptoms
    • 10 measurements per day; 6 days baseline and 6 days after treatment
    • Average time between measurements was 90 minutes
    • Focus on 6 items measured on a 7-point Likert scale during baseline (e.g., "I feel relaxed")


  • Estimate individual network
  • What can a fully probabilistic perspective give us?

Impulse response function

  • How is a variable at time point \(t + s\) influenced by an impulse to another variable at time point \(t\)? (see the sketch below)
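For a VAR(1) with transition matrix \(A\), the response at horizon \(s\) to an impulse \(\mathbf{e}\) at time \(t\) is \(A^s \mathbf{e}\) (orthogonalized IRFs would additionally factor \(\Sigma\), e.g., via its Cholesky decomposition). A minimal sketch:

```python
import numpy as np

def impulse_response(A, shock, horizon=10):
    """Responses of all variables at t, t + 1, ..., t + horizon to `shock` at t."""
    responses = [np.asarray(shock, dtype=float)]
    for _ in range(horizon):
        responses.append(A @ responses[-1])
    return np.array(responses)

A = np.array([[0.5, 0.1, 0.0],
              [0.2, 0.4, 0.1],
              [0.0, 0.3, 0.4]])
irf = impulse_response(A, shock=[1.0, 0.0, 0.0])   # unit impulse to variable 1
```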

Extensions

  • Many things discussed in the economics literature
    • High-dimensional models
    • Modeling volatility

  • Extensions that are interesting to me
    • Hierarchical setting
    • Modeling non-stationarity
    • Modeling multicollinearity
    • Modeling more than one stable state
    • Modeling non-Gaussian data (e.g., using copulas)
    • Structured spike-and-slab priors to allow for clustering in graphs
    • Study the mathematical properties of the inclusion Bayes factors discussed here

Wrap-up

  • Claims I've made:
    • Comparing AR vs. VAR models is insufficiently granular
    • Probabilistic perspective on VAR modeling provides this granularity
    • In particular, I have argued for discrete over continuous shrinkage
      • Quantifies uncertainty over all \(2^{p^2}\) graph structures
      • Allows stating graded evidence for or against \(\mathcal{H}_0: \alpha_{ij} = 0\)


  • Preliminary results:
    • All Bayesian VAR models perform similarly well w.r.t. estimation / prediction error
    • Spike-and-slab might have a slight edge in structure recoverability
    • Found large participant heterogeneity while re-analyzing data

Thank you!

References

  • Luce, R. D. (1997). Several unresolved conceptual problems of mathematical psychology. Journal of Mathematical Psychology, 41(1), 79-87.
  • Molenaar, P. C. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement, 2(4), 201-218.
  • Borsboom, D. (2017). A network theory of mental disorders. World Psychiatry, 16(1), 5-13.
  • Bulteel, K., Mestdagh, M., Tuerlinckx, F., & Ceulemans, E. (2018). VAR(1) based models do not always outpredict AR(1) models in typical psychological applications. Psychological Methods, 23(4), 1–17.
  • Dablander, F., Ryan, O., & Haslbeck, J. (under review). Choosing between AR(1) and VAR(1) models in typical psychological applications.
  • Dablander, F., & Hinne, M. (in revision). Centrality measures as a proxy for causal influence? A cautionary tale.
  • Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.
  • Van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89, 31-50.

References

  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009, April). Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics (pp. 73-80).
  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465-480.
  • Piironen, J., & Vehtari, A. (2017a). On the hyperprior choice for the global shrinkage parameter in the horseshoe prior. In Artificial Intelligence and Statistics (pp. 905-913).
  • Piironen, J., & Vehtari, A. (2017b). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.
  • George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881-889.
  • Koop, G. M. (2013). Forecasting with medium and large Bayesian VARs. Journal of Applied Econometrics, 28(2), 177-203.
  • Litterman, R. B. (1986). Forecasting with Bayesian vector autoregressions—five years of experience. Journal of Business & Economic Statistics, 4(1), 25-38.
  • Huber, F., & Feldkircher, M. (2017). Adaptive shrinkage in Bayesian vector autoregressive models. Journal of Business & Economic Statistics, 1-13.

References

  • Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4), 382-401.
  • Geschwind, N., Peeters, F., Drukker, M., van Os, J., & Wichers, M. (2011). Mindfulness training increases momentary positive emotions and reward experience in adults vulnerable to depression: A randomized controlled trial. Journal of Consulting and Clinical Psychology, 79(5), 618.
  • Bringmann, L. F., Vissers, N., Wichers, M., Geschwind, N., Kuppens, P., Peeters, F., … & Tuerlinckx, F. (2013). A network approach to psychopathology: New insights into clinical longitudinal data. PloS ONE, 8(4), e60188.