
Bayesian Vector Autoregression

For Structure and Parameter Learning

Fabian Dablander, University of Amsterdam

22nd February, 2019

Context

  • Most psychological theories are within-subject, but measurements are between-subject (Luce, 1997)
  • There is no simple mapping between the two (Molenaar, 2004)

  • Ubiquity of smartphones enables collecting intensive longitudinal data (Miller, 2012)
    • Understanding humans as dynamical systems
    • Goes hand in hand with the network paradigm (Borsboom, 2017)

Vector Autoregression

VAR Model

  • We observe some realization of the (discrete-time) p-dimensional stochastic process

$$\{ Y(t) : t \in T \}$$

  • with state space $Y(t) \in \mathbb{R}^p$
  • Vector autoregressive (VAR) model assumes

$$y_t = \nu + A_1 y_{t-1} + \ldots + A_l y_{t-l} + \epsilon_t$$

  • $\nu \in \mathbb{R}^p$ is the intercept
  • $A_l \in \mathbb{R}^{p \times p}$ describes the coefficients at lag $l$
  • $\epsilon_t$ are the stochastic innovations at time $t$, for which

$$\epsilon_t \sim \mathcal{N}(0, \Sigma) \, , \quad \mathbb{E}[\epsilon_t \epsilon_{t-1}^T] = 0$$

  • VAR process is covariance-stationary, i.e. its first and second moments are time-invariant
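As a concrete illustration, a VAR(1) process can be simulated directly from the model equation; the values of $\nu$, $A$, and $\Sigma$ below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2019)

# Hypothetical parameters: p = 3 variables, one lag.
p, T = 3, 500
nu = np.zeros(p)                         # intercept
A = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.4, 0.2],
              [0.1, 0.0, 0.3]])          # lag-1 coefficient matrix
Sigma = 0.1 * np.eye(p)                  # innovation covariance

# Covariance-stationarity holds when all eigenvalues of A lie
# inside the unit circle.
assert np.max(np.abs(np.linalg.eigvals(A))) < 1

y = np.zeros((T, p))
for t in range(1, T):
    eps = rng.multivariate_normal(np.zeros(p), Sigma)
    y[t] = nu + A @ y[t - 1] + eps
```
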

AR Model

  • Lag $l = 1$ VAR model requires estimation of $A \in \mathbb{R}^{p \times p}$
  • This may lead to high variance of the estimates when the data are poor
  • Reduce variance by introducing bias: set all off-diagonal elements to zero
  • This saves estimating $p \times (p - 1)$ parameters

$$A_{\text{VAR}} = \begin{pmatrix}
\alpha_{11} & \alpha_{12} & \alpha_{13} & \alpha_{14} & \alpha_{15} & \alpha_{16} \\
\alpha_{21} & \alpha_{22} & \alpha_{23} & \alpha_{24} & \alpha_{25} & \alpha_{26} \\
\alpha_{31} & \alpha_{32} & \alpha_{33} & \alpha_{34} & \alpha_{35} & \alpha_{36} \\
\alpha_{41} & \alpha_{42} & \alpha_{43} & \alpha_{44} & \alpha_{45} & \alpha_{46} \\
\alpha_{51} & \alpha_{52} & \alpha_{53} & \alpha_{54} & \alpha_{55} & \alpha_{56} \\
\alpha_{61} & \alpha_{62} & \alpha_{63} & \alpha_{64} & \alpha_{65} & \alpha_{66}
\end{pmatrix}$$

$$A_{\text{AR}} = \begin{pmatrix}
\alpha_{11} & 0 & 0 & 0 & 0 & 0 \\
0 & \alpha_{22} & 0 & 0 & 0 & 0 \\
0 & 0 & \alpha_{33} & 0 & 0 & 0 \\
0 & 0 & 0 & \alpha_{44} & 0 & 0 \\
0 & 0 & 0 & 0 & \alpha_{55} & 0 \\
0 & 0 & 0 & 0 & 0 & \alpha_{66}
\end{pmatrix}$$
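In code, the AR restriction amounts to keeping only the diagonal of the VAR coefficient matrix (the values below are hypothetical):

```python
import numpy as np

# Hypothetical 3 x 3 VAR(1) coefficient matrix.
A_var = np.array([[0.5, 0.1, 0.2],
                  [0.0, 0.4, 0.1],
                  [0.3, 0.0, 0.3]])

# The AR restriction zeroes the p * (p - 1) off-diagonal elements,
# leaving only p autoregressive parameters to estimate.
A_ar = np.diag(np.diag(A_var))
```
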

Bias-Variance Trade-off

  • Bulteel et al. (2018) compared performance of (mixed) AR and (mixed) VAR models
    • Find equal predictive performance on 3 data sets
    • "[…] it is not meaningful to analyze the presented typical applications with a VAR model" (p. 14)
  • We re-evaluate this claim (Dablander, Ryan, & Haslbeck, under review)

Bayesian VAR

Proper model selection

  • Comparing AR and VAR for substantive reasons is of little value
    • Better: Test the VAR coefficients individually
    • Best: Quantify uncertainty over all $2^{p^2}$ graph structures


  • However, Bulteel et al. (2018) have a point
    • VAR model is a fairly complex model
    • Amount and quality of data in psychological applications is rather poor
    • This can lead to overconfident conclusions


  • Thus, we need regularization!

Reasonable priors ⇒ Regularization



Regularizing regression

  • LASSO estimate = posterior mode under a Laplace prior (Park & Casella, 2008)
  • Ridge estimate = posterior mode under a Gaussian prior
  • For an overview, see Van Erp, Oberski, & Mulder (2019)


  • VAR model as a system of seemingly unrelated regressions (SUR)

$$\begin{aligned}
y_{1t} &= \nu_1 + a_1^T y_{t-1} + \epsilon_{1t} \\
y_{2t} &= \nu_2 + a_2^T y_{t-1} + \epsilon_{2t} \\
&\vdots \\
y_{pt} &= \nu_p + a_p^T y_{t-1} + \epsilon_{pt}
\end{aligned}$$

  • where $\mathbb{E}[\epsilon_{it} \epsilon_{jt}] \neq 0$
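Since every equation in the SUR system shares the same regressors $y_{t-1}$, the coefficients can be estimated equation by equation with least squares; a minimal numpy sketch on simulated data (parameter values assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
p, T = 3, 1000
A = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.4, 0.2],
              [0.1, 0.0, 0.3]])
y = np.zeros((T, p))
for t in range(1, T):
    y[t] = A @ y[t - 1] + rng.normal(scale=0.1, size=p)

# Design matrix: a column of ones for the intercept plus lagged values.
X = np.column_stack([np.ones(T - 1), y[:-1]])
Y = y[1:]

# Because every equation shares the same regressors, equation-by-equation
# least squares coincides with the joint estimate of the SUR system.
coefs, *_ = np.linalg.lstsq(X, Y, rcond=None)
nu_hat, A_hat = coefs[0], coefs[1:].T   # row i of A_hat is a_i
```
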

Continuous vs. Discrete Shrinkage

  • Horseshoe prior


$$\begin{aligned}
\alpha_{ij} \mid \lambda_{ij}, \tau &\sim \text{Normal}(0, \tau^2 \lambda_{ij}^2) \\
\lambda_{ij} &\sim \text{Cauchy}^+(0, 1)
\end{aligned}$$


  • $\tau$ shrinks globally
  • $\lambda_{ij}$ shrinks individual coefficients continuously
  • Incredibly fast


  • Carvalho, Polson, & Scott (2009, 2010)
  • Piironen & Vehtari (2017a, 2017b)
  • Spike-and-Slab Prior


$$\begin{aligned}
\alpha_{ij} \mid \lambda_{ij}, \tau &\sim \text{Normal}(0, \tau^2 \lambda_{ij}^2) \\
\lambda_{ij} &\sim \text{Bernoulli}(\theta)
\end{aligned}$$

  • $\tau$ shrinks globally
  • $\lambda_{ij}$ shrinks individual coefficients to 0 or not
  • Incredibly slow: learns all $2^{p^2}$ models


  • George & McCulloch (1993)
  • Koop (2013)
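The qualitative difference between the two priors is easy to see by sampling from them: horseshoe draws are continuously shrunk but never exactly zero, while spike-and-slab draws are exactly zero with probability $1 - \theta$. A quick sketch ($\tau = 1$ and $\theta = 0.5$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau = 100_000, 1.0

# Horseshoe: continuous local scales lambda_ij from a half-Cauchy.
lam_hs = np.abs(rng.standard_cauchy(n))
alpha_hs = rng.normal(0.0, tau * lam_hs)

# Spike-and-slab: binary local scales; lambda_ij = 0 puts the
# coefficient exactly at zero.
theta = 0.5
lam_ss = rng.binomial(1, theta, size=n).astype(float)
alpha_ss = rng.normal(0.0, tau * lam_ss)
```

Both priors shrink globally through $\tau$; only the discrete version puts positive mass exactly on $\alpha_{ij} = 0$, which is what licenses posterior inclusion probabilities.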


More Continuous Shrinkage

  • Other continuous shrinkage priors
    • Minnesota prior (Litterman, 1986)
    • Normal-Gamma prior (Huber & Feldkircher, 2017)
    • Regularized Horseshoe (Piironen & Vehtari, 2017b)


  • Very fast
  • However, do not engage in structure learning
    • No posterior over all graphs
    • Structure by (arbitrary) thresholding

Full Spike-and-Slab Model

$$\begin{aligned}
\alpha_{ij} \mid \lambda_{ij}, \tau_{ij} &\sim \lambda_{ij} \, \text{Normal}(0, \tau_{ij}^2) + (1 - \lambda_{ij}) \, \delta_0 \\
\lambda_{ij} &\sim \text{Bern}(\theta) \\
\theta &\sim \text{Beta}(1, 1) \\
\tau_{ij}^2 &\sim \text{Inverse-Gamma}(1/2, s/2) \\
\Sigma &\sim \text{Inverse-Wishart}(S, \nu)
\end{aligned}$$

  • where $s = 1/2$, $S = I$, and $\nu = p$
  • Benefits:
    • Yields posterior over all possible graphs (structure learning)
    • Yields graded evidence for or against $\mathcal{H}_0: \alpha_{ij} = 0$
    • Model-averages: $$p(\alpha_{ij} \mid y) = \sum_{k=1}^{2^{p^2}} p(\alpha_{ij} \mid y, \mathcal{M}_k) \, p(\mathcal{M}_k \mid y) \quad \text{vs.} \quad p(\alpha_{ij} \mid y, \mathcal{M})$$
  • Downsides:
    • Computationally very expensive
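A toy version of the inclusion Bayes factor and model-averaging idea, for a single coefficient in a one-predictor regression with known $\sigma$ and $\tau$ (all values hypothetical; the closed-form marginal likelihoods follow from this conjugate Gaussian setup, not from the full spike-and-slab model above):

```python
import numpy as np

rng = np.random.default_rng(3)

def log_marginal(y, cov):
    """Log density of y under a zero-mean multivariate normal."""
    n = len(y)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * np.log(2 * np.pi) + logdet
                   + y @ np.linalg.solve(cov, y))

# Toy data: one predictor, known noise scale sigma and slab scale tau.
n, sigma, tau, alpha_true = 50, 1.0, 1.0, 1.0
x = rng.normal(size=n)
y = alpha_true * x + rng.normal(scale=sigma, size=n)

# M0: alpha = 0            ->  y ~ N(0, sigma^2 I)
# M1: alpha ~ N(0, tau^2)  ->  y ~ N(0, sigma^2 I + tau^2 x x^T)
log_m0 = log_marginal(y, sigma**2 * np.eye(n))
log_m1 = log_marginal(y, sigma**2 * np.eye(n) + tau**2 * np.outer(x, x))

# Inclusion Bayes factor and posterior inclusion probability (prior odds 1).
log_bf10 = log_m1 - log_m0
post_incl = 1.0 / (1.0 + np.exp(-log_bf10))
```

With the spike-and-slab prior the same logic runs over all $2^{p^2}$ inclusion patterns at once, which is exactly what makes exhaustive enumeration infeasible and Gibbs sampling necessary.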

Simulation setup I

  • Estimation error:

$$\text{RMSE}_{\text{est}} := \sqrt{\frac{1}{p^2} \sum_{i=1}^{p} \sum_{j=1}^{p} \left(A_{i,j} - \hat{A}_{i,j}\right)^2}$$

  • Prediction error:

$$\text{RMSE}_{\text{pred}} := \sqrt{\frac{1}{200} \sum_{i=1}^{200} \frac{1}{p} \sum_{j=1}^{p} \left(Y_{i,j} - \hat{Y}_{i,j}\right)^2}$$


  • Sensitivity, Specificity
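These metrics are straightforward to compute given the true and estimated transition matrices; a small sketch (the tolerance for calling a coefficient "nonzero" is an assumption):

```python
import numpy as np

def rmse(A_true, A_hat):
    """Estimation error averaged over all p^2 coefficients."""
    return np.sqrt(np.mean((A_true - A_hat) ** 2))

def sens_spec(A_true, A_hat, tol=1e-8):
    """Structure recovery: a coefficient counts as an edge if |a| > tol."""
    truth = np.abs(A_true) > tol
    est = np.abs(A_hat) > tol
    sens = (truth & est).sum() / max(truth.sum(), 1)
    spec = (~truth & ~est).sum() / max((~truth).sum(), 1)
    return sens, spec

# Tiny made-up example.
A_true = np.array([[0.5, 0.0],
                   [0.2, 0.3]])
A_hat = np.array([[0.4, 0.1],
                  [0.0, 0.3]])
```

Here sensitivity is 2/3 (one true edge was missed) and specificity is 0 (the single true zero was estimated as nonzero).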

Simulation setup II

  • For each Bayesian VAR model, compute the metrics across $t = 10$ replications for all combinations of
    • number of variables: $p \in \{5, 10, 15\}$
    • number of observations: $n \in \{50, 100, 200, 400\}$
    • network density: $d \in \{0, 0.2, 0.4, 0.6, 0.8, 1\}$
    • size of the average absolute off-diagonal elements: $s \in \{0.1, 0.2, 0.3\}$
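One way this grid might be instantiated: draw transition matrices whose off-diagonal elements are present with probability $d$ and set to $\pm s$, rescaling until the process is stationary (the diagonal range and sign scheme below are assumptions, not the paper's exact procedure):

```python
import itertools
import numpy as np

rng = np.random.default_rng(42)

def random_var_matrix(p, density, size):
    """Transition matrix with AR diagonal and off-diagonals of absolute
    size `size`, each present with probability `density`."""
    A = np.diag(rng.uniform(0.1, 0.4, size=p))        # assumed diagonal range
    mask = rng.random((p, p)) < density
    np.fill_diagonal(mask, False)
    signs = rng.choice([-1.0, 1.0], size=(p, p))      # assumed random signs
    A[mask] = (signs * size)[mask]
    while np.max(np.abs(np.linalg.eigvals(A))) >= 1:  # enforce stationarity
        A *= 0.95
    return A

# The full 3 x 4 x 6 x 3 = 216 simulation conditions.
grid = list(itertools.product([5, 10, 15],                   # p
                              [50, 100, 200, 400],           # n
                              [0, 0.2, 0.4, 0.6, 0.8, 1.0],  # d
                              [0.1, 0.2, 0.3]))              # s
```
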

Simulation results I

Simulation results II

Simulation results III

Data example

  • Re-analyze the MindMaastricht data (Geschwind et al., 2011; Bringmann et al., 2013)
    • ESM study of $N = 129$ people with residual depressive symptoms
    • 10 measurements per day; 6 days baseline and 6 days after treatment
    • Average time between measurements was 90 minutes
    • Focus on 6 items measured on a 7-point Likert scale during baseline (e.g., "I feel relaxed")


  • Estimate individual network
  • What can a fully probabilistic perspective give us?

Impulse response function

  • How is a variable at time point $t + s$ influenced by another one at time point $t$?
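For a VAR(1) process, the impulse response at horizon $s$ is the matrix power $A^s$: entry $(i, j)$ gives the effect on variable $i$ of a unit shock to variable $j$, $s$ steps earlier. A minimal sketch (the coefficient matrix is hypothetical; orthogonalized responses would additionally use the Cholesky factor of $\Sigma$):

```python
import numpy as np

# Hypothetical lag-1 coefficient matrix of a stationary VAR(1).
A = np.array([[0.5, 0.1, 0.0],
              [0.0, 0.4, 0.2],
              [0.1, 0.0, 0.3]])

def irf(A, horizon):
    """Impulse responses of a VAR(1): entry (i, j) of A^s is the effect on
    variable i at time t + s of a unit shock to variable j at time t."""
    return [np.linalg.matrix_power(A, s) for s in range(horizon + 1)]

responses = irf(A, horizon=10)
```

Because the process is stationary, the responses decay toward zero as $s$ grows.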

Extensions

  • Many things discussed in the economics literature
    • High-dimensional models
    • Modeling volatility

  • Extensions that are interesting to me
    • Hierarchical setting
    • Modeling non-stationarity
    • Modeling multicollinearity
    • Modeling more than one stable state
    • Modeling non-Gaussian data (e.g., using copulas)
    • Structured spike-and-slab priors to allow for clustering in graphs
    • Study the mathematical properties of the inclusion Bayes factors discussed here

Wrap-up

  • Claims I've made:
    • Comparing AR vs. VAR models is insufficiently granular
    • Probabilistic perspective on VAR modeling provides this granularity
    • In particular, I have argued for discrete over continuous shrinkage
      • Quantifies uncertainty over all $2^{p^2}$ graph structures
      • Allows stating graded evidence for or against $\mathcal{H}_0: \alpha_{ij} = 0$


  • Preliminary results:
    • All Bayesian VAR models perform similarly well w.r.t. estimation / prediction error
    • Spike-and-slab might have a slight edge in structure recoverability
    • Found large participant heterogeneity while re-analyzing data

Thank you!

References

  • Luce, R. D. (1997). Several unresolved conceptual problems of mathematical psychology. Journal of Mathematical Psychology, 41(1), 79-87.
  • Molenaar, P. C. (2004). A manifesto on psychology as idiographic science: Bringing the person back into scientific psychology, this time forever. Measurement, 2(4), 201-218.
  • Borsboom, D. (2017). A network theory of mental disorders. World Psychiatry, 16(1), 5-13.
  • Bulteel, K., Mestdagh, M., Tuerlinckx, F., & Ceulemans, E. (2018). VAR(1) based models do not always outpredict AR(1) models in typical psychological applications. Psychological Methods, 23(4), 1-17.
  • Dablander, F., Ryan, O., & Haslbeck, J. (under review). Choosing between AR(1) and VAR(1) models in typical psychological applications.
  • Dablander, F., & Hinne, M. (in revision). Centrality measures as a proxy for causal influence? A cautionary tale.
  • Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.
  • Van Erp, S., Oberski, D. L., & Mulder, J. (2019). Shrinkage priors for Bayesian penalized regression. Journal of Mathematical Psychology, 89, 31-50.

References

  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009, April). Handling sparsity via the horseshoe. In Artificial Intelligence and Statistics (pp. 73-80).
  • Carvalho, C. M., Polson, N. G., & Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika, 97(2), 465-480.
  • Piironen, J., & Vehtari, A. (2017a). On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior. In Artificial Intelligence and Statistics (pp. 905-913).
  • Piironen, J., & Vehtari, A. (2017b). Sparsity information and regularization in the horseshoe and other shrinkage priors. Electronic Journal of Statistics, 11(2), 5018-5051.
  • George, E. I., & McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423), 881-889.
  • Koop, G. M. (2013). Forecasting with medium and large Bayesian VARs. Journal of Applied Econometrics, 28(2), 177-203.
  • Litterman, R. B. (1986). Forecasting with Bayesian vector autoregressions—five years of experience. Journal of Business & Economic Statistics, 4(1), 25-38.
  • Huber, F., & Feldkircher, M. (2017). Adaptive shrinkage in Bayesian vector autoregressive models. Journal of Business & Economic Statistics, 1-13.

References

  • Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 382-401.
  • Geschwind, N., Peeters, F., Drukker, M., van Os, J., & Wichers, M. (2011). Mindfulness training increases momentary positive emotions and reward experience in adults vulnerable to depression: A randomized controlled trial. Journal of Consulting and Clinical Psychology, 79(5), 618.
  • Bringmann, L. F., Vissers, N., Wichers, M., Geschwind, N., Kuppens, P., Peeters, F., … & Tuerlinckx, F. (2013). A network approach to psychopathology: New insights into clinical longitudinal data. PloS ONE, 8(4), e60188.