In a previous blog post, we looked at the history of least squares, how Gauss justified it using the Gaussian distribution, and how Laplace justified the Gaussian distribution using the central limit theorem. The Gaussian distribution has a number of special properties which distinguish it from other distributions and which make it easy to work with mathematically. In this blog post, I will focus on two of these properties: being closed under (a) marginalization and (b) conditioning. This means that, if one starts with a $p$-dimensional Gaussian distribution and marginalizes out or conditions on one or more of its components, the resulting distribution will still be Gaussian.
This blog post has two parts. First, I will introduce the joint, marginal, and conditional Gaussian distributions for the case of two random variables; an interactive Shiny app illustrates the differences between them. Second, I will show mathematically that the marginal and conditional distributions do indeed have the forms presented in the first part. I will extend this to the $p$-dimensional case, demonstrating that the Gaussian distribution is closed under marginalization and conditioning. This second part is a little heavier on the mathematics, so if you just want to get an intuition you may focus on the first part and simply skip the second part. Let’s get started!
In the linear regression case discussed previously, we have modeled each individual data point $y_i$ as coming from a univariate conditional Gaussian distribution with mean $\mu = x_i^T b$ and variance $\sigma^2$. In this blog post, we introduce the random variables $X_1$ and $X_2$ and assume that both are jointly normally distributed; we are going from $p = 1$ to $p = 2$ dimensions. The probability density function changes accordingly: it becomes a function mapping from two dimensions to one, i.e., $f: \mathbb{R}^2 \to \mathbb{R}^+$.
To simplify notation, let $x = (x_1, x_2)^T$ and $\mu = (\mu_1, \mu_2)^T$ be two 2-dimensional vectors denoting one observation and the population means, respectively. For simplicity, we set the population means to zero, i.e., $\mu = (0, 0)^T$. In one dimension, we had just one parameter for the variance, $\sigma^2$; in two dimensions, this becomes a symmetric $2 \times 2$ covariance matrix
$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$
where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the random variables $X_1$ and $X_2$, respectively, and $\rho$ is the population correlation between the two. The general form of the density function of a $p$-dimensional Gaussian distribution is
$$f(x \mid \mu, \Sigma) = (2\pi)^{-p/2}\,|\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right),$$
where $x$ and $\mu$ are $p$-dimensional vectors, $\Sigma^{-1}$ is the $(p \times p)$-dimensional inverse covariance matrix, and $|\Sigma|$ is its determinant.1 We focus on the simpler 2-dimensional, zero-mean case. Observe that
$$\Sigma^{-1} = \frac{1}{|\Sigma|}\begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix} = \frac{1}{\sigma_1^2\sigma_2^2(1 - \rho^2)}\begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix},$$
which we use to expand the bivariate Gaussian density function:
$$
\begin{aligned}
f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) &= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^T \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\begin{pmatrix} x_1\sigma_2^2 - x_2\rho\sigma_1\sigma_2 \\ x_2\sigma_1^2 - x_1\rho\sigma_1\sigma_2 \end{pmatrix}^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2x_2^2\right]\right).
\end{aligned}
$$
The figure below plots the contour lines of nine different bivariate normal distributions with mean zero, correlations $\rho \in \{0, -0.3, 0.7\}$, and standard deviations $\sigma_1, \sigma_2 \in \{1, 2\}$.
In the top row, all bivariate Gaussian distributions have $\rho = 0$ and look like a circle when the standard deviations are equal. The top middle plot is stretched along $X_2$, giving it an elliptical shape. The middle and bottom rows show how the distribution changes for negative ($\rho = -0.3$) and positive ($\rho = 0.7$) correlations.2
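As a quick sanity check of the expansion above (not part of the original derivation), we can evaluate the hand-coded bivariate density and compare it with the general multivariate formula. The snippet below is a minimal sketch in R; it assumes the mvtnorm package is available, and the parameter values are arbitrary.

```r
# A quick sanity check of the expanded density (not part of the derivation):
# compare the hand-coded bivariate formula with mvtnorm's general density.
library(mvtnorm)

s1 <- 1; s2 <- 2; rho <- 0.7   # arbitrary parameter values
Sigma <- matrix(c(s1^2,          rho * s1 * s2,
                  rho * s1 * s2, s2^2), nrow = 2)

# the expanded zero-mean bivariate density from the text
f_biv <- function(x1, x2) {
  Z <- 2 * pi * sqrt(s1^2 * s2^2 * (1 - rho^2))
  Q <- (s2^2 * x1^2 - 2 * rho * s1 * s2 * x1 * x2 + s1^2 * x2^2) /
       (s1^2 * s2^2 * (1 - rho^2))
  exp(-Q / 2) / Z
}

x <- c(0.5, -1.2)
f_biv(x[1], x[2])                          # hand-coded expansion
dmvnorm(x, mean = c(0, 0), sigma = Sigma)  # general formula; same value
```

Both calls should return the same value, up to floating-point error.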
In the remainder of this blog post, we will take a closer look at two operations: marginalization and conditioning. Marginalizing means ignoring, and conditioning means incorporating information. In the zero-mean bivariate case, marginalizing out $X_2$ results in
$$f(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{1}{2\sigma_1^2}x_1^2\right),$$
which is a simple univariate Gaussian distribution with mean $0$ and variance $\sigma_1^2$. On the other hand, incorporating the information that $X_2 = x_2$ results in
$$f(x_1 \mid x_2) = \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)}\left(x_1 - \rho\frac{\sigma_1}{\sigma_2}x_2\right)^2\right),$$
which has mean $\rho\frac{\sigma_1}{\sigma_2}x_2$ and variance $\sigma_1^2(1 - \rho^2)$. The next section provides two simple examples illustrating the difference between these two types of distributions, as well as a simple Shiny app that allows you to build an intuition for conditioning in the bivariate case.
Let’s illustrate the difference between marginalization and conditioning with two simple examples. First, assume that the correlation is very high, $\rho = 0.8$, and that $\sigma_1^2 = \sigma_2^2 = 1$. Then, observing for example $X_2 = 2$, our belief about $X_1$ changes such that its mean gets shifted towards the observed $x_2$ value, i.e., $\mu_1 = \rho\frac{\sigma_1}{\sigma_2}x_2 = 0.8 \cdot 2 = 1.6$ (indicated by the dotted line in the figure below). The variance of $X_1$ gets substantially reduced, from $1$ to $(1 - 0.8^2) = 0.36$. This is what the left part of the figure below illustrates. If, on the other hand, $\rho = 0$ such that $X_1$ and $X_2$ are not related, then observing $X_2 = 2$ changes neither the mean of $X_1$ (it stays at zero) nor its variance (it stays at $1$); see the right part of the figure below. Note that the marginal and conditional densities are multiplied by a constant to make them more visible.
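A small Monte Carlo sketch can make this concrete. The code below is illustrative only: it assumes the MASS package for multivariate normal sampling, and it "conditions" crudely by keeping draws whose $X_2$ value falls in a narrow band around 2.

```r
# Illustrative Monte Carlo check of the example above: condition on X2 being
# (approximately) 2 by keeping only draws in a narrow band around that value.
library(MASS)

set.seed(1)
rho   <- 0.8
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
draws <- mvrnorm(n = 1e6, mu = c(0, 0), Sigma = Sigma)

sel <- abs(draws[, 2] - 2) < 0.01  # crude stand-in for conditioning on X2 = 2
mean(draws[sel, 1])                # close to 0.8 * 2   = 1.60
var(draws[sel, 1])                 # close to 1 - 0.8^2 = 0.36
```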
To explore the relation between joint, marginal, and conditional Gaussian distributions, you can play around with a Shiny app by following this link. In the remainder of the blog post, we will prove that the two distributions given above are in fact the marginal and conditional distributions in the two-dimensional case. We will also generalize these results to $p$-dimensional Gaussian distributions.
In the second part of this blog post, we need the two fundamental ‘rules’ of probability: the sum and the product rule.3 The sum rule states that
$$p(x) = \int p(x, y) \, \mathrm{d}y,$$
and the product rule states that
$$p(x, y) = p(x \mid y)\,p(y) = p(y \mid x)\,p(x).$$
In the remainder, we will see that a joint Gaussian distribution can be factorized into a conditional Gaussian and a marginal Gaussian distribution.
The first property states that if we marginalize out variables in a multivariate Gaussian distribution, the result is still a Gaussian distribution. The Gaussian distribution is thus closed under marginalization. Below, I will show this for a bivariate Gaussian distribution directly, and for a Gaussian distribution of arbitrary dimension by thinking rather than computing. This illustrates that knowing your definitions can help avoid tedious calculations.
To show that the marginalization property holds for the bivariate Gaussian distribution, we need to solve the following integration problem:
$$\int_{X_2} f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) \, \mathrm{d}x_2,$$
and check whether the result is a univariate Gaussian distribution. We tackle the problem head on and expand
$$
\begin{aligned}
&\int_{X_2} \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2x_2^2\right]\right) \mathrm{d}x_2 \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\sigma_2^2x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_1^2x_2^2 - 2\rho\sigma_1\sigma_2 x_1x_2\right]\right) \mathrm{d}x_2 \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\sigma_2^2x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left[x_2^2 - 2\rho\frac{\sigma_2}{\sigma_1}x_1x_2\right]\right) \mathrm{d}x_2.
\end{aligned}
$$
Putting everything that does not involve $x_2$ outside the integral, we’ve come quite far! Note that we can “complete the square”, that is, write
$$x_2^2 - 2\rho\frac{\sigma_2}{\sigma_1}x_1x_2 = \left(x_2 - \rho\frac{\sigma_2}{\sigma_1}x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2}x_1^2.$$
This leads to
$$
\begin{aligned}
&\frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\sigma_2^2x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left[\left(x_2 - \rho\frac{\sigma_2}{\sigma_1}x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2}x_1^2\right]\right) \mathrm{d}x_2 \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\sigma_2^2x_1^2 + \frac{1}{2\sigma_2^2(1 - \rho^2)}\rho^2\frac{\sigma_2^2}{\sigma_1^2}x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left(x_2 - \rho\frac{\sigma_2}{\sigma_1}x_1\right)^2\right) \mathrm{d}x_2 \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)}x_1^2 + \frac{1}{2\sigma_1^2(1 - \rho^2)}\rho^2x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left(x_2 - \rho\frac{\sigma_2}{\sigma_1}x_1\right)^2\right) \mathrm{d}x_2 \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{x_1^2 - \rho^2x_1^2}{2\sigma_1^2(1 - \rho^2)}\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left(x_2 - \rho\frac{\sigma_2}{\sigma_1}x_1\right)^2\right) \mathrm{d}x_2 \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{x_1^2(1 - \rho^2)}{2\sigma_1^2(1 - \rho^2)}\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left(x_2 - \rho\frac{\sigma_2}{\sigma_1}x_1\right)^2\right) \mathrm{d}x_2 \\[.5em]
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2}x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)}\left(x_2 - \rho\frac{\sigma_2}{\sigma_1}x_1\right)^2\right) \mathrm{d}x_2.
\end{aligned}
$$
We are nearly done! What’s left is to realize that the integrand is the kernel of a univariate Gaussian distribution with mean $\rho\frac{\sigma_2}{\sigma_1}x_1$ and variance $\sigma_2^2(1 - \rho^2)$: it’s an unnormalized conditional Gaussian distribution! The thing that makes a Gaussian distribution integrate to 1, as all distributions must, is the normalizing constant in front, the strange term involving $\pi$. For this particular distribution, the normalizing constant is $\sqrt{2\pi\sigma_2^2(1 - \rho^2)}$.4
Continuing, we arrive at the solution
$$\frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2}x_1^2\right) \sqrt{2\pi\sigma_2^2(1 - \rho^2)} = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{1}{2\sigma_1^2}x_1^2\right),$$
which is the density function of the univariate Gaussian distribution (with mean zero). With some work, we have shown that marginalizing out a variable in a bivariate Gaussian distribution leads to a univariate Gaussian distribution. This process ‘removes’ any occurrences of the correlation $\rho$ and the other variable $x_2$.
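If you prefer a numerical check over algebra, you can integrate the joint density over $x_2$ with base R’s integrate() and compare the result to the univariate density above. A sketch, with arbitrary parameter values:

```r
# Numerical check of the marginalization result using base R only:
# integrating the joint density over x2 should give the N(0, sigma_1^2)
# density at any fixed x1.
s1 <- 1; s2 <- 2; rho <- 0.7

f_joint <- function(x1, x2) {
  Z <- 2 * pi * sqrt(s1^2 * s2^2 * (1 - rho^2))
  Q <- (s2^2 * x1^2 - 2 * rho * s1 * s2 * x1 * x2 + s1^2 * x2^2) /
       (s1^2 * s2^2 * (1 - rho^2))
  exp(-Q / 2) / Z
}

x1 <- 0.8
integrate(function(x2) f_joint(x1, x2), -Inf, Inf)$value  # marginal by integration
dnorm(x1, mean = 0, sd = s1)                              # marginal from the result above
```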
Granted, this process was rather tedious and not at all general (but good practice!) — does it also work when going from 3 to 2 dimensions? Will the remaining bivariate distribution be Gaussian? What if we go from 200 dimensions to 97 dimensions?
A more elegant way to see that a $p$-dimensional Gaussian distribution is closed under marginalization is the following. First, we note the requirement that a random variable needs to fulfill in order to have a (multivariate) Gaussian distribution.
Definition. $X = (X_1, \ldots, X_p)^T$ has a multivariate Gaussian distribution if every linear combination of its components has a (multivariate) Gaussian distribution. Formally,
$$X \sim \mathcal{N}(\mu, \Sigma) \quad \text{if and only if} \quad AX \sim \mathcal{N}(A\mu, A\Sigma A^T),$$
see for example Blitzstein & Hwang (2014, pp. 309-310).
Second, from this it immediately follows that any subset $H \subset X$ of the random variables is itself normally distributed, with mean and covariance given by simply ignoring all elements that are not in $H$; this is called the marginalization property. In particular, we choose a linear transformation that simply ignores the components we want to marginalize out. As an example, let’s take the trivariate Gaussian distribution
$$\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}\right),$$
which has a three-dimensional mean vector and adds a variance and two correlations to the (symmetric) covariance matrix. Define
$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix},$$
which picks out the components $X_1$ and $X_2$ and ignores $X_3$. Putting this into the equality from the definition, we arrive at
$$
\begin{aligned}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} &\sim \mathcal{N}\left(\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix}\begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}\right) \\[.5em]
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} &\sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).
\end{aligned}
$$
Two points to wrap up. First, it helps to know your definitions. Second, in the Gaussian case, computing marginal distributions is trivial. Conditional distributions are a bit harder, unfortunately. But not by much.
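In R, this marginalization is a one-liner of matrix algebra. The sketch below uses made-up parameter values; the point is only that $A\mu$ and $A\Sigma A^T$ reproduce the corresponding sub-vector and sub-block of the original parameters.

```r
# The same marginalization as matrix algebra, with made-up parameter values:
# A mu and A Sigma A^T pick out the first two components of mu and the
# top-left 2 x 2 block of Sigma.
mu <- c(1, 2, 3)
s  <- c(1, 2, 3)                           # standard deviations
R  <- matrix(c(1.0,  0.3,  0.5,
               0.3,  1.0, -0.2,
               0.5, -0.2,  1.0), nrow = 3)  # correlation matrix
Sigma <- diag(s) %*% R %*% diag(s)          # covariance matrix

A <- matrix(c(1, 0, 0,
              0, 1, 0), nrow = 2, byrow = TRUE)

A %*% mu              # marginal mean: (mu1, mu2)
A %*% Sigma %*% t(A)  # marginal covariance
Sigma[1:2, 1:2]       # the same 2 x 2 block, picked out directly
```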
Conditioning means incorporating information. The fact that Gaussian distributions are closed under conditioning means that, if we start with a Gaussian distribution and update our knowledge given the observed value of one of its components, then the resulting distribution is still Gaussian: we never have to leave the wonderful land of the Gaussians! In the following, we prove this first for the simple bivariate case, which should also give some intuition as to how conditioning differs from marginalizing, and then provide the more general expression for $p$ dimensions.
Instead of ignoring information, as we did when computing marginal distributions above, we now want to incorporate the information we have about the other random variable $X_2$. Conditioning implies learning: how does observing $X_2 = x_2$ change our beliefs about $X_1$? The product rule above leads to Bayes’ rule (via simple division), which is exactly what we need:
$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)},$$
where we have suppressed conditioning on the parameters $\rho, \sigma_1^2, \sigma_2^2$ to avoid cluttered notation. Let’s do some algebra! We write
$$
\begin{aligned}
f(x_1 \mid x_2) &= \frac{\frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2x_2^2\right]\right)}{\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2\sigma_2^2}x_2^2\right)} \\[.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2x_2^2\right] + \frac{1}{2\sigma_2^2}x_2^2\right),
\end{aligned}
$$
which already looks promising. Moving the $x_2^2$ term into the square brackets, we should see a nice quadratic form:
$$
\begin{aligned}
&\frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2x_2^2 - \frac{2\sigma_1^2\sigma_2^2(1 - \rho^2)}{2\sigma_2^2}x_2^2\right]\right) \\[.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2x_2^2 - \sigma_1^2(1 - \rho^2)x_2^2\right]\right) \\[.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2x_2^2 - \sigma_1^2x_2^2 + \sigma_1^2\rho^2x_2^2\right]\right) \\[.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)}\left[\sigma_2^2x_1^2 - 2\rho\sigma_1\sigma_2 x_1x_2 + \sigma_1^2\rho^2x_2^2\right]\right) \\[.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)}\left[x_1^2 - 2\rho\frac{\sigma_1}{\sigma_2}x_1x_2 + \frac{\sigma_1^2}{\sigma_2^2}\rho^2x_2^2\right]\right) \\[.5em]
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)}\left(x_1 - \rho\frac{\sigma_1}{\sigma_2}x_2\right)^2\right).
\end{aligned}
$$
Done! The conditional distribution has a mean of $\rho\frac{\sigma_1}{\sigma_2}x_2$ and a variance of $\sigma_1^2(1 - \rho^2)$. What does this look like in $p$ dimensions?
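Before moving on, here is a quick numerical check of the bivariate result (a sketch assuming the mvtnorm package and arbitrary parameter values): the derived conditional density should match the joint density divided by the marginal density of $x_2$.

```r
# Quick numerical check of the bivariate conditional distribution: the
# density derived above should equal the joint density divided by the
# marginal density of x2.
library(mvtnorm)

s1 <- 1; s2 <- 2; rho <- 0.7
Sigma <- matrix(c(s1^2,          rho * s1 * s2,
                  rho * s1 * s2, s2^2), nrow = 2)

x1 <- 0.4; x2 <- -1.5

# conditional density from the derivation
dnorm(x1, mean = rho * (s1 / s2) * x2, sd = sqrt(s1^2 * (1 - rho^2)))

# conditional density via Bayes' rule: joint / marginal
dmvnorm(c(x1, x2), sigma = Sigma) / dnorm(x2, sd = s2)
```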
We need a little bit more notation for the crazy ride we’re about to embark on. Let $x = (x_1, \ldots, x_n)^T$ be an $n$-dimensional vector and $y = (y_1, \ldots, y_m)^T$ an $m$-dimensional vector, which are jointly Gaussian distributed with mean zero and covariance matrix $\Sigma \in \mathbb{R}^{(n+m) \times (n+m)}$. Note that we can write $\Sigma$ as a block matrix, i.e.,
$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix},$$
where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariance matrices of $x$ and $y$, respectively, and $\Sigma_{xy} = \Sigma_{yx}^T$ gives the covariance between $x$ and $y$. We remember the density function of a multivariate Gaussian distribution from above, and take a first stab:
$$
\begin{aligned}
f(x \mid y) = \frac{f(x, y)}{f(y)} &= \frac{(2\pi)^{-(n+m)/2}|\Sigma|^{-1/2} \exp\left(-\frac{1}{2}\begin{pmatrix} x \\ y \end{pmatrix}^T \Sigma^{-1} \begin{pmatrix} x \\ y \end{pmatrix}\right)}{(2\pi)^{-m/2}|\Sigma_{yy}|^{-1/2} \exp\left(-\frac{1}{2}y^T\Sigma_{yy}^{-1}y\right)} \\[.5em]
&= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}^{-1} \begin{pmatrix} x \\ y \end{pmatrix} + \frac{1}{2}y^T\Sigma_{yy}^{-1}y\right) \\[.5em]
&= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[\begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}^{-1} \begin{pmatrix} x \\ y \end{pmatrix} - y^T\Sigma_{yy}^{-1}y\right]\right).
\end{aligned}
$$
There’s only a slight problem. The inverse of the block matrix is pretty ugly:
$$\Sigma^{-1} = \begin{pmatrix} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\ -\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \end{pmatrix},$$
where $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ is the Schur complement of the block $\Sigma_{yy}$ in the block matrix above. Let’s be lazy and delay computation by simply renaming the relevant parts, i.e.,
$$\Omega = \begin{pmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{pmatrix} = \Sigma^{-1}.$$
Proceeding bravely, we write
$$
\begin{aligned}
f(x \mid y) &= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[\begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} - y^T\Sigma_{yy}^{-1}y\right]\right) \\[.5em]
&= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[\begin{pmatrix} x^T\Omega_{xx} + y^T\Omega_{yx} \\ x^T\Omega_{xy} + y^T\Omega_{yy} \end{pmatrix}^T \begin{pmatrix} x \\ y \end{pmatrix} - y^T\Sigma_{yy}^{-1}y\right]\right) \\[.5em]
&= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + y^T\Omega_{yx}x + x^T\Omega_{xy}y + y^T\Omega_{yy}y - y^T\Sigma_{yy}^{-1}y\right]\right) \\[.5em]
&= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + 2x^T\Omega_{xy}y + y^T\left(\Omega_{yy} - \Sigma_{yy}^{-1}\right)y\right]\right),
\end{aligned}
$$
where we get the last line by noting that $y^T\Omega_{yx}x = \left(x^T\Omega_{xy}y\right)^T$, i.e., they give the same scalar. It is also important to keep in mind that $\Omega_{yy} \neq \Sigma_{yy}^{-1}$.
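At this point a numerical sanity check can be reassuring. The sketch below builds an arbitrary positive-definite $\Sigma$, partitions it into blocks, and verifies the expressions for $\Omega_{xx}$ and $\Omega_{xy}$ against a direct matrix inversion; none of this is needed for the proof, it simply confirms the block-inverse formula.

```r
# Numerical sanity check of the block-inverse expressions, using an
# arbitrary positive-definite covariance matrix with n = 2 and m = 2.
set.seed(1)
M     <- matrix(rnorm(16), 4, 4)
Sigma <- crossprod(M) + diag(4)   # random symmetric positive-definite matrix
ix <- 1:2; iy <- 3:4              # indices of the x-block and the y-block

Sxx <- Sigma[ix, ix]; Sxy <- Sigma[ix, iy]
Syx <- Sigma[iy, ix]; Syy <- Sigma[iy, iy]

Omega <- solve(Sigma)                        # direct inversion
Schur <- Sxx - Sxy %*% solve(Syy) %*% Syx    # Schur complement of Sigma_yy

all.equal(Omega[ix, ix], solve(Schur))                           # Omega_xx
all.equal(Omega[ix, iy], -solve(Schur) %*% Sxy %*% solve(Syy))   # Omega_xy
```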
There is hope: we are in a situation analogous to the two-dimensional case described above. Somehow we must be able to “complete the square” in the more general $p$-dimensional case, too.
Scribbling on paper for a bit, we dare to conjecture that the conditional distribution is
$$f(x \mid y) = (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left(x + \Omega_{xx}^{-1}\Omega_{xy}y\right)^T\Omega_{xx}\left(x + \Omega_{xx}^{-1}\Omega_{xy}y\right)\right),$$
which expands into
$$
\begin{aligned}
&= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + x^T\Omega_{xx}\Omega_{xx}^{-1}\Omega_{xy}y + y^T\Omega_{xy}^T\Omega_{xx}^{-T}\Omega_{xx}x + y^T\Omega_{xy}^T\Omega_{xx}^{-T}\Omega_{xx}\Omega_{xx}^{-1}\Omega_{xy}y\right]\right) \\[.5em]
&= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + 2x^T\Omega_{xy}y + y^T\Omega_{yx}\Omega_{xx}^{-1}\Omega_{xy}y\right]\right).
\end{aligned}
$$
For our conjecture to be true, it must hold that
$$\Omega_{yy} - \Sigma_{yy}^{-1} = \Omega_{yx}\Omega_{xx}^{-1}\Omega_{xy}.$$
Indeed, remember that
$$\Sigma^{-1} = \begin{pmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{pmatrix} = \begin{pmatrix} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\ -\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \end{pmatrix},$$
and therefore
$$
\begin{aligned}
\Omega_{yx}\Omega_{xx}^{-1}\Omega_{xy} &= -\Sigma_{yy}^{-1}\Sigma_{yx}\overbrace{\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)}^{I}\left(-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\[.5em]
&= -\Sigma_{yy}^{-1}\Sigma_{yx}\left(-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\[.5em]
&= \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\[.5em]
&= \underbrace{\Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}}_{\Omega_{yy}} - \Sigma_{yy}^{-1}.
\end{aligned}
$$
This means that we have correctly completed the square! To clean up the business of the determinants, note that the determinant of a block matrix factors such that
$$|\Sigma| = \left|\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right| \times \left|\Sigma_{yy}\right|.$$
Substituting this into our equation for the conditional density, as well as substituting all the $\Omega$’s with $\Sigma$’s, results in
$$
\begin{aligned}
f(x \mid y) &= (2\pi)^{-n/2}|\Sigma|^{-1/2}|\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left(x + \Omega_{xx}^{-1}\Omega_{xy}y\right)^T\Omega_{xx}\left(x + \Omega_{xx}^{-1}\Omega_{xy}y\right)\right) \\[.5em]
&= (2\pi)^{-n/2}\left|\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right|^{-1/2} \exp\left(-\frac{1}{2}\left(x - \Sigma_{xy}\Sigma_{yy}^{-1}y\right)^T\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\left(x - \Sigma_{xy}\Sigma_{yy}^{-1}y\right)\right).
\end{aligned}
$$
Thus, if $(x, y)$ are jointly normally distributed (with mean zero, as assumed here), then incorporating the information that $Y = y$ leads to a conditional distribution $f(x \mid y)$ that is Gaussian with conditional mean $\Sigma_{xy}\Sigma_{yy}^{-1}y$ and conditional covariance $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$.
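To convince ourselves numerically (again a sketch, assuming the mvtnorm package and an arbitrary positive-definite $\Sigma$ with zero mean), we can check the determinant factorization and compare the conditional density formula with a direct joint-over-marginal computation:

```r
# Numerical check of the final result (zero means, arbitrary positive-
# definite Sigma): the determinant factorizes, and the conditional density
# formula matches a direct joint-over-marginal computation via mvtnorm.
library(mvtnorm)

set.seed(1)
M     <- matrix(rnorm(16), 4, 4)
Sigma <- crossprod(M) + diag(4)
ix <- 1:2; iy <- 3:4

Sxx <- Sigma[ix, ix]; Sxy <- Sigma[ix, iy]
Syx <- Sigma[iy, ix]; Syy <- Sigma[iy, iy]

Schur <- Sxx - Sxy %*% solve(Syy) %*% Syx
Schur <- (Schur + t(Schur)) / 2          # guard against tiny numerical asymmetry

det(Sigma)
det(Schur) * det(Syy)                    # should equal det(Sigma)

y <- c(0.5, -1.0)                        # an 'observed' y (made up)
x <- c(0.2, 0.7)                         # a point at which to evaluate the density
cond_mean <- as.vector(Sxy %*% solve(Syy) %*% y)

dmvnorm(x, mean = cond_mean, sigma = Schur)                # conditional formula
dmvnorm(c(x, y), sigma = Sigma) / dmvnorm(y, sigma = Syy)  # joint / marginal
```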
In this blog post, we have seen that the Gaussian distribution has two important properties: it is closed under (a) marginalization and (b) conditioning. For the bivariate case, an accompanying Shiny app hopefully helped to build some intuition about the difference between these two operations.
For the general $p$-dimensional case, we noted that a random variable by definition follows a (multivariate) Gaussian distribution if and only if every linear combination of its components follows a Gaussian distribution. This made it obvious that the Gaussian distribution is closed under marginalization: we simply ignore the components we want to marginalize over in the linear combination.
To show that a Gaussian distribution of arbitrary dimension is closed under conditioning, we had to rely on a mathematical trick called “completing the square”, as well as certain properties of matrices few mortals can remember. In conclusion, I think we should celebrate the fact that frequent operations such as marginalizing and conditioning do not expel us from the wonderful land of the Gaussians.5
I would like to thank Don van den Bergh and Sophia Crüwell for helpful comments on this blog post.
$\Sigma^{-1}$ is the main object of interest in Gaussian graphical models. This is because of another special property of the Gaussian: if the off-diagonal element $(i, j)$ in $\Sigma^{-1}$ is zero, then the variables $X_i$ and $X_j$ are conditionally independent given all the other variables, and there is no edge between those two variables in the graph. ↩
You might enjoy training your intuitions about correlations on http://guessthecorrelation.com/. ↩
See also Dennis Lindley’s paper The philosophy of statistics. ↩
This follows from the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2} \, \mathrm{d}x = \sqrt{\pi}$, see here. For more on why $\pi$ and $e$ feature in the Gaussian density, see this. ↩
The land of the Gaussians is vast: its inhabitants — all Gaussian distributions — are also closed under multiplication and convolution. This might make for a future blog post. ↩