

Two properties of the Gaussian distribution

In a previous blog post, we looked at the history of least squares, how Gauss justified it using the Gaussian distribution, and how Laplace justified the Gaussian distribution using the central limit theorem. The Gaussian distribution has a number of special properties which distinguish it from other distributions and which make it easy to work with mathematically. In this blog post, I will focus on two of these properties: being closed under (a) marginalization and (b) conditioning. This means that, if one starts with a $p$-dimensional Gaussian distribution and marginalizes out or conditions on one or more of its components, the resulting distribution will still be Gaussian.

This blog post has two parts. First, I will introduce the joint, marginal, and conditional Gaussian distributions for the case of two random variables; an interactive Shiny app illustrates the differences between them. Second, I will show mathematically that the marginal and conditional distributions do indeed have the form I presented in the first part. I will extend this to the $p$-dimensional case, demonstrating that the Gaussian distribution is closed under marginalization and conditioning. This second part is a little heavier on the mathematics, so if you just want to get an intuition, you may focus on the first part and skip the second. Let’s get started!

The Land of the Gaussians

In the linear regression case discussed previously, we modeled each individual data point $y_i$ as coming from a univariate conditional Gaussian distribution with mean $\mu = x_i^T b$ and variance $\sigma^2$. In this blog post, we introduce the random variables $X_1$ and $X_2$ and assume that both are jointly normally distributed; we are going from $p = 1$ to $p = 2$ dimensions. The probability density function changes accordingly: it becomes a function mapping from two dimensions to one, i.e., $f: \mathbb{R}^2 \rightarrow \mathbb{R}^+$.

To simplify notation, let $x = (x_1, x_2)^T$ and $\mu = (\mu_1, \mu_2)^T$ be two 2-dimensional vectors denoting one observation and the population means, respectively. For simplicity, we set the population means to zero, i.e., $\mu = (0, 0)^T$. In one dimension, we had just one parameter for the variance, $\sigma^2$; in two dimensions, this becomes a symmetric $2 \times 2$ covariance matrix

$$\Sigma = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$

where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of the random variables $X_1$ and $X_2$, respectively, and $\rho$ is the population correlation between the two. The general form of the density function of a $p$-dimensional Gaussian distribution is

$$f(x \mid \mu, \Sigma) = (2\pi)^{-p/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right),$$

where $x$ and $\mu$ are $p$-dimensional vectors, $\Sigma^{-1}$ is the $(p \times p)$-dimensional inverse covariance matrix, and $|\Sigma|$ is the determinant of the covariance matrix.¹ We focus on the simpler 2-dimensional, zero-mean case. Observe that

$$\Sigma^{-1} = \frac{1}{|\Sigma|} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix} = \frac{1}{\sigma_1^2\sigma_2^2(1 - \rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix},$$

which we use to expand the bivariate Gaussian density function:

$$
\begin{aligned}
f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) &= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^T \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \begin{pmatrix} x_1\sigma_2^2 - x_2\rho\sigma_1\sigma_2 \\ x_2\sigma_1^2 - x_1\rho\sigma_1\sigma_2 \end{pmatrix}^T \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}\right) \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\right]\right).
\end{aligned}
$$
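
If you want to convince yourself that this expansion is correct, a quick numerical check is to implement the last line and compare it against an off-the-shelf implementation. Below is a minimal sketch in R; it assumes you have the mvtnorm package installed, and the parameter values are arbitrary.

```r
library(mvtnorm)

# bivariate zero-mean Gaussian density, using the expanded expression above
dbvn <- function(x1, x2, s1, s2, rho) {
  q <- s2^2 * x1^2 - 2 * rho * s1 * s2 * x1 * x2 + s1^2 * x2^2
  1 / (2 * pi * sqrt(s1^2 * s2^2 * (1 - rho^2))) *
    exp(-q / (2 * s1^2 * s2^2 * (1 - rho^2)))
}

s1 <- 1; s2 <- 2; rho <- 0.7
Sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), nrow = 2)

dbvn(0.5, -1, s1, s2, rho)                          # expanded formula
dmvnorm(c(0.5, -1), mean = c(0, 0), sigma = Sigma)  # should give the same value
```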

The figure below plots the contour lines of nine different bivariate normal distributions with mean zero, correlations $\rho \in \{0, -0.3, 0.7\}$, and standard deviations $\sigma_1, \sigma_2 \in \{1, 2\}$.

[Figure: contour plots of nine bivariate Gaussian distributions for the different combinations of $\rho$, $\sigma_1$, and $\sigma_2$.]

In the top row, all bivariate Gaussian distributions have $\rho = 0$ and look like a circle for standard deviations of equal size. The top middle plot is stretched along $X_2$, giving it an elliptical shape. The middle and last row show how the distribution changes for negative ($\rho = -0.3$) and positive ($\rho = 0.7$) correlations.²

In the remainder of this blog post, we will take a closer look at two operations: marginalization and conditioning. Marginalizing means ignoring, and conditioning means incorporating information. In the zero-mean bivariate case, marginalizing out $X_2$ results in

$$f(x_1) = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{1}{2\sigma_1^2} x_1^2\right),$$

which is a simple univariate Gaussian distribution with mean $0$ and variance $\sigma_1^2$. On the other hand, incorporating the information that $X_2 = x_2$ results in

$$f(x_1 \mid x_2) = \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)} \left(x_1 - \rho\frac{\sigma_1}{\sigma_2} x_2\right)^2\right),$$

which has mean $\rho\frac{\sigma_1}{\sigma_2} x_2$ and variance $\sigma_1^2(1 - \rho^2)$. The next section provides two simple examples illustrating the difference between these two types of distributions, as well as a simple Shiny app that allows you to build an intuition for conditioning in the bivariate case.

Two examples and a Shiny app

Let’s illustrate the difference between marginalization and conditioning with two simple examples. First, assume that the correlation is very high with $\rho = 0.8$, and that $\sigma_1^2 = \sigma_2^2 = 1$. Then, observing for example $X_2 = 2$, our belief about $X_1$ changes such that its mean gets shifted towards the observed $x_2$ value, i.e., $\mu_1 = 0.8 \cdot 2 = 1.6$ (indicated by the dotted line in the figure below). The variance of $X_1$ gets substantially reduced, from $1$ to $(1 - 0.8^2) = 0.36$. This is what the left part of the figure below illustrates. If, on the other hand, $\rho = 0$ such that $X_1$ and $X_2$ are not related, then observing $X_2 = 2$ changes neither the mean of $X_1$ (it stays at zero) nor its variance (it stays at $1$); see the right part of the figure below. Note that the marginal and conditional densities are multiplied by a constant to make them more visible.
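
If you like, you can verify these numbers by simulation. The sketch below draws from the joint distribution with $\rho = 0.8$ and $\sigma_1^2 = \sigma_2^2 = 1$, keeps only the draws where $X_2$ lands close to $2$ (the tolerance of $0.05$ is an arbitrary choice), and compares the empirical mean and variance of $X_1$ among those draws to the values above; it again assumes the mvtnorm package.

```r
library(mvtnorm)
set.seed(2020)

rho <- 0.8
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
draws <- rmvnorm(1e6, mean = c(0, 0), sigma = Sigma)

# keep only draws where x2 is (approximately) equal to 2
x1_given_x2 <- draws[abs(draws[, 2] - 2) < 0.05, 1]

c(theory = rho * 2, simulation = mean(x1_given_x2))    # conditional mean, about 1.6
c(theory = 1 - rho^2, simulation = var(x1_given_x2))   # conditional variance, about 0.36
```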

[Figure: joint, marginal, and conditional densities for $\rho = 0.8$ (left) and $\rho = 0$ (right), with the conditioning value $X_2 = 2$ indicated by a dotted line.]

To explore the relation between joint, marginal, and conditional Gaussian distributions, you can play around with a Shiny app following this link. In the remainder of the blog post, we will prove that the two distributions given above are in fact the marginal and conditional distributions in the two-dimensional case. We will also generalize these results to p-dimensional Gaussian distributions.

The two rules of probability

In the second part of this blog post, we need the two fundamental ‘rules’ of probability: the sum rule and the product rule.³ The sum rule states that

$$p(x) = \int p(x, y) \, \mathrm{d}y,$$

and the product rule states that

$$p(x, y) = p(x \mid y) \, p(y) = p(y \mid x) \, p(x).$$

In the remainder, we will see that a joint Gaussian distribution can be factorized into a conditional Gaussian and a marginal Gaussian distribution.

Property I: Closed under Marginalization

The first property states that if we marginalize out variables in a multivariate Gaussian distribution, the result is still a Gaussian distribution. The Gaussian distribution is thus closed under marginalization. Below, I will show this for a bivariate Gaussian distribution directly, and for arbitrary-dimensional Gaussian distributions by thinking rather than computing. This illustrates that knowing your definitions can help avoid tedious calculations.

2-dimensional case

To show that the marginalisation property holds for the bivariate Gaussian distribution, we need to solve the following integration problem

$$\int_{X_2} f(x_1, x_2 \mid \sigma_1^2, \sigma_2^2, \rho) \, \mathrm{d}x_2,$$

and check whether the result is a univariate Gaussian distribution. We tackle the problem head on and expand

$$
\begin{aligned}
&\int_{X_2} \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\right]\right) \mathrm{d}x_2 \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_1^2 x_2^2 - 2\rho\sigma_1\sigma_2 x_1 x_2\right]\right) \mathrm{d}x_2 \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)} \left[x_2^2 - 2\rho\frac{\sigma_2}{\sigma_1} x_1 x_2\right]\right) \mathrm{d}x_2.
\end{aligned}
$$

Putting everything that does not involve $x_2$ outside the integral, we’ve come quite far! Note that we can “complete the square”, that is, write

$$x_2^2 - 2\rho\frac{\sigma_2}{\sigma_1} x_1 x_2 = \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2.$$

This leads to

$$
\begin{aligned}
&\frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \sigma_2^2 x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)} \left[\left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2 - \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2\right]\right) \mathrm{d}x_2 \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \sigma_2^2 x_1^2 + \frac{1}{2\sigma_2^2(1 - \rho^2)} \rho^2\frac{\sigma_2^2}{\sigma_1^2} x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2\right) \mathrm{d}x_2 \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)} x_1^2 + \frac{1}{2\sigma_1^2(1 - \rho^2)} \rho^2 x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2\right) \mathrm{d}x_2 \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{x_1^2 - \rho^2 x_1^2}{2\sigma_1^2(1 - \rho^2)}\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2\right) \mathrm{d}x_2 \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{x_1^2(1 - \rho^2)}{2\sigma_1^2(1 - \rho^2)}\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2\right) \mathrm{d}x_2 \\
&= \frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2} x_1^2\right) \int_{X_2} \exp\left(-\frac{1}{2\sigma_2^2(1 - \rho^2)} \left(x_2 - \rho\frac{\sigma_2}{\sigma_1} x_1\right)^2\right) \mathrm{d}x_2.
\end{aligned}
$$

We are nearly done! What’s left is to realize that the integrand is the kernel of a univariate Gaussian distribution with mean $\rho\frac{\sigma_2}{\sigma_1} x_1$ and variance $\sigma_2^2(1 - \rho^2)$: it’s an unnormalized conditional Gaussian distribution! The thing that makes a Gaussian distribution integrate to $1$, as all distributions must, is the normalizing constant in front, the strange term involving $\pi$. For this particular distribution, the normalizing constant is $\sqrt{2\pi\sigma_2^2(1 - \rho^2)}$.⁴

Continuing, we arrive at the solution

$$\frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2} x_1^2\right) \sqrt{2\pi\sigma_2^2(1 - \rho^2)} = \frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left(-\frac{1}{2\sigma_1^2} x_1^2\right),$$

which is the density function of a univariate Gaussian distribution (with mean zero). With some work, we have shown that marginalizing out a variable in a bivariate Gaussian distribution leads to a univariate Gaussian distribution. This process ‘removes’ any occurrences of the correlation $\rho$ and the other variable $x_2$.
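
In case you distrust the algebra, we can also let the computer do the integral: numerically integrating the joint density over $x_2$ and comparing the result with the univariate Gaussian density should give (approximately) the same number. A minimal sketch, again assuming the mvtnorm package and arbitrary parameter values:

```r
library(mvtnorm)

s1 <- 1; s2 <- 2; rho <- 0.7
Sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), nrow = 2)
x1 <- 0.5

# integrate the joint density over x2 to get the marginal density at x1
marginal <- integrate(
  function(x2) dmvnorm(cbind(x1, x2), mean = c(0, 0), sigma = Sigma),
  lower = -Inf, upper = Inf
)$value

marginal                       # numerically integrated marginal density at x1
dnorm(x1, mean = 0, sd = s1)   # closed-form univariate Gaussian density at x1
```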

Granted, this process was rather tedious and not at all general (but good practice!) — does it also work when going from 3 to 2 dimensions? Will the remaining bivariate distribution be Gaussian? What if we go from 200 dimensions to 97 dimensions?

p-dimensional case

A more elegant way to see that a p-dimensional Gaussian distribution is closed under marginalization is the following. First, we note the requirement that a random variable needs to fulfill in order to have a (multivariate) Gaussian distribution.

Definition. $X = (X_1, \ldots, X_p)^T$ has a multivariate Gaussian distribution if every linear combination of its components has a (multivariate) Gaussian distribution. Formally,

$$X \sim \mathcal{N}(\mu, \Sigma) \quad \text{if and only if} \quad AX \sim \mathcal{N}(A\mu, A\Sigma A^T),$$

see for example Blitzstein & Hwang (2014, pp. 309-310).

Second, from this it immediately follows that any subset $H$ of the random variables in $X$ is itself normally distributed, with mean and covariance given by simply ignoring all elements that are not in $H$; this is called the marginalization property. In particular, we choose a linear transformation that simply ignores the components we want to marginalize out. As an example, let’s take the trivariate Gaussian distribution

$$\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix}\right),$$

which has a three-dimensional mean vector and adds a variance and two correlations to the (symmetric) covariance matrix. Define

$$A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix},$$

which picks out the components X1 and X2 and ignores X3. Putting this into the equality from the definition, we arrive at

$$
\begin{aligned}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} &\sim \mathcal{N}\left(\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 & \rho_{13}\sigma_1\sigma_3 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 & \rho_{23}\sigma_2\sigma_3 \\ \rho_{13}\sigma_1\sigma_3 & \rho_{23}\sigma_2\sigma_3 & \sigma_3^2 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}\right) \\
\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} &\sim \mathcal{N}\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho_{12}\sigma_1\sigma_2 \\ \rho_{12}\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}\right).
\end{aligned}
$$
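
The same computation is also trivial to do numerically. The sketch below uses arbitrary values for the means, standard deviations, and correlations; the matrix $A$ picks out the first two components, and $A\Sigma A^T$ is simply the upper-left $2 \times 2$ block of $\Sigma$.

```r
# arbitrary trivariate example (values chosen for illustration only)
mu    <- c(1, -1, 2)
s     <- c(1, 2, 1.5)                      # standard deviations
R     <- matrix(c(1,   0.3, -0.2,
                  0.3, 1,    0.5,
                 -0.2, 0.5,  1), nrow = 3) # correlation matrix
Sigma <- diag(s) %*% R %*% diag(s)         # covariance matrix

A <- matrix(c(1, 0, 0,
              0, 1, 0), nrow = 2, byrow = TRUE)

A %*% mu              # marginal mean of (X1, X2): just (mu1, mu2)
A %*% Sigma %*% t(A)  # marginal covariance: the upper-left 2 x 2 block of Sigma
Sigma[1:2, 1:2]       # the same thing
```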

Two points to wrap up. First, it helps to know your definitions. Second, in the Gaussian case, computing marginal distributions is trivial. Conditional distributions are a bit harder, unfortunately. But not by much.

Property II: Closed under Conditioning

Conditioning means incorporating information. The fact that Gaussian distributions are closed under conditioning means that, if we start with a Gaussian distribution and update our knowledge given the observed value of one of its components, then the resulting distribution is still Gaussian — we never have to leave the wonderful land of the Gaussians! In the following, we prove this first for the simple bivariate case, which should also give some intuition as to how conditioning differs from marginalizing, and then provide the more general expression for p dimensions.

Instead of ignoring information, as we did when computing marginal distributions above, we now want to incorporate information we have about the other random variable $X_2$. Conditioning implies learning: how does our knowledge that $X_2 = x_2$ change our knowledge about $X_1$?

2-dimensional case

Let’s say we observe $X_2 = x_2$. How does that change our beliefs about $X_1$? The product rule above leads to Bayes’ rule (via simple division), which is exactly what we need:

$$f(x_1 \mid x_2) = \frac{f(x_1, x_2)}{f(x_2)},$$

where we have suppressed conditioning on the parameters $\rho, \sigma_1^2, \sigma_2^2$ to avoid cluttered notation. Let’s do some algebra! We write

$$
\begin{aligned}
f(x_1 \mid x_2) &= \frac{\frac{1}{2\pi\sqrt{\sigma_1^2\sigma_2^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\right]\right)}{\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left(-\frac{1}{2\sigma_2^2} x_2^2\right)} \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2 x_2^2\right] + \frac{1}{2\sigma_2^2} x_2^2\right),
\end{aligned}
$$

which already looks promising. Putting the $x_2^2$ term into the square brackets, we should see a nice quadratic form:

$$
\begin{aligned}
&\frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \frac{2\sigma_1^2\sigma_2^2(1 - \rho^2)}{2\sigma_2^2} x_2^2\right]\right) \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \sigma_1^2(1 - \rho^2) x_2^2\right]\right) \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2 x_2^2 - \sigma_1^2 x_2^2 + \sigma_1^2\rho^2 x_2^2\right]\right) \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2\sigma_2^2(1 - \rho^2)} \left[\sigma_2^2 x_1^2 - 2\rho\sigma_1\sigma_2 x_1 x_2 + \sigma_1^2\rho^2 x_2^2\right]\right) \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)} \left[x_1^2 - 2\rho\frac{\sigma_1}{\sigma_2} x_1 x_2 + \frac{\sigma_1^2}{\sigma_2^2}\rho^2 x_2^2\right]\right) \\
&= \frac{1}{\sqrt{2\pi\sigma_1^2(1 - \rho^2)}} \exp\left(-\frac{1}{2\sigma_1^2(1 - \rho^2)} \left(x_1 - \rho\frac{\sigma_1}{\sigma_2} x_2\right)^2\right).
\end{aligned}
$$

Done! The conditional distribution has a mean of $\rho\frac{\sigma_1}{\sigma_2} x_2$ and a variance of $\sigma_1^2(1 - \rho^2)$. What does this look like in $p$ dimensions?
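
Before we move on, here is a quick numerical spot check of this result: compute $f(x_1, x_2) / f(x_2)$ directly and compare it with the closed-form conditional density we just derived. The sketch below again leans on the mvtnorm package; the parameter values and evaluation points are arbitrary.

```r
library(mvtnorm)

s1 <- 1; s2 <- 2; rho <- 0.5
Sigma <- matrix(c(s1^2, rho * s1 * s2, rho * s1 * s2, s2^2), nrow = 2)
x1 <- 0.3; x2 <- 1.5

# conditional density via Bayes' rule: f(x1 | x2) = f(x1, x2) / f(x2)
dmvnorm(c(x1, x2), mean = c(0, 0), sigma = Sigma) / dnorm(x2, mean = 0, sd = s2)

# closed-form conditional density derived above
dnorm(x1, mean = rho * (s1 / s2) * x2, sd = sqrt(s1^2 * (1 - rho^2)))
```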

p-dimensional case

We need a little bit more notation for the crazy ride we’re about to embark on. Let $x = (x_1, \ldots, x_n)^T$ be an $n$-dimensional vector and $y = (y_1, \ldots, y_m)^T$ an $m$-dimensional vector that are jointly Gaussian distributed (again with mean zero) with covariance matrix $\Sigma \in \mathbb{R}^{(n+m) \times (n+m)}$. Note that we can write $\Sigma$ as a block matrix, i.e.,

$$\Sigma = \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix},$$

where $\Sigma_{xx}$ and $\Sigma_{yy}$ are the covariance matrices of $x$ and $y$, respectively, and $\Sigma_{xy} = \Sigma_{yx}^T$ gives the covariance between $x$ and $y$. We remember the density function of a multivariate Gaussian distribution from above, and take a first stab:

$$
\begin{aligned}
f(x \mid y) = \frac{f(x, y)}{f(y)} &= \frac{(2\pi)^{-(n+m)/2} |\Sigma|^{-1/2} \exp\left(-\frac{1}{2} \begin{pmatrix} x \\ y \end{pmatrix}^T \Sigma^{-1} \begin{pmatrix} x \\ y \end{pmatrix}\right)}{(2\pi)^{-m/2} |\Sigma_{yy}|^{-1/2} \exp\left(-\frac{1}{2} y^T \Sigma_{yy}^{-1} y\right)} \\
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2} \begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}^{-1} \begin{pmatrix} x \\ y \end{pmatrix} + \frac{1}{2} y^T \Sigma_{yy}^{-1} y\right) \\
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[\begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{pmatrix}^{-1} \begin{pmatrix} x \\ y \end{pmatrix} - y^T \Sigma_{yy}^{-1} y\right]\right).
\end{aligned}
$$

There’s only a slight problem. The inverse of the block matrix is pretty ugly:

$$\Sigma^{-1} = \begin{pmatrix} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\ -\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \end{pmatrix},$$

where $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$ is the Schur complement of $\Sigma_{yy}$ in the block matrix above. Let’s be lazy and delay computation by simply renaming the relevant parts, i.e.,

$$\Omega = \begin{pmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{pmatrix} = \Sigma^{-1}.$$

Proceeding bravely, we write

$$
\begin{aligned}
f(x \mid y) &= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[\begin{pmatrix} x \\ y \end{pmatrix}^T \begin{pmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} - y^T \Sigma_{yy}^{-1} y\right]\right) \\
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[\begin{pmatrix} x^T\Omega_{xx} + y^T\Omega_{yx} & x^T\Omega_{xy} + y^T\Omega_{yy} \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix} - y^T \Sigma_{yy}^{-1} y\right]\right) \\
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + y^T\Omega_{yx}x + x^T\Omega_{xy}y + y^T\Omega_{yy}y - y^T \Sigma_{yy}^{-1} y\right]\right) \\
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + 2x^T\Omega_{xy}y + y^T\left(\Omega_{yy} - \Sigma_{yy}^{-1}\right)y\right]\right),
\end{aligned}
$$

where we get the last line by noting that $y^T \Omega_{yx} x = (x^T \Omega_{xy} y)^T$, i.e., they give the same scalar. It is also important to keep in mind that $\Omega_{yy} \neq \Sigma_{yy}^{-1}$.

There is hope: we are in a situation analogous to the two-dimensional case described above. Somehow we must be able to “complete the square” in the more general $p$-dimensional case, too.

Scribbling on paper for a bit, we dare to conjecture that the conditional distribution is

$$f(x \mid y) = (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left(x + \Omega_{xx}^{-1}\Omega_{xy} y\right)^T \Omega_{xx} \left(x + \Omega_{xx}^{-1}\Omega_{xy} y\right)\right),$$

which expands into

$$
\begin{aligned}
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + x^T\Omega_{xx}\Omega_{xx}^{-1}\Omega_{xy}y + y^T\Omega_{xy}^T\Omega_{xx}^{-T}\Omega_{xx}x + y^T\Omega_{xy}^T\Omega_{xx}^{-T}\Omega_{xx}\Omega_{xx}^{-1}\Omega_{xy}y\right]\right) \\
&= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left[x^T\Omega_{xx}x + 2x^T\Omega_{xy}y + y^T\Omega_{yx}\Omega_{xx}^{-1}\Omega_{xy}y\right]\right).
\end{aligned}
$$

For our conjecture to be true, it must hold that

$$\Omega_{yy} - \Sigma_{yy}^{-1} = \Omega_{yx}\Omega_{xx}^{-1}\Omega_{xy}.$$

Indeed, remember that

$$\Sigma^{-1} = \begin{pmatrix} \Omega_{xx} & \Omega_{xy} \\ \Omega_{yx} & \Omega_{yy} \end{pmatrix} = \begin{pmatrix} \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & -\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\ -\Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} & \Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \end{pmatrix},$$

and therefore

$$
\begin{aligned}
\Omega_{yx}\Omega_{xx}^{-1}\Omega_{xy} &= -\Sigma_{yy}^{-1}\Sigma_{yx}\underbrace{\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)}_{I}\left(-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\
&= -\Sigma_{yy}^{-1}\Sigma_{yx}\left(-\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}\right) \\
&= \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1} \\
&= \underbrace{\Sigma_{yy}^{-1} + \Sigma_{yy}^{-1}\Sigma_{yx}\left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1}\Sigma_{xy}\Sigma_{yy}^{-1}}_{\Omega_{yy}} - \Sigma_{yy}^{-1} \\
&= \Omega_{yy} - \Sigma_{yy}^{-1}.
\end{aligned}
$$

This means that we have correctly completed the square! To clean up the business of the determinants, note that the determinant of a block matrix factors such that

$$|\Sigma| = \left|\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right| \times |\Sigma_{yy}|.$$
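
Both the block-matrix inverse above and this determinant identity are easy to check numerically for a random positive-definite $\Sigma$. A minimal sketch (the partition sizes and the way the matrix is generated are arbitrary choices):

```r
set.seed(2020)

n <- 3; m <- 2
M <- matrix(rnorm((n + m)^2), n + m)
Sigma <- crossprod(M) + diag(n + m)   # random positive-definite (n + m) x (n + m) matrix

Sxx <- Sigma[1:n, 1:n]
Sxy <- Sigma[1:n, (n + 1):(n + m)]
Syx <- Sigma[(n + 1):(n + m), 1:n]
Syy <- Sigma[(n + 1):(n + m), (n + 1):(n + m)]

schur <- Sxx - Sxy %*% solve(Syy) %*% Syx   # Schur complement of Sigma_yy

det(Sigma)               # determinant of the full matrix ...
det(schur) * det(Syy)    # ... equals the product of the two determinants

solve(Sigma)[1:n, 1:n]   # upper-left block of the inverse ...
solve(schur)             # ... equals the inverse of the Schur complement
```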

Substituting this into our equation for the conditional density, and replacing all the $\Omega$’s with $\Sigma$’s, results in

$$
\begin{aligned}
f(x \mid y) &= (2\pi)^{-n/2} |\Sigma|^{-1/2} |\Sigma_{yy}|^{1/2} \exp\left(-\frac{1}{2}\left(x + \Omega_{xx}^{-1}\Omega_{xy}y\right)^T \Omega_{xx} \left(x + \Omega_{xx}^{-1}\Omega_{xy}y\right)\right) \\
&= (2\pi)^{-n/2} \left|\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right|^{-1/2} \exp\left(-\frac{1}{2}\left(x - \Sigma_{xy}\Sigma_{yy}^{-1}y\right)^T \left(\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)^{-1} \left(x - \Sigma_{xy}\Sigma_{yy}^{-1}y\right)\right).
\end{aligned}
$$

Thus, if $(x, y)$ are jointly normally distributed, then incorporating the information that $Y = y$ leads to a conditional distribution $f(x \mid y)$ that is Gaussian with conditional mean $\Sigma_{xy}\Sigma_{yy}^{-1} y$ and conditional covariance $\Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}$.
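
In code, the conditional mean and covariance are a one-liner each. The sketch below sticks to the zero-mean setting of the derivation; the partition sizes, the covariance matrix, and the observed value of $y$ are all arbitrary.

```r
set.seed(2020)

n <- 2; m <- 2
M <- matrix(rnorm((n + m)^2), n + m)
Sigma <- crossprod(M) + diag(n + m)   # random positive-definite covariance matrix

Sxx <- Sigma[1:n, 1:n]
Sxy <- Sigma[1:n, (n + 1):(n + m)]
Syx <- t(Sxy)
Syy <- Sigma[(n + 1):(n + m), (n + 1):(n + m)]

y <- c(1, -0.5)   # observed value of y (arbitrary)

Sxy %*% solve(Syy) %*% y           # conditional mean of x given y
Sxx - Sxy %*% solve(Syy) %*% Syx   # conditional covariance of x given y
```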

Conclusion

In this blog post, we have seen that the Gaussian distribution has two important properties: it is closed under (a) marginalization and (b) conditioning. For the bivariate case, an accompanying Shiny app hopefully helped to build some intuition about the difference between these two operations.

For the general $p$-dimensional case, we noted that a random variable by definition follows a (multivariate) Gaussian distribution if and only if every linear combination of its components follows a Gaussian distribution. This made it obvious that the Gaussian distribution is closed under marginalization: we simply ignore the components we want to marginalize over in the linear combination.

To show that an arbitrary-dimensional Gaussian distribution is closed under conditioning, we had to rely on a mathematical trick called “completing the square”, as well as certain properties of matrices few mortals can remember. In conclusion, I think we should celebrate the fact that frequent operations such as marginalizing and conditioning do not expel us from the wonderful land of the Gaussians.⁵


I would like to thank Don van den Bergh and Sophia Crüwell for helpful comments on this blog post.

Footnotes

  1. $\Sigma^{-1}$ is the main object of interest in Gaussian graphical models. This is because of another special property of the Gaussian: if the off-diagonal element $(i, j)$ in $\Sigma^{-1}$ is zero, then the variables $X_i$ and $X_j$ are conditionally independent given all the other variables; there is no edge between those two variables in the graph. 

  2. You might enjoy training your intuitions about correlations on http://guessthecorrelation.com/

  3. See also Dennis Lindley’s paper The philosophy of statistics

  4. This follows from the Gaussian integral $\int_{-\infty}^{\infty} e^{-x^2} \, \mathrm{d}x = \sqrt{\pi}$, see here. For more on why $\pi$ and $e$ feature in the Gaussian density, see this.

  5. The land of the Gaussians is vast: its inhabitants — all Gaussian distributions — are also closed under multiplication and convolution. This might make for a future blog post.