2026-06-04
When estimating regression parameters, the Gaussian distribution is commonly used. Often what we need to do is figure out the mean and/or the variance of the Gaussian distribution that best fit the data given a particular model structure.
However, for technical reasons, it is sometimes (actually, more often than we would care to advertize broadly !) very difficult to reach convergence for a particular parameter. Visually, a trace plot would look like this
Even if you run the model for many, many (many !) iterations, it never seems to converge.
What should we do ?
There is a very cool trick that can help us here.
Before we start to discuss this trick, it is important to know that sampling a standard Gaussian distribution (\(\mathcal{N}(0,1)\)) is very straight forward computationally. So, the closer we get to a standard Gaussian distribution the better it is.
If we think about it, the Gaussian distribution can be translated and scaled. If we can find a way to do this mathematically, we can incorporate this into our estimation procedure.
Any ideas how to do this ?
Translation
Translation
Mathematically, translation is the equivalent of adding or subtracting a value from the mean of the distribution.
This means that
\[\mathcal{N}(\mu, \sigma^2)\]
is exactly the same as
\[\mathcal{N}(0, \sigma^2) + \mu\]
Scaling
Scaling
Mathematically, scaling amounts to multiplying the Gaussian distribution by a positive number.
This means that
\[\mathcal{N}(\mu, \sigma^2)\]
is exactly the same as
\[\mathcal{N}(\mu, 1) \times \sigma^2\]
The convergence trick amounts to sampling a standard Gaussian distribution and adjusting its mean and variance from outside the distribution
\[\mathcal{N}(0, 1) \times \sigma^2 + \mu\] When implementing an MCMC in Stan (or any other such software), this trick allows for convergence to be much more efficient.
The example I gave above is straight forward to visualize but is badly adapted to most problems because most regression models require that many parameters be sampled at once.
The example I gave above is straight forward to visualize but is badly adapted to most problems because most regression models require that many parameters be sampled at once.
To do this, we need to work with a multivariate Gaussian distribution.
The example I gave above is straight forward to visualize but is badly adapted to most problems because most regression models require that many parameters be sampled at once.
To do this, we need to work with a multivariate Gaussian distribution.
The good news is that the convergence trick works also with a multivariate Gaussian distribution. However, we need to rely on matrix algebra to translate and scale a multivariate Gaussian distribution properly.
The example I gave above is straight forward to visualize but is badly adapted to most problems because most regression models require that many parameters be sampled at once.
To do this, we need to work with a multivariate Gaussian distribution.
The good news is that the convergence trick works also with a multivariate Gaussian distribution. However, we need to rely on matrix algebra to translate and scale a multivariate Gaussian distribution properly.
To show how our convergence trick works for a multivariate Gaussian distribution, let’s first visualize the two dimensional version of this distribution.
\[\mathcal{MVN}\left( \begin{bmatrix} 0\\ 0\\ \end{bmatrix}, \begin{bmatrix} 2 & -1\\ -1 & 2\\ \end{bmatrix}\right)\]
For a multivariate distribution, a translation amounts to adding a vector of values to make the translation.
Mathematically, this means that
\[\mathcal{MVN}\left( \begin{bmatrix} \mu_1\\ \vdots\\ \mu_n\\ \end{bmatrix}, \mathbf{\Sigma}\right)=\mathcal{MVN}\left( \begin{bmatrix} 0\\ \vdots\\ 0\\ \end{bmatrix}, \mathbf{\Sigma}\right) + \begin{bmatrix} \mu_1\\ \vdots\\ \mu_n\\ \end{bmatrix}\]
Unlike for the univariate Gaussian distribution, scalling for a multivariate distribution is a little trickier to perform… But mathematician and statistician have worked hard to figure out how to do this properly.
However, we need to delve a little deeper into matrix algebra to understand how to scale a multivariate Gaussian distribution.
First recall that a covariance matrix \(\mathbf{\Sigma}\) is a square matrix (i.e. it is an \(n\times n\) matrix).
To scale \(\mathbf{\Sigma}\), we cannot only multiply it by a scalar or even by a single matrix, we need to use the following matrix multiplication
\[\mathbf{L}\mathbf{\Sigma}\mathbf{L}^t\] where \(\mathbf{L}\) is a \(p\times n\) matrix of weight to be used for the scaling (a “scaling” matrix) and \(\mathbf{L}^t\) is its tranpose.
The technical reason why we need to use the equation above is to ensure that the resulting scaled covariance matrix also has an \(n \times n\) dimension.
If only \[\mathbf{L}\mathbf{\Sigma}\] is used the dimension of the resulting matrix also would be \(p \times p\).
Because in our problem weighting (or scaling) matrices is usually done with other covariance matrices, to apply the matrix scaling operation described previously, we need to find a way to square-root a matrix.
This where the genious of André-Louis Cholesky comes to the rescue.
André-Louis Cholesky discovered a matrix decomposition approach probably around 1902 (so when he was 27 years old!), although it was attributed to him a few years after his death.
The Cholesky decomposition allows to decompose a square matrix in a triangular matrix, which, when multiplied by its transposed will allow us to recover the initial matrix.
In coloquial terms, the Cholesky decomposition is the equivalent of a square root for matrices.
In math terms the Cholesky decomposition is defined as \[\mathbf{A} = \mathbf{L}\mathbf{L}^t\]
\[\mathbf{A} = \mathbf{L}\mathbf{L}^t\]
\[ \begin{bmatrix} 1 & 1 & 1\\ 1 & 5 & 5\\ 1 & 5 & 14\\ \end{bmatrix}= \begin{bmatrix} 1 & 0 & 0\\ 1 & 2 & 0\\ 1 & 2 & 3 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 & 1\\ 0 & 2 & 2\\ 0 & 0 & 3 \\ \end{bmatrix} \]
To scale the following multivariate Gaussian distribution \[\mathcal{MVN}\left(\boldsymbol{\mu},\mathbf{\Sigma}\right),\]
The following steps need to be applied
\[\mathbf{L}\cdot \mathcal{MVN}\left(\boldsymbol{\mu},\mathbf{I}\right)\cdot\mathbf{L}^t.\]
Recall, that \(\mathbf{I}\) is the identity matrix.
If we apply translation and scaling together on a multivariate Gaussian distribution, we get
\[\mathbf{L}\cdot \mathcal{MVN}\left(\mathbf{0},\mathbf{I}\right)\cdot\mathbf{L}^t + \boldsymbol{\mu}\] When implementing in Stan some of the models we will discuss in this course, this convergence trick becomes very practical because it can lead a model to convergence much faster than without using this trick.
