Variational autoencoders
Variational inference: main idea
Say we are interested in estimating \(p(z \mid x)\), where \(z\) are latent variables and \(x\) are observed variables. In the usual Bayesian setup,
\[
p(z \mid x) = \frac{p(x \mid z) \, p(z)}{\int p(x \mid z') \, p(z') \, dz'}.
\]
If the integral in the denominator is intractable, we could consider performing MCMC. However, an alternative is variational inference: we try to find a tractable \(q(z \mid x, \phi)\) using optimization (with respect to \(\phi\)) that is similar enough to \(p(z \mid x)\) to provide useful inference. Specifically, we perform \(\min_{\phi} \text{KL}(q(\bullet \mid x, \phi) \;||\; p(\bullet \mid x))\).
The development of the variational inference concept (with references to researchers' names and their articles) is described in the paragraph entitled "Research on variational inference" in Blei (2017, p. 860).
ELBO
Let \(q\) be the density for \(Z\). Let \(p(x, z \mid \theta)\) be the joint density of \(X, Z\).
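Select an arbitrary \(\theta\). Then for all \(x \in \mc{X}\),
\[
\begin{aligned}
\log p(x \mid \theta) &= \log \int p(x, z \mid \theta) \, dz \\
&= \log \int q(z) \, \frac{p(x, z \mid \theta)}{q(z)} \, dz \\
&= \log \E_{Z}\left[ \frac{p(x, z \mid \theta)}{q(z)} \right] \\
&\geq \E_{Z}\left[ \log \frac{p(x, z \mid \theta)}{q(z)} \right] \\
&=: \text{ELBO}(p(x, \bullet \mid \theta), q),
\end{aligned}
\]
where we have made use of Jensen's inequality (\(\log\) is concave). Thus, pointwise in \(x\), \(\log p(x \mid \theta)\) is greater than or equal to \(\text{ELBO}(p(x, \bullet \mid \theta), q)\); this is why the quantity is called the evidence lower bound.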
The next Claim will show how this quantity is useful for us.
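Claim: For fixed \(x\), fixed \(\theta\),
\[
\log p(x \mid \theta) = \text{ELBO}(p(x, \bullet \mid \theta), q) + \text{KL}(q \;||\; p(\bullet \mid x, \theta)).
\]
Proof: Since \(p(z \mid x, \theta) = p(x, z \mid \theta) / p(x \mid \theta)\) and \(\log p(x \mid \theta)\) does not depend on \(z\),
\[
\begin{aligned}
\text{KL}(q \;||\; p(\bullet \mid x, \theta)) &= \E_{Z}\left[ \log q(z) - \log p(z \mid x, \theta) \right] \\
&= \E_{Z}\left[ \log q(z) - \log p(x, z \mid \theta) \right] + \log p(x \mid \theta) \\
&= -\text{ELBO}(p(x, \bullet \mid \theta), q) + \log p(x \mid \theta).
\end{aligned}
\]
\(\square\)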
This implies that for arbitrary fixed \(\theta\) and fixed \(x\), maximizing \(\text{ELBO}(p(x, \bullet \mid \theta), q)\) with respect to \(q\) is equivalent to minimizing \(\text{KL}(q \;||\; p(\bullet \mid x, \theta))\) with respect to \(q\).
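As a sanity check of the Claim, here is a minimal sketch on a toy conjugate model. The model is not part of these notes and is assumed purely for illustration: \(z \sim N(0, 1)\) and \(x \mid z \sim N(z, 1)\), so that \(p(x) = N(0, 2)\) and \(p(z \mid x) = N(x/2, 1/2)\) in closed form. Taking \(q = N(m, s^2)\) (the names `m` and `s` are hypothetical), both the ELBO and the KL divergence have closed forms, and the gap \(\log p(x) - \text{ELBO}\) should match the KL divergence for every choice of \(q\).

```python
import math

def log_evidence(x):
    # log p(x) for the toy model: marginally x ~ N(0, 2).
    return -0.5 * math.log(2 * math.pi * 2.0) - x**2 / 4.0

def elbo(x, m, s):
    # E_q[log p(x|z)] + E_q[log p(z)] + entropy of q, for q = N(m, s^2).
    exp_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s**2)
    exp_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m**2 + s**2)
    entropy = 0.5 * math.log(2 * math.pi * math.e * s**2)
    return exp_log_lik + exp_log_prior + entropy

def kl_to_posterior(x, m, s):
    # KL(N(m, s^2) || N(x/2, 1/2)), closed form for two Gaussians.
    post_mean, post_var = x / 2.0, 0.5
    return (0.5 * math.log(post_var / s**2)
            + (s**2 + (m - post_mean) ** 2) / (2.0 * post_var) - 0.5)

x = 1.3  # arbitrary observed value
for m, s in [(0.0, 1.0), (0.65, math.sqrt(0.5)), (-2.0, 0.3)]:
    gap = log_evidence(x) - elbo(x, m, s)
    print(f"m={m:+.2f} s={s:.3f} gap={gap:.6f} KL={kl_to_posterior(x, m, s):.6f}")
```

The second pair, \((m, s) = (x/2, \sqrt{1/2})\), is the exact posterior, so both the gap and the KL divergence should be zero there up to floating-point error.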
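Lemma: For fixed \(x\), fixed \(\theta\),
\[
\text{ELBO}(p(x, \bullet \mid \theta), q) = \E_{Z}[\log p(x \mid z, \theta)] - \text{KL}(q \;||\; p(\bullet \mid \theta)),
\]
where \(p(\bullet \mid \theta)\) is the prior density of \(Z\).
Proof: Using \(p(x, z \mid \theta) = p(x \mid z, \theta) \, p(z \mid \theta)\),
\[
\begin{aligned}
\text{ELBO}(p(x, \bullet \mid \theta), q) &= \E_{Z}\left[ \log p(x \mid z, \theta) + \log p(z \mid \theta) - \log q(z) \right] \\
&= \E_{Z}[\log p(x \mid z, \theta)] - \E_{Z}\left[ \log q(z) - \log p(z \mid \theta) \right] \\
&= \E_{Z}[\log p(x \mid z, \theta)] - \text{KL}(q \;||\; p(\bullet \mid \theta)).
\end{aligned}
\]
\(\square\)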
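Note that the equality from the Claim above,
\[
\log p(x \mid \theta) = \text{ELBO}(p(x, \bullet \mid \theta), q) + \text{KL}(q \;||\; p(\bullet \mid x, \theta)),
\]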
holds for every possible distribution \(q\) on \(\mc{Z}\), so \(q\) itself could depend on \(x\).
In what follows, we will use \(\text{ELBO}(p(x, \bullet \mid \theta), q)\) as a proxy for \(\log p(x \mid \theta)\).
Using autoencoders in variational inference
I am not sure of the exact genealogy of the autoencoder concept, but the definition from Wikipedia (article on Autoencoders) is clear:
An autoencoder is defined by the following components:
Two sets: the space of encoded messages \(\mc{Z}\); the space of decoded messages \(\mc{X}\). Typically \(\mc{X}\) and \(\mc{Z}\) are Euclidean spaces, that is, \(\mc{X} = \mb{R}^m\) and \(\mc{Z} = \mb{R}^n\), with \(m > n\), and
Two parametrized families of functions: the encoder family \(E_{\phi }: \mc{X} \to \mc{Z}\), parametrized by \(\phi\), and the decoder family \(D_{\theta }: \mc{Z} \to \mc{X}\), parametrized by \(\theta\).
We attempt to perform \(\min_{\theta, \phi} L(\theta, \phi)\), where \(L(\theta, \phi)\) is of the form \((1/N) \sum_{i=1}^N \norm{x_i - D_{\theta}(E_{\phi}(x_i))}^2\).
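To make the definition concrete, the following is a minimal sketch of a linear autoencoder trained by plain gradient descent on the reconstruction loss above. Everything specific here is an assumption made for illustration: the synthetic data, the choice of linear maps for \(E_{\phi}\) and \(D_{\theta}\), the names `W_enc` and `W_dec`, and the step size. Practical autoencoders typically use nonlinear neural networks and an automatic differentiation library instead of hand-written gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed): N points in R^m lying near an n-dimensional
# subspace, with m > n, rescaled to unit overall standard deviation.
N, m, n = 200, 5, 2
X = rng.normal(size=(N, n)) @ rng.normal(size=(n, m)) + 0.01 * rng.normal(size=(N, m))
X = X / X.std()

# Linear encoder E_phi(x) = x @ W_enc and linear decoder D_theta(z) = z @ W_dec.
W_enc = 0.1 * rng.normal(size=(m, n))   # phi
W_dec = 0.1 * rng.normal(size=(n, m))   # theta

def loss(W_enc, W_dec):
    # (1/N) * sum_i ||x_i - D_theta(E_phi(x_i))||^2
    R = X - X @ W_enc @ W_dec
    return np.sum(R**2) / N

initial_loss = loss(W_enc, W_dec)
step = 0.01
for _ in range(2000):
    Z = X @ W_enc                            # encoded messages
    R = Z @ W_dec - X                        # reconstruction residuals
    grad_dec = (2.0 / N) * Z.T @ R           # dL/dW_dec
    grad_enc = (2.0 / N) * X.T @ R @ W_dec.T # dL/dW_enc
    W_dec = W_dec - step * grad_dec
    W_enc = W_enc - step * grad_enc
final_loss = loss(W_enc, W_dec)

print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Because the data are generated close to a 2-dimensional subspace and the code uses \(n = 2\), the reconstruction loss should drop substantially during training.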
"The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction, to generate lower-dimensional embeddings for subsequent use by other machine learning algorithms" (Wikipedia, article on Autoencoders).
Kingma and Welling (2014) introduced the autoencoder concept to the variational inference problem, as follows.
We will let \(q(z) = q(z \mid x, \phi)\) where \(\phi\) are some parameters and where \(q(z \mid x, \phi)\) is a neural network model (the "encoder"). We will also have a neural network model for \(p(x \mid z, \theta)\) (the "decoder").
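We write the result of our Lemma using these new assumptions:
\[
\text{ELBO}(p(x, \bullet \mid \theta), q(\bullet \mid x, \phi)) = \E_{Z}[\log p(x \mid z, \theta)] - \text{KL}(q(\bullet \mid x, \phi) \;||\; p(\bullet \mid \theta)),
\]
where \(Z\) has density \(q(z \mid x, \phi)\).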
We maximize the ELBO with respect to \((\theta, \phi)\). Note the following:
- The term \(\E_{Z}[\log p(x \mid z, \theta)]\) is larger the more mass \(q(z \mid x, \phi)\) places on values of \(z\) for which \(p(x \mid z, \theta)\) is high.
- The larger \(\text{KL}(q(\bullet \mid x, \phi) \;||\; p(\bullet \mid \theta))\) is, the higher the penalty for \(q(\bullet \mid x, \phi)\) diverging from our prior \(p(\bullet \mid \theta)\).
Choosing specific functional forms
We will attempt to maximize
\[
\E_{Z}[\log p(x \mid z, \theta)] - \text{KL}(q(\bullet \mid x, \phi) \;||\; p(\bullet \mid \theta)),
\]
which requires us to be able both to evaluate the above two terms and to take their derivatives with respect to the parameters (recall gradient descent, where we iterate \(w_{k+1} = w_k - t_k \nabla f(w_k)\), where \(w_k\) is the \(k\)th value of \((\theta, \phi)\) and, since we are maximizing, \(f\) is the negative of our above expression).
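The iteration can be sketched on a toy problem where everything is available in closed form. The model below is an assumption made only for illustration and is not the neural network setting of these notes: \(z \sim N(0, 1)\), \(x \mid z \sim N(z, 1)\), and \(q = N(m, s^2)\) with \(s = e^{r}\), so the variational parameters are \((m, r)\) and there is no \(\theta\) to learn. For this model the ELBO equals a constant minus \(\tfrac{1}{2}((x - m)^2 + m^2) + e^{2r} - r\), and gradient descent is applied to its negative.

```python
import math

x = 1.3            # a single observed value (arbitrary choice)
m, r = 0.0, 0.0    # q = N(m, s^2) with s = exp(r); these play the role of phi
t = 0.1            # constant step size t_k (assumed)

def neg_elbo_grad(m, r):
    # ELBO(m, r) = const - 0.5*((x - m)^2 + m^2) - exp(2r) + r.
    # With f = -ELBO, grad f = (2m - x, 2*exp(2r) - 1).
    return 2.0 * m - x, 2.0 * math.exp(2.0 * r) - 1.0

for k in range(200):
    grad_m, grad_r = neg_elbo_grad(m, r)
    # w_{k+1} = w_k - t_k * grad f(w_k)
    m, r = m - t * grad_m, r - t * grad_r

fitted_mean, fitted_var = m, math.exp(2.0 * r)
print(fitted_mean, fitted_var)  # compare with the exact posterior N(x/2, 1/2)
```

In this toy model the exact posterior is \(N(x/2, 1/2)\), so the fitted mean and variance can be compared directly against \(x/2\) and \(1/2\).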
Regarding evaluation: \(p(x \mid z, \theta)\) and \(q(z \mid x, \phi)\) will be rather complicated distributions since each is a neural network model, so it is unlikely that the two integrals can be evaluated in closed form. However, each term is an expected value, that is, an integral against a density, and can therefore be approximated by Monte Carlo sampling: we average the integrand over draws from that density.
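A minimal sketch of such a Monte Carlo approximation follows. It uses an assumed toy decoder \(x \mid z \sim N(z, 1)\) and an assumed \(q = N(m, s^2)\), chosen only because the expectation \(\E_{Z}[\log p(x \mid z)]\) then has a closed form to compare against; the names `log_lik`, `m`, and `s` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x, m, s = 1.3, 0.4, 0.8   # observed value and parameters of q = N(m, s^2)

def log_lik(x, z):
    # log p(x | z) for the toy decoder x | z ~ N(z, 1).
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2

# Draw z_1, ..., z_N from q and average the integrand over the draws.
z = rng.normal(loc=m, scale=s, size=100_000)
mc_estimate = log_lik(x, z).mean()

# Closed form, available only because this toy model is Gaussian.
exact = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - m) ** 2 + s**2)

print(mc_estimate, exact)
```

With neural network models no such closed form is available, which is why the average over draws is used in its place.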
Regarding differentiation: we wish to find
\[
\nabla_{\phi} \E_{Z}[\log p(x \mid z, \theta)],
\]
where \(Z\) has density \(q(z \mid x, \phi)\).
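Using the fact that
\[
\nabla_{\phi} q(z \mid x, \phi) = q(z \mid x, \phi) \, \nabla_{\phi} \log q(z \mid x, \phi),
\]
and assuming we can exchange the derivative and integral signs in the first step of the following, we have
\[
\begin{aligned}
\nabla_{\phi} \int \log p(x \mid z, \theta) \, q(z \mid x, \phi) \, dz &= \int \log p(x \mid z, \theta) \, \nabla_{\phi} q(z \mid x, \phi) \, dz \\
&= \int \log p(x \mid z, \theta) \, q(z \mid x, \phi) \, \nabla_{\phi} \log q(z \mid x, \phi) \, dz \\
&= \E_{Z}\left[ \log p(x \mid z, \theta) \, \nabla_{\phi} \log q(z \mid x, \phi) \right].
\end{aligned}
\]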
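If we wished to apply Monte Carlo sampling, we would approximate this via
\[
\frac{1}{N} \sum_{n=1}^N \log p(x \mid z_n, \theta) \, \nabla_{\phi} \log q(z_n \mid x, \phi),
\]
where \(z_1, \ldots, z_N\) are draws from \(q(z \mid x, \phi)\).
According to Kingma and Welling (2014, p. 3), this estimator "exhibits very high variance... and is impractical for our purposes." This is presumably because of the presence of the \(\log p(x \mid z_n, \theta)\) term, which multiplies \(\nabla_{\phi} \log q(z_n \mid x, \phi)\) in every summand.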
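The estimator can be sketched on a toy problem, under assumptions that are not from these notes: a decoder \(x \mid z \sim N(z, 1)\), an encoder \(q = N(m, s^2)\), and differentiation with respect to \(m\) only, so that \(\nabla_m \log q(z) = (z - m)/s^2\) and the exact gradient of \(\E_{Z}[\log p(x \mid z)]\) is \(x - m\). The sketch prints the estimate, the exact value, and the empirical variance of the individual summands.

```python
import numpy as np

rng = np.random.default_rng(0)
x, m, s = 1.3, 0.4, 0.8   # observed value and parameters of q = N(m, s^2)

def log_lik(x, z):
    # log p(x | z) for the toy decoder x | z ~ N(z, 1).
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2

z = rng.normal(loc=m, scale=s, size=200_000)   # draws z_n from q
score = (z - m) / s**2                         # d/dm log q(z_n)
terms = log_lik(x, z) * score                  # one summand per draw
score_estimate = terms.mean()
exact_grad = x - m                             # d/dm of -0.5*((x - m)^2 + s^2)

print(score_estimate, exact_grad, terms.var())
```

Even in this one-dimensional example the variance of a single summand is much larger than the squared gradient being estimated, so many draws are needed for a stable estimate.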
Reparameterization
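Let \(\varepsilon\) be a random variable with density \(p(\varepsilon)\). Assume there exists a function \(g_{\phi}(\varepsilon, x)\) such that \(Z = g_{\phi}(\varepsilon, x)\). Then, for fixed \(x\) and any measurable set \(A \subseteq \mc{Z}\),
\[
\mu_Z(A) = P(Z \in A) = P(g_{\phi}(\varepsilon, x) \in A) = \mu_{\varepsilon}(g_{\phi}^{-1}(A)),
\]
where \(g_{\phi}^{-1}(A)\) denotes the preimage of \(A\) under \(\varepsilon \mapsto g_{\phi}(\varepsilon, x)\), so \(\mu_Z = \mu_{\varepsilon} \circ g_{\phi}^{-1}\). Therefore
\[
\E_{Z}[\log p(x \mid z, \theta)] = \int \log p(x \mid z, \theta) \, d\mu_Z(z) = \int \log p(x \mid g_{\phi}(\varepsilon, x), \theta) \, d\mu_{\varepsilon}(\varepsilon).
\]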
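Further, since we are assuming that \(\varepsilon\) and \(Z\) have density functions with respect to Lebesgue measure, \(d\mu_Z(z) = q(z \mid x, \phi) \, dz\) and \(d\mu_{\varepsilon}(\varepsilon) = p(\varepsilon) \, d\varepsilon\), it follows that
\[
\int \log p(x \mid z, \theta) \, q(z \mid x, \phi) \, dz
\]
is
\[
\int \log p(x \mid g_{\phi}(\varepsilon, x), \theta) \, p(\varepsilon) \, d\varepsilon = \E_{\varepsilon}\left[ \log p(x \mid g_{\phi}(\varepsilon, x), \theta) \right].
\]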
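Since \(\phi\) does not appear in the formula for the density \(p(\varepsilon)\), and again assuming we can exchange the derivative and integral signs, we have
\[
\nabla_{\phi} \E_{Z}[\log p(x \mid z, \theta)] = \nabla_{\phi} \int \log p(x \mid g_{\phi}(\varepsilon, x), \theta) \, p(\varepsilon) \, d\varepsilon = \E_{\varepsilon}\left[ \nabla_{\phi} \log p(x \mid g_{\phi}(\varepsilon, x), \theta) \right].
\]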
Note that there is now only one term inside the expectation rather than a product of two terms. In practice, it has been observed that this leads to greater stability of the Monte Carlo estimate of the desired quantity than the "naive" estimate we had before performing this reparameterization.
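The difference in stability can be illustrated with a minimal sketch under assumptions that are not from these notes: a toy decoder \(x \mid z \sim N(z, 1)\), an encoder \(q = N(m, s^2)\), and the Gaussian reparameterization \(g_{\phi}(\varepsilon, x) = m + s \varepsilon\) with \(\varepsilon \sim N(0, 1)\) (in this toy case \(g\) does not actually use \(x\)). Differentiating with respect to \(m\) only, the reparameterized summand is \(\nabla_m \log p(x \mid m + s\varepsilon) = x - (m + s\varepsilon)\), while the "naive" summand is \(\log p(x \mid z) \, (z - m)/s^2\). Both target the exact gradient \(x - m\).

```python
import numpy as np

rng = np.random.default_rng(0)
x, m, s = 1.3, 0.4, 0.8   # observed value and parameters of q = N(m, s^2)
n_draws = 200_000

def log_lik(x, z):
    # log p(x | z) for the toy decoder x | z ~ N(z, 1).
    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - z) ** 2

eps = rng.normal(size=n_draws)   # eps ~ p(eps) = N(0, 1), which is free of phi
z = m + s * eps                  # Z = g_phi(eps, x)

# Reparameterized summands: d/dm log p(x | m + s*eps) = x - (m + s*eps).
reparam_terms = x - z

# "Naive" summands for comparison: log p(x | z) * d/dm log q(z).
naive_terms = log_lik(x, z) * (z - m) / s**2

exact_grad = x - m
print("exact gradient:     ", exact_grad)
print("reparameterized:    ", reparam_terms.mean(), "variance", reparam_terms.var())
print("naive (score-based):", naive_terms.mean(), "variance", naive_terms.var())
```

In this toy setting the per-draw variance of the reparameterized summands is \(s^2\), which is noticeably smaller than that of the naive summands; this is one example only, and the size of the gap in a real model will depend on the encoder and decoder.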
References
Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. (2017). "Variational inference: a review for statisticians". Journal of the American Statistical Association, Vol. 112, Iss. 518.
Kingma, Diederik P. and Max Welling. (2014). "Auto-Encoding Variational Bayes". International Conference on Learning Representations (ICLR) 2014.
Wikipedia authors. (2026). "Autoencoder". In Wikipedia.
How to cite this article
Wayman, Eric Alan. (2026). Variational autoencoders. Eric Alan Wayman's technical notes. https://ericwayman.net/notes/variational-autoencoders/
@misc{wayman2026variational-autoencoders,
title={Variational autoencoders},
author={Wayman, Eric Alan},
journal={Eric Alan Wayman's technical notes},
url={https://ericwayman.net/notes/variational-autoencoders/},
year={2026}
}