KL Divergence of Gaussians
Preliminary: KL Divergence
Kullback–Leibler (KL) Divergence, also known as relative entropy or I-divergence, quantifies how one probability distribution differs from another. For discrete distributions, we denote the KL Divergence of $P$ from $Q$ by:
$$
D_\mathrm{KL} \left( P \Vert Q \right) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
$$
For distributions of a continuous random variable, with densities $p$ and $q$:
$$
D_\mathrm{KL} \left( P \Vert Q \right) = \int p(x) \log \frac{p(x)}{q(x)} \mathrm{d}x
$$
Note that KL Divergence is not symmetric, since in general $D_\mathrm{KL}(P \Vert Q) \neq D_\mathrm{KL}(Q \Vert P)$; it is therefore not a distance metric.
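A quick numerical illustration (a minimal sketch; the distributions `P` and `Q` below are made up) shows this asymmetry directly:

```python
import torch

# Two made-up distributions over three outcomes.
P = torch.tensor([0.5, 0.3, 0.2])
Q = torch.tensor([0.2, 0.5, 0.3])

def kl(p, q):
    """D_KL(p || q) for discrete distributions, in nats."""
    return torch.sum(p * torch.log(p / q))

print(kl(P, Q), kl(Q, P))  # two different values, illustrating the asymmetry
```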
KL Divergence of Gaussians
Gaussian Distributions
Recall that
$$
\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2 \pi)^k \vert \Sigma \vert}} \exp \left\{ - \frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}
$$
There is also a useful trick for handling the exponent: since $(x - \mu)^T \Sigma^{-1} (x - \mu) \in \mathbb{R}$, it equals its own trace, and by the cyclic property of the trace we have:
$$
(x - \mu)^T \Sigma^{-1} (x - \mu) = \mathrm{tr} \left( (x - \mu)^T \Sigma^{-1} (x - \mu) \right) = \mathrm{tr} \left( (x - \mu) (x - \mu)^T \Sigma^{-1} \right)
$$
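As a sanity check, here is a small sketch (with arbitrary values) verifying that the quadratic form equals its trace-trick rewrite:

```python
import torch

torch.manual_seed(0)
k = 4
x, mu = torch.randn(k), torch.randn(k)
A = torch.randn(k, k)
Sigma = A @ A.T + k * torch.eye(k)                        # an arbitrary positive-definite covariance
Sigma_inv = torch.linalg.inv(Sigma)

d = x - mu
quad = d @ Sigma_inv @ d                                  # the scalar quadratic form
trace_form = torch.trace(torch.outer(d, d) @ Sigma_inv)   # the same value via the trace trick
print(torch.isclose(quad, trace_form))                    # tensor(True)
```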
KL Divergence of two Gaussians
Consider two Gaussian distributions:
$$
p(x) = \mathcal{N}(x; \mu_1, \Sigma_1), \quad q(x) = \mathcal{N}(x; \mu_2, \Sigma_2)
$$
We have
$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \mathbb{E}_p \left[ \log p - \log q \right] \\
&= \frac{1}{2} \mathbb{E}_p \left[ \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right] \\
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) + (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right]
\end{aligned}
$$
Using the trace trick mentioned above, we have
$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right) \right] - \frac{1}{2} \mathbb{E}_p \left[ \mathrm{tr} \left( (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right) \right] \\
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \Sigma_2^{-1} \right] \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \Sigma_1^{-1} \right] \right) \\
&= \frac{1}{2} \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right) - \frac{1}{2} \mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \right] \Sigma_1^{-1} \right)
\end{aligned}
$$
Interestingly, $\mathbb{E}_p \left[ (x - \mu_1) (x - \mu_1)^T \right] = \Sigma_1$, so the trace in the last term above is equal to:
$$
\mathrm{tr} \left( \Sigma_1 \Sigma_1^{-1} \right) = k
$$
But the second term involves two different distributions: the expectation is taken under $p$, while the centering uses $\mu_2$. Using the Gaussian identity from the Matrix Cookbook (Petersen & Pedersen, 2008):
$$
\begin{aligned}
\mathrm{tr} \left( \mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] \Sigma_2^{-1} \right)
&= (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right)
\end{aligned}
$$
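For completeness, this identity follows by decomposing $x - \mu_2 = (x - \mu_1) + (\mu_1 - \mu_2)$; the cross terms vanish because $\mathbb{E}_p \left[ x - \mu_1 \right] = 0$:
$$
\mathbb{E}_p \left[ (x - \mu_2) (x - \mu_2)^T \right] = \Sigma_1 + (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T,
$$
and taking $\mathrm{tr} \left( \cdot \, \Sigma_2^{-1} \right)$ of both sides yields the expression above.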
And finally:
$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \frac{1}{2} \left[ (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) - k + \log \frac{\vert \Sigma_2 \vert}{\vert \Sigma_1 \vert} + \mathrm{tr} \left( \Sigma_2^{-1} \Sigma_1 \right) \right]
\end{aligned}
$$
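As a sketch to double-check the closed form (the helper `kl_gaussians` and the example parameters are mine, not from the original post), we can compare it against `torch.distributions.kl_divergence`:

```python
import torch
from torch.distributions import MultivariateNormal, kl_divergence

def kl_gaussians(mu1, Sigma1, mu2, Sigma2):
    """Closed-form D_KL(N(mu1, Sigma1) || N(mu2, Sigma2)) following the result above."""
    k = mu1.shape[0]
    Sigma2_inv = torch.linalg.inv(Sigma2)
    diff = mu1 - mu2
    return 0.5 * (diff @ Sigma2_inv @ diff - k
                  + torch.logdet(Sigma2) - torch.logdet(Sigma1)
                  + torch.trace(Sigma2_inv @ Sigma1))

mu1, mu2 = torch.tensor([0.0, 1.0]), torch.tensor([1.0, -1.0])
Sigma1, Sigma2 = torch.diag(torch.tensor([1.0, 2.0])), torch.diag(torch.tensor([2.0, 0.5]))
p, q = MultivariateNormal(mu1, Sigma1), MultivariateNormal(mu2, Sigma2)
print(kl_gaussians(mu1, Sigma1, mu2, Sigma2), kl_divergence(p, q))  # the two values should match
```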
In practice, both covariances are often diagonal, $\Sigma_j = \mathrm{diag}(\sigma_{j,1}, \dots, \sigma_{j,k})$ with $\sigma_{j,i}$ denoting variances, and the KL Divergence is averaged over a batch $\mathcal{X}$:
$$
\begin{aligned}
D_\mathrm{KL} (p \Vert q)
&= \frac{1}{2 \vert \mathcal{X} \vert} \sum_{x \in \mathcal{X}} \sum_{i=1}^k \left[ -1 + \frac{(\mu_{1, i} - \mu_{2, i})^2}{\sigma_{2, i}} + \log \sigma_{2, i} - \log \sigma_{1, i} + \frac{\sigma_{1, i}}{\sigma_{2, i}} \right]
\end{aligned}
$$
```python
mu1, logvar1, mu2, logvar2 = output
```
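A minimal PyTorch sketch of this batch-averaged term (not the original post's code; it assumes `mu1`, `logvar1`, `mu2`, `logvar2` are `(batch, k)` tensors of means and element-wise log-variances):

```python
import torch

def kl_diag_gaussians(mu1, logvar1, mu2, logvar2):
    """Batch-averaged D_KL between two diagonal Gaussians, given means and log-variances."""
    var1, var2 = logvar1.exp(), logvar2.exp()
    per_dim = -1 + (mu1 - mu2).pow(2) / var2 + logvar2 - logvar1 + var1 / var2
    return 0.5 * per_dim.sum(dim=1).mean()
```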
KL Divergence from $\mathcal{N}(0, I)$
Recall that in models like VAE and cVAE, the evidence lower bound (VLB) is
$$
\mathrm{VLB} = \mathbb{E}_{q_\phi(z \vert x)} \left[ \log p_\theta (x \vert z) \right] - D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right)
$$
We want the optimal parameters that
$$
(\phi^*, \theta^*) = \argmax_{\phi, \theta} \mathrm{VLB}
$$
Note that, to align with most of the literature, the KL term here is $D_\mathrm{KL}(q_\phi \Vert p)$, so the roles from the previous section are swapped: $q_\phi$ takes the place of $p$ and the prior takes the place of $q$.
We assume $p(z) = \mathcal{N}(z; 0, I)$ and $q_\phi(z \vert x) = \mathcal{N}(z; \mu, \Sigma)$ with $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_k)$ (each $\sigma_i$ a variance). Plugging $\mu_1 = \mu$, $\Sigma_1 = \Sigma$, $\mu_2 = 0$, $\Sigma_2 = I$ into the general result gives:
$$
\begin{aligned}
D_\mathrm{KL} \left( q_\phi(z \vert x) \Vert p(z) \right)
&= \frac{1}{2} \left[ \Vert \mu \Vert_2^2 + \mathrm{tr}(\Sigma) - k - \log \vert \Sigma \vert \right] \\
&= - \frac{1}{2} \sum_{i=1}^k \left[ 1 + \log \sigma_i - \mu_i^2 - \sigma_i \right]
\end{aligned}
$$
In code, we get `mu` and `logvar` (the element-wise log-variances $\log \sigma_i$), and the KL Divergence loss term, averaged over the batch, can be computed with:
```python
mu, logvar = output
```
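A minimal sketch under the same assumptions (`mu` and `logvar` are `(batch, k)` tensors; the helper name is mine), with a cross-check against PyTorch's built-in KL for diagonal Gaussians:

```python
import torch
from torch.distributions import Normal, Independent, kl_divergence

def kl_to_standard_normal(mu, logvar):
    """Batch-averaged D_KL(N(mu, diag(exp(logvar))) || N(0, I))."""
    return -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1))

mu, logvar = torch.randn(8, 4), torch.randn(8, 4)
q = Independent(Normal(mu, (0.5 * logvar).exp()), 1)             # q_phi(z|x)
p = Independent(Normal(torch.zeros(8, 4), torch.ones(8, 4)), 1)  # p(z) = N(0, I)
print(kl_to_standard_normal(mu, logvar), kl_divergence(q, p).mean())  # should agree
```

In training, this KL term is added to the reconstruction term to form the loss, i.e. the negative VLB.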
KL Divergence by Sampling
Pending…
References
- Mr.Esay’s Blog: KL Divergence between 2 Gaussian Distributions
- Kingma, D. P., & Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Sohn, K., Lee, H., & Yan, X. (2015). Learning structured output representation using deep conditional generative models. Advances in neural information processing systems, 28.
- Petersen, K. B., & Pedersen, M. S. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.
- Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The annals of mathematical statistics, 22(1), 79-86.
- Csiszár, I. (1975). I-divergence geometry of probability distributions and minimization problems. The annals of probability, 146-158.