
I've been attempting to understand the proof of the Donsker-Varadhan dual form of the Kullback-Leibler divergence. The divergence is defined by $$ \operatorname{KL}(\mu \| \lambda) = \begin{cases} \int_X \log\left(\frac{d\mu}{d\lambda}\right) \, d\mu, & \text{if $\mu \ll \lambda$ and $\log\left(\frac{d\mu}{d\lambda}\right) \in L^1(\mu)$,} \\ \infty, & \text{otherwise,} \end{cases} $$ and its Donsker-Varadhan dual form is $$ \operatorname{KL}(\mu \| \lambda) = \sup_{\Phi \in \mathcal{C}} \left(\int_X \Phi \, d\mu - \log\int_X \exp(\Phi) \, d\lambda\right), $$ where $\mathcal{C}$ is a suitable class of functions (e.g. the bounded measurable ones).
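As a sanity check, the duality can be verified numerically on a three-point space. This is only an illustration: the measures and the helper name `dv_objective` are arbitrary choices, and the supremum is taken over all of $\mathbb{R}^3$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, 0.3, 0.2])
lam = np.array([0.2, 0.3, 0.5])

# KL(mu || lam) via the plain definition (mu << lam holds here)
kl = np.sum(mu * np.log(mu / lam))

def dv_objective(phi):
    # int Phi dmu - log int e^Phi dlam, on a finite space
    return np.dot(mu, phi) - np.log(np.dot(lam, np.exp(phi)))

# Phi* = log(dmu/dlam) attains the supremum:
print(np.isclose(dv_objective(np.log(mu / lam)), kl))   # True
# and no randomly drawn Phi exceeds KL:
print(all(dv_objective(rng.normal(size=3)) <= kl + 1e-12
          for _ in range(1000)))                        # True
```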

Many of the steps in the proof are helpfully outlined in the question "Reconciling Donsker-Varadhan definition of KL divergence with the 'usual' definition", and I can follow along readily.


However, a crucial first step is establishing that (for any measurable $\Phi$ for which the right-hand side is well-defined) $$\tag{1}\label{ineq} \operatorname{KL}(\mu\|\lambda)\ge \int \Phi \, d\mu-\log\int e^{\Phi}\,d\lambda,$$ which is said to be an immediate consequence of Jensen's inequality. I can prove this easily in the case when $\mu \ll \lambda$ and $\lambda \ll \mu$:

$$ \operatorname{KL}(\mu\|\lambda) - \int \Phi \, d\mu = \int \left[ -\log\left(\frac{e^{\Phi}}{d\mu / d\lambda}\right) \right] d\mu \ge -\log \int \frac{e^{\Phi}}{d\mu / d\lambda} \, d\mu = -\log\int\exp(\Phi)\,d\lambda.$$ However, the last equality appears to rely crucially on the existence of $d\lambda/d\mu$, and hence on $\lambda \ll \mu$, which is not assumed by the overall theorem. In the proofs I have been able to find in the machine learning literature, this assumption seems to be made implicitly, but I don't believe it is necessary, and it is very restrictive.
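A quick numerical check (again just an illustrative sketch, with made-up numbers) suggests \ref{ineq} does hold when $\lambda$ charges a point outside the support of $\mu$, i.e. when $\lambda \ll \mu$ fails:

```python
import numpy as np

mu = np.array([0.5, 0.5, 0.0])
lam = np.array([0.25, 0.25, 0.5])   # lam << mu fails at the third point

support = mu > 0
# KL(mu || lam), computed on the support of mu; equals log 2 here
kl = np.sum(mu[support] * np.log(mu[support] / lam[support]))

def dv_objective(phi):
    return np.dot(mu, phi) - np.log(np.dot(lam, np.exp(phi)))

rng = np.random.default_rng(1)
print(all(dv_objective(rng.normal(size=3)) <= kl + 1e-12
          for _ in range(1000)))                        # True
# sending Phi to -infinity off the support approaches KL from below:
print(dv_objective(np.array([np.log(2), np.log(2), -50.0])), kl)
```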


My question is: how can we prove \ref{ineq} without assuming $\lambda \ll \mu$?


2 Answers


Notice that if $D(\mu\|\lambda)$ is infinite then there's nothing to show. I'll assume that it is finite in the following.

First, take a bounded $\Phi$. Then $e^\Phi > 0$ everywhere, $\Phi$ is $\mu$-integrable, and $e^\Phi$ is $\lambda$-integrable. Consider the probability measure $\mathrm{d}\lambda' = \frac{e^\Phi}{Z} \mathrm{d}\lambda,$ where $Z = \int e^\Phi \mathrm{d}\lambda$. Notice that $\lambda \ll \lambda'$ (the density $e^\Phi/Z$ is strictly positive), and since $\mu \ll \lambda$ (by finiteness of $D(\mu\|\lambda)$), a fortiori $\mu \ll \lambda'$, with $ \frac{\mathrm{d}\mu}{\mathrm{d}\lambda'} = Ze^{-\Phi} \frac{\mathrm{d}\mu}{\mathrm{d}\lambda}.$

Now observe that \begin{align} D(\mu \|\lambda') &= \int \log\left(Z{e^{-\Phi}} \frac{\mathrm{d}\mu}{\mathrm{d}\lambda} \right) \mathrm{d}\mu \\ &= \log Z -\int \Phi \mathrm{d}\mu + D(\mu\|\lambda).\end{align} By Gibbs' inequality, $D(\mu \|\lambda') \ge 0,$ and so we conclude that $$ D(\mu\|\lambda) \ge \int\Phi \mathrm{d}\mu - \log Z = \int \Phi \mathrm{d}\mu - \log \int e^\Phi \mathrm{d}\lambda. $$
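As a quick numerical sanity check of this identity (not part of the argument; the toy measures and the particular $\Phi$ below are arbitrary choices):

```python
import numpy as np

mu = np.array([0.5, 0.3, 0.2])
lam = np.array([0.2, 0.3, 0.5])
phi = np.array([1.0, -2.0, 0.5])        # an arbitrary bounded Phi

Z = np.dot(lam, np.exp(phi))            # Z = int e^Phi dlam
lam_prime = np.exp(phi) * lam / Z       # dlam' = (e^Phi / Z) dlam

def kl(p, q):
    return np.sum(p * np.log(p / q))

lhs = kl(mu, lam_prime)
rhs = np.log(Z) - np.dot(mu, phi) + kl(mu, lam)
print(np.isclose(lhs, rhs), lhs >= 0)   # identity holds; Gibbs gives lhs >= 0
```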


This argument generalises directly to $\Phi$ that are bounded below $\lambda$-a.s., since this is sufficient to get $\lambda \ll \lambda'$ (and if $\int e^\Phi \, \mathrm{d}\lambda = \infty$, the right-hand side is $-\infty$ and the bound is trivial).

For a $\Phi$ that is unbounded from below, approximate it from above by a decreasing sequence of functions $\Phi_n$, each bounded below (so that the previous case applies), converging pointwise to $\Phi$ and such that $e^{\Phi_1}$ is $\lambda$-integrable. For instance, we can take $\Phi_n = \max(\Phi, 1-n)$, so that $\Phi_1 = \max(0, \Phi)$, in which case $$ \int e^{\Phi_1} \, \mathrm{d}\lambda = \lambda(\Phi \le 0) + \int e^{\Phi} \mathbf{1}\{\Phi > 0\} \, \mathrm{d}\lambda \le 1 + \int e^{\Phi} \, \mathrm{d}\lambda.$$

Now, by monotone convergence (applied to the increasing non-negative sequence $e^{\Phi_1} - e^{\Phi_n}$), $\int e^{\Phi_n} \mathrm{d}\lambda \to \int e^{\Phi} \mathrm{d}\lambda$. And, of course, since $\Phi \le \Phi_n$, $\int \Phi \mathrm{d}\mu \le \int \Phi_n \mathrm{d}\mu$. But then, for every $n$, \begin{align} D(\mu \|\lambda) &\ge \int \Phi_n \mathrm{d}\mu - \log \int e^{\Phi_n} \mathrm{d}\lambda\\ &\ge \int \Phi \mathrm{d}\mu - \log\int e^{\Phi_n} \mathrm{d}\lambda, \end{align} and the conclusion follows on taking limits (and using the continuity of $\log$).
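Here is a small numerical illustration of the truncation step, using $\Phi(x) = \log x$ on $[0,1]$ with $\lambda$ the Lebesgue measure, so that $\Phi$ is unbounded below and $\int e^{\Phi} \, \mathrm{d}\lambda = 1/2$ (a sketch only; the quadrature is crude):

```python
import numpy as np

# interior grid on (0, 1); crude Riemann sum against Lebesgue measure
x = np.linspace(0.0, 1.0, 200001)[1:-1]
dx = 1.0 / 200000
phi = np.log(x)                           # unbounded below near 0

for n in [1, 2, 5, 10, 20]:
    phi_n = np.maximum(phi, 1 - n)        # Phi_n = max(Phi, 1 - n)
    print(n, np.sum(np.exp(phi_n)) * dx)  # decreases towards 1/2
```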

  • It's been a little while since I mucked around with the limiting arguments to go from bounded to general functions, please let me know if I've made an error. – stochasticboy321 Jul 23 '21 at 17:32

A completely different approach: assume $\mu \ll \lambda$ (otherwise $\operatorname{KL}(\mu\|\lambda) = \infty$ and there is nothing to show) and denote $\theta = \frac{\mathrm{d}\mu}{\mathrm{d}\lambda}$. The observation is that for any measurable $f$ with $f \ge 0$ $\lambda$-a.s., $$ \int \frac{f}{\theta} \mathrm{d}\mu \le \int f \mathrm{d}\lambda.$$

Note that this would be enough, because in the chain of inequalities you produced, we can use $$ \operatorname{KL}(\mu\|\lambda) - \int \Phi \, \mathrm{d}\mu \ge - \log \left( \int \frac{e^\Phi}{\theta} \mathrm{d}\mu\right) \ge -\log \int e^{\Phi} \, \mathrm{d}\lambda,$$ where the first inequality is your Jensen step (which only needs $\mu \ll \lambda$), and the second uses the claim together with the fact that $-\log$ is monotonically decreasing.


To show the claim, first recall that if $E$ is any set such that $\mu(E) = 0,$ then for any measurable function $u$, $$ \int_E u \, \mathrm{d}\mu = 0.$$ This is normally proved, for non-negative $u$, by exploiting the definition of $\int$ as a supremum over integrals of simple functions smaller than $u$ (which are, of course, always finite), and then extended by splitting a general $u$ into its positive and negative parts. Since $\mu$ is a probability measure, we find that if $F$ is any set such that $\mu(F^c) = 0,$ then $\int u \, \mathrm{d}\mu = \int_F u \, \mathrm{d}\mu$.

Now, take $F = \{\theta > 0\}$. Then clearly $\mu(F^c) = 0$. But then $$ \int \frac{f}{\theta}\mathrm{d}\mu = \int \frac{f}{\theta}\mathbf{1}\{\theta > 0\} \cdot \theta \mathrm{d}\lambda = \int f \mathbf{1}\{\theta > 0\} \mathrm{d}\lambda \le \int f \mathrm{d}\lambda,$$ where the second equality exploits that $\frac{f}{\theta} \cdot \theta$ is unambiguously $f$ on $\{\theta > 0\},$ and the last inequality uses the non-negativity of $f$.
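For what it's worth, here is a quick numerical check of the claim on a three-point space where $\theta$ vanishes at a point (so $\lambda \ll \mu$ fails); the specific numbers are arbitrary:

```python
import numpy as np

mu = np.array([0.5, 0.5, 0.0])
lam = np.array([0.25, 0.25, 0.5])
theta = np.array([2.0, 2.0, 0.0])   # theta = dmu/dlam, zero at the third point

rng = np.random.default_rng(2)
F = theta > 0                       # the set F = {theta > 0}
for _ in range(5):
    f = rng.uniform(size=3)         # an arbitrary f >= 0
    lhs = np.sum(f[F] / theta[F] * mu[F])   # int (f / theta) dmu
    rhs = np.dot(f, lam)                    # int f dlam
    print(lhs <= rhs + 1e-12)       # True; the slack is the integral of f dlam over {theta = 0}
```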