I'm trying to prove the chain rule for relative entropy using measure theory, and the following problem showed up. Assume that $\mathcal X_1, \mathcal X_2$ are both Polish.

Let $\mu_1$, $\mu_2$ and $\nu$ be probability measures on $\mathcal X_1$, $\mathcal X_2$ and $\mathcal X_1 \times \mathcal X_2$, respectively (each space equipped with its Borel $\sigma$-algebra).

Now, one can use the disintegration theorem to state that $d\nu(x,y) = d\nu_1(x) d\nu_2^x(y)$ for $x \in \mathcal X_1$ and $y \in \mathcal X_2$. Note that, given a joint distribution $(X,Y) \sim \nu$, the marginal distribution of the first coordinate is $\nu_1$, and $\nu_2^x$ is the conditional probability $\nu_2^x(A) = P(Y \in A \mid X = x)$.

With all that being said, if $\nu \ll \mu_1 \otimes \mu_2$, is it true that $\nu_1 \ll \mu_1$ and $\nu_2^x \ll \mu_2$? In other words, is the conditional probability measure absolutely continuous with respect to $\mu_2$ for every $x \in \mathcal X_1$?

  • If $\mu_1(A)=0$ but $\nu_1(A)\neq 0,$ then we obtain $\nu(A\times \mathcal{X}_2)=\nu_1(A)\neq 0.$ This contradicts $\nu\ll\mu_1\otimes\mu_2,$ since $(\mu_1\otimes\mu_2)(A\times \mathcal{X}_2)=0.$ – Raghav Sep 08 '20 at 19:51
  • For $\nu_2^x,$ I guess one should be able to obtain $\nu_2^x\ll\mu_2$ for almost every $x\in \mathcal{X}_1,$ but I am not sure and I don't see a proof off the top of my head. – Raghav Sep 08 '20 at 19:53
  • Thanks for the comment. The absolute continuity for the marginals is direct, as you point out. Now, the trouble with $\nu_2^x$ is that $\mu_1 (\{x\}) = 0$, so one cannot say that $ \mu_1(A) \mu(x) = 0 \iff \nu_2^x (A) = 0$. – Davi Barreira Sep 08 '20 at 19:58
  • You need a little more structure in order for things to go through. In general, the kernels that come from the disintegration theorem need not even satisfy things like the measurability of $\{x: \nu_2^x \ll \mu_2\}$. See section 2.6 in these notes for a sufficient structure that does away with these issues and is rich enough for information theory. They also mention (remark 2.4) the use of Doob's version of the R-N theorem, which works if the space $\mathcal{X}_2$ is Polish. – stochasticboy321 Sep 08 '20 at 20:19
  • I forgot to add the assumption that both spaces are Polish, as this is usually necessary for using the Disintegration Theorem (actually there are conditions a bit weaker than this). Thanks for the notes! I will take a look. But with this added structure, can you actually prove what is stated in the question? – Davi Barreira Sep 08 '20 at 20:25

1 Answer

Let $\mathcal X\times\mathcal Y$ be a measurable product space and $\mu$, $\nu$ probability measures on $\mathcal X\times\mathcal Y$ such that $\nu\ll\mu$. Then there exists a Radon-Nikodym derivative $R:\mathcal X\times\mathcal Y\rightarrow\mathbb R_{\ge 0}$ of $\nu$ with respect to $\mu$. If $\mathcal Y$ is a Borel space (equivalently countably generated, which covers Polish spaces with their Borel algebra, see here and here), a disintegration exists (Theorem 3.4 in Kallenberg's book, definition of Borel is the "same", above Theorem 1.8), and this is why we assume that $\mathcal Y$ is a Borel space.

Let the probability measure $\mu_1$ on $\mathcal X$ and the Markov kernel $\mu_x$, $x\in\mathcal X$, be a disintegration of $\mu$, as given by Theorem 3.4 above. Recall that $R_1(x)=\int R(x,y)\mu_x(\mathrm d y)$ is measurable. So, if we define $R_{2,x}(y)=0$ whenever $R_1(x)=0$ and $R_{2,x}(y)=R(x,y)/R_{1}(x)$ otherwise, then $(x,y)\mapsto R_{2,x}(y)$ is measurable on $\mathcal X\times\mathcal Y$ and, for each fixed $x$, $y\mapsto R_{2,x}(y)$ is measurable on $\mathcal Y$ (Lemma 1.28 in Kallenberg). Notice that $\int R_{2,x}(y)\mu_x(\mathrm d y)=1$ by the definition of $R_1$, whenever $R_1(x)\neq 0$. Also, notice that $\int R_1(x)\mu_1(\mathrm d x)=\int\int R(x,y)\mu_x(\mathrm d y)\mu_1(\mathrm d x)=\int R(x,y)\mu(\mathrm dx,\mathrm dy)=1$, so $R_1$ is the Radon-Nikodym derivative of a probability measure with respect to $\mu_1$. So, let $\nu_1$ be given by $R_1$ with respect to $\mu_1$.

Since $\{x:R_1(x)=0\}$ is clearly a null set of $\nu_1$, we have $R_{2,x}(y)=R(x,y)/R_1(x)$ for $\nu_1$-almost every $x$. For $x$ with $R_1(x)\neq 0$, let $\nu_x$ be given by the Radon-Nikodym derivative $R_{2,x}$ with respect to $\mu_x$. Otherwise, let $y_0\in\mathcal Y$ be some fixed point and let $\nu_x$ be the one-point mass on $y_0$. Notice that $\nu_x$ is a Markov kernel since for any fixed measurable $\mathcal E\subseteq\mathcal Y$ the map $g:\mathcal X\rightarrow[0,1]$, $x\mapsto\nu_x(\mathcal E)$ is given by $g(x)=\int_{\mathcal E} R_{2,x}(y)\mu_x(\mathrm dy)$ on $\{R_1(x)\neq 0\}$ and $g(x)=\mathbf 1\{y_0\in\mathcal E\}$ otherwise, both of which are measurable.

Now, we show that $\nu_1$, $\nu_x$ is a disintegration of $\nu$. Notice that, for any measurable $f:\mathcal X\times\mathcal Y\rightarrow\mathbb R_{\ge 0}$, \begin{align*} \int f(x,y)\nu(\mathrm dx,\mathrm dy) &=\int R(x,y)f(x,y)\mu(\mathrm dx,\mathrm dy)=\int R_1(x)\int R_{2,x}(y)f(x,y)\mu_x(\mathrm{d}y)\mu_1(\mathrm{d}x)\\ &=\int\int f(x,y)\nu_x(\mathrm dy)\nu_1(\mathrm dx). \end{align*} The second step is justified by the fact that the $\{R_1(x)=0\}$ contribution to the integral is $0$. This shows that $\nu_1$, $\nu_x$ is a disintegration of $\nu$.

Finally, by definition $\nu_x$ is absolutely continuous with respect to $\mu_x$ for $\nu_1$-almost every $x$, which is all that we need. In the setting of the question, $\mu=\mu_1\otimes\mu_2$, so we can take $\mu_x=\mu_2$ for every $x$, which gives $\nu_x\ll\mu_2$ for $\nu_1$-almost every $x$.
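
To sanity-check the construction, here is a minimal sketch on a finite space, where every integral is a finite sum. The tables `mu` and `nu`, the helper `disintegrate`, and the choice $y_0=0$ are hypothetical and chosen only for illustration; the point is just to see $R_1$, $R_{2,x}$, $\nu_1$ and $\nu_x$ computed explicitly, including a point $x$ with $R_1(x)=0$.

```python
from fractions import Fraction as F

# Hypothetical finite example: X = {0, 1, 2}, Y = {0, 1}.
# mu and nu are joint probability tables indexed as table[x][y], with nu << mu.
mu = [[F(1, 4), F(1, 4)],
      [F(1, 8), F(1, 8)],
      [F(1, 4), F(0)]]
nu = [[F(1, 2), F(0)],
      [F(0), F(0)],
      [F(1, 2), F(0)]]
xs, ys = range(len(mu)), range(len(mu[0]))

def disintegrate(joint):
    """Return (marginal on X, kernel x -> conditional law on Y)."""
    marginal = [sum(row) for row in joint]
    # Where the marginal vanishes the kernel is arbitrary: use a point mass at y0 = 0.
    kernel = [[p / m for p in row] if m != 0 else [F(1)] + [F(0)] * (len(row) - 1)
              for row, m in zip(joint, marginal)]
    return marginal, kernel

mu1, mu_x = disintegrate(mu)

# Radon-Nikodym derivative R = d(nu)/d(mu), set to 0 on mu-null points.
R = [[nu[x][y] / mu[x][y] if mu[x][y] != 0 else F(0) for y in ys] for x in xs]

# R_1(x) = sum_y R(x, y) mu_x(y), and R_{2,x}(y) = R(x, y) / R_1(x) (0 if R_1(x) = 0).
R1 = [sum(R[x][y] * mu_x[x][y] for y in ys) for x in xs]
R2 = [[R[x][y] / R1[x] if R1[x] != 0 else F(0) for y in ys] for x in xs]

# nu_1 has density R_1 w.r.t. mu_1; nu_x has density R_{2,x} w.r.t. mu_x,
# or is the point mass at y0 = 0 when R_1(x) = 0.
nu1 = [R1[x] * mu1[x] for x in xs]
nu_x = [[R2[x][y] * mu_x[x][y] for y in ys] if R1[x] != 0
        else [F(1)] + [F(0)] * (len(ys) - 1) for x in xs]

# (nu_1, nu_x) disintegrates nu, and nu_x << mu_x for nu_1-almost every x.
is_disintegration = all(nu1[x] * nu_x[x][y] == nu[x][y] for x in xs for y in ys)
is_abs_continuous = all(nu_x[x][y] == 0
                        for x in xs for y in ys
                        if nu1[x] != 0 and mu_x[x][y] == 0)
print(is_disintegration, is_abs_continuous)
```

Exact rationals are used so that the equalities are checked without floating-point tolerance. The row $x=1$ has $R_1(x)=0$, so it exercises the point-mass branch; it is a $\nu_1$-null point, which is why it is excluded from the absolute-continuity check.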

I run into problems like this all the time (conditional probabilities, conditional Radon-Nikodym derivatives and, as of now, conditional relative entropies): I struggle to figure out the general definitions (in this case conditional derivatives), only to realize that they are straightforward. The only problem is that we need additional assumptions to ensure that this, the only possible, definition is reasonable. Usually that means Borel spaces, which give access to kernel theory; or we just work with kernels in the first place, and then we don't need the assumption. Potato, potato.

Matija