I've been reading John Baez's series of posts on Information Geometry. I'm currently on part 6... Midway through the post he discusses Radon-Nikodym derivatives:
> The formula for information gain looks more slick: $$\int_\Omega \log\left(\frac{d\mu}{d\nu}\right)d\mu$$ And by the way, in case you’re wondering, the $d$ here doesn’t actually mean much: we’re just so brainwashed into wanting a $dx$ in our integrals that people often use $d\mu$ for a measure even though the simpler notation $\mu$ might be more logical. So, the function $\frac{d\mu}{d\nu}$ is really just a ratio of probability measures, but people call it a Radon-Nikodym derivative, because it looks like a derivative (and in some important examples it actually is). So, if I were talking to myself, I could have shortened this blog entry immensely by working directly with probability measures, leaving out the $d$'s, and saying:
>
> Suppose $\mu$ and $\nu$ are probability measures; then the entropy of $\mu$ relative to $\nu$, or information gain, is $$S(\mu,\nu) = \int_\Omega \log\left(\frac{\mu}{\nu}\right)\mu$$
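To make Baez's parenthetical concrete, at least as I understand it: if $\mu$ and $\nu$ have densities $p$ and $q$ with respect to Lebesgue measure on $\Omega \subseteq \mathbb{R}^n$, then $\frac{d\mu}{d\nu} = p/q$ ($\nu$-almost everywhere), and the slick formula becomes the familiar Kullback–Leibler divergence $$S(\mu,\nu) = \int_\Omega p(x)\log\frac{p(x)}{q(x)}\,dx$$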
I understand the integral when formulated with (the log of) the Radon-Nikodym derivative: since $\frac{d\mu}{d\nu}$ is just a function on elements of $\Omega$, the integral is the ordinary Lebesgue integral with respect to $d\mu$. However, I don't understand how the second, $d$-less integral is defined: $\log\left(\frac{\mu}{\nu}\right)$ isn't a function of elements of $\Omega$ but, if anything, a function of subsets of $\Omega$ (and it clearly isn't itself a measure). What's the right way of thinking about this integral?
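Concretely, when $\Omega$ is a countable set the first form makes perfect sense to me: there $\frac{d\mu}{d\nu}(\omega) = \mu(\{\omega\})/\nu(\{\omega\})$, so the integral is just the sum $$\sum_{\omega \in \Omega} \mu(\{\omega\})\log\frac{\mu(\{\omega\})}{\nu(\{\omega\})}$$ It's the $d$-less version over a general $\Omega$ that loses me.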
My first instinct is to break $\Omega$ into disjoint subsets whose $\mu$-measure is uniformly small, and take the limit of a sum over these sets as the bound shrinks. Let's say for now that all the measures involved are dominated by Lebesgue measure. Something like: let $\{A_i\}$ be a countable collection of subsets of $\Omega$ such that
- $\bigcup_i A_i = \Omega$
- $A_i \cap A_j = \emptyset$ when $i \ne j$
- $\sup_i \mu(A_i) < \varepsilon$
Then $$ \int_\Omega \log\left(\frac{\mu}{\nu}\right)\mu \equiv \lim_{\varepsilon \to 0} \sum_i \log\left(\frac{\mu(A_i)}{\nu(A_i)}\right)\mu(A_i) $$ with the convention that terms with $\mu(A_i) = 0$ contribute zero.
Clearly that isn't terribly rigorous, but is this on the right track conceptually? Or am I just deeply confused?
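As a sanity check on this intuition, here's a quick numerical sketch (my own construction, not from Baez's post; the two Gaussians and the equal-$\mu$-mass quantile bins are just illustrative choices, and I'm assuming NumPy/SciPy are available). It compares the partition sum against the closed-form KL divergence between the Gaussians:

```python
import numpy as np
from scipy import stats

# mu and nu: two absolutely continuous probability measures on R
# (illustrative choices; any pair with mu << nu would do).
mu = stats.norm(loc=0.0, scale=1.0)   # mu = N(0, 1)
nu = stats.norm(loc=1.0, scale=2.0)   # nu = N(1, 4)

# Closed-form KL divergence between Gaussians, for comparison:
# KL(N(m1,s1^2) || N(m2,s2^2)) = log(s2/s1) + (s1^2 + (m1-m2)^2)/(2 s2^2) - 1/2
m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
kl_exact = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

for n in [10, 100, 1000, 10000]:
    # Partition R into n bins of equal mu-mass 1/n using mu's quantiles,
    # so that sup_i mu(A_i) = 1/n -> 0 as n grows.
    edges = mu.ppf(np.linspace(0.0, 1.0, n + 1))  # edges[0] = -inf, edges[-1] = +inf
    mu_mass = np.full(n, 1.0 / n)                 # mu(A_i)
    nu_mass = np.diff(nu.cdf(edges))              # nu(A_i)
    partition_sum = np.sum(np.log(mu_mass / nu_mass) * mu_mass)
    print(f"n = {n:5d}: partition sum = {partition_sum:.6f}  (exact KL = {kl_exact:.6f})")
```

If the intuition is right, then since each partition here refines the previous one (every $n$-quantile is also a $10n$-quantile), the sums should increase toward the exact value from below as $n$ grows.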