2

I'm reading about the Bayes problem in the textbook *A Probabilistic Theory of Pattern Recognition* by Devroye et al.


They make use of $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ throughout the proof.



In my understanding, the conditional probability $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ is defined only when $\mathbb P\{X=x\} > 0$. If $X$ is continuous, for example if $X$ follows a normal distribution, then $\mathbb P\{X=x\}=0$ for all $x \in \mathbb R$, so $\eta(x)$ would be undefined for all $x \in \mathbb R$. This confuses me.

Could you please elaborate on this point?

Akira
  • 17,367

3 Answers

2

Some comments:

  1. You can get intuition from assuming that the setup is that $(X,Y)$ is some process where $Y$ is sampled from a distribution that depends on the realization of $X$. For instance, maybe $X \sim \mathrm{Unif}([0,1])$, and $Y$ is a sample from an independent coin with bias $X$. Conditioned on $X = 1/2$, $Y$ is a fair coin. This is pretty close to the learning theory context anyway -- there are some features, $X$, and the class $Y$ is some random function of the features. (A small simulation after this list illustrates exactly this setup.)

    This situation is also essentially general, in a way that is made precise in 3. So, there's really no harm in imagining that this is the story with the data you are trying to learn a classifier for. (Since $Y$ is a binary random variable, you can skip to 5.)

  2. If $(X,Y)$ has a continuous pdf $p(x,y)$, then you can define $p_x(y) = \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y) dy }$ as the pdf of $Y$ conditioned on $X = x$. You need the integral in the denominator to be nonzero, but this is a weaker condition than $P(X = x) > 0$. In this specific case, $Y$ is a binary variable, so we'd have $p_x(y) = \frac{ p(x,y)}{p(x,0) + p(x,1)}$. See Wikipedia for more, though I'll now discuss some of the formalism.

  3. You can define a notion of conditional probability for measure zero sets, called disintegration of measure. It's really not necessary for learning theory, and since building it in general is pretty technical, I wouldn't worry about it unless it interests you (if it does, the survey by Chang and Pollard cited on Wikipedia is worth reading, as is Chapter 5 in Pollard's "User's Guide"). One important comment, though: you have to build up all of the conditional distributions at once; they are defined a.e. as a family with respect to the distribution of $X$. Otherwise, you run into problems like this: https://en.wikipedia.org/wiki/Borel%E2%80%93Kolmogorov_paradox

    You can verify that $p_x(y)$ as defined above actually gives a disintegration. I'm not sure exactly what conditions are necessary for this to hold, other than that $p_x(y)$ is well defined and all the integrals you write down in that verification make sense. In particular, I don't think that $p(x,y)$ needs to be a continuous pdf, but I would want to find a reference to double-check.

    Here's a sketch of the verification; for the notation $\mu_x, \nu$ see Wikipedia. (Note that there is some notation clash -- what they call $Y$ is here called $X \times Y$.) The pushforward measure is $d\nu(x) = \left(\int_{\mathbb{R}} p(x,y)\, dy\right) dx$, and $d\mu_x(y) = p_x(y)\, dy$ on the fiber $\{x\} \times \mathbb{R}$. When you plug this into the formula from Wikipedia, $\int_X \left(\int_{\pi^{-1}(x)} f(x,y)\, d\mu_x(y)\right) d\nu(x)$, you get:

$$\int_{\mathbb{R}} \int_{\mathbb{R}} f(x,y) \frac{ p(x,y)}{ \int_{\mathbb{R}} p(x,y) dy } dy (\int_{\mathbb{R}} p(x,y) dy) dx = \int_{\mathbb{R}^2} f(x,y) p(x,y) dxdy.$$

  4. From the learning theory point of view, I think it makes sense to imagine fixing a disintegration and treating that as the notion of conditional probability for $Y$. Even though it is only defined a.e. in $X$, you are not classifying some arbitrary point $x$, but one produced from the distribution. Thus, you'll never 'see' disagreements between two different fixed choices of disintegration. In particular, you can take the particularly nice disintegration given by the formula for $p_x(y)$ above. Also, this means you can treat your distribution as if it is of the kind described in point 1.

  5. If $Y$ is a $\{0,1\}$-valued random variable, then $P(Y = 1) = \mathbb{E}[Y]$. Another way to define $P(Y = 1 \mid X = x) = E[Y \mid X = x]$ is via conditioning: the random variable $E[Y \mid X]$ is $\sigma(X)$-measurable, so there is a measurable function $f$ with $E[Y \mid X] = f(X)$. You can then define $E[Y \mid X = x] = f(x)$. Note that, like a disintegration, this is only defined up to almost sure equivalence, since $E[Y \mid X]$ is only unique up to almost sure equivalence. However, you can pick nice representatives. For instance, if $Y$ is a coin flip with bias $p$, independent of $X$, then $E[Y \mid X] = p$, so we can take $E[Y \mid X = x] = p$.
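
For concreteness, here is a minimal simulation of the setup in point 1 (a purely illustrative sketch; the sample size, bin width, and seed are arbitrary choices): $X \sim \mathrm{Unif}([0,1])$ and, given $X = x$, $Y$ is a coin flip with bias $x$, so $\eta(x) = x$. Averaging $Y$ over a narrow bin of $X$ recovers $\eta$ there, which is the practical content of the a.e. statements above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

X = rng.uniform(0.0, 1.0, size=n)          # features, X ~ Unif([0, 1])
Y = (rng.uniform(size=n) < X).astype(int)  # given X = x, Y is a coin with bias x

bins = np.linspace(0.0, 1.0, 21)           # 20 bins of width 0.05
centers = 0.5 * (bins[:-1] + bins[1:])
which = np.digitize(X, bins) - 1           # bin index of each sample
eta_hat = np.array([Y[which == k].mean() for k in range(len(centers))])

for c, e in zip(centers, eta_hat):
    print(f"x ~ {c:.3f}   empirical eta ~ {e:.3f}   true eta(x) = x = {c:.3f}")
```

Each printed row should show the empirical estimate within Monte Carlo noise of the true value $\eta(x) = x$ at the bin center.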

Elle Najt
  • 20,740
  • If $X$ follows a normal distribution, then $\eta(x) = 0$ for all $x \in \mathbb R$, and then the optimal classifier is $g^* = 0$, which confuses me. One direction is to define $\eta(x)$ through the PDF-PMF, but I don't know how to modify the proof accordingly. – Akira Sep 10 '20 at 16:39
  • @LAD That's not right. If, for instance, $X$ is normal, and $Y$ is an independent, fair coin flip, then $\nu(x) = 1/2$ a.e. – Elle Najt Sep 10 '20 at 17:54
  • @LAD Since $Y$ is a binary random variable, there is another interpretation via conditional expectation you can use. I'll add that as 5. – Elle Najt Sep 10 '20 at 17:56
  • (Sorry I meant $\eta(x) =1/2$.) – Elle Najt Sep 10 '20 at 19:59
  • In measure-theoretic probability theory, $\mathbb P[Y = 1 \mid X = x] = \mathbb E[\mathbf{1}_{\{Y=1\}} \mid \mathbf{1}_{\{X=x\}}]$ and $\mathbb P[X = x] = \mathbb E[\mathbf{1}_{\{X=x\}}]$. Here $\mathbf{1}_{\{Y=1\}}$ and $\mathbf{1}_{\{X=x\}}$ are both integrable. Then $\mathbb P[Y = 1 \mid X = x]$ is well-defined even when $\mathbb P[X = x] = 0$. So you meant it's possible that $\mathbb P[X = x] = 0$ and $\mathbb P[Y = 1 \mid X = x] > 0$. Have I understood your idea correctly? – Akira Sep 10 '20 at 21:03
  • You meant $\mathbb P [ X = x] = 0$ implies $\mathbb P [ Y = 1 | X = x] =0$? – Akira Sep 10 '20 at 21:38
  • I'm sorry, but I'm interested in $\mathbb E[\mathbf{1}_{\{Y=1\}} \mid \sigma(\mathbf{1}_{\{X = x\}})]$. – Akira Sep 10 '20 at 21:47
  • @LAD Oops, I got confused. Let me start over -- if $P(X = x) = 0$, then the sigma algebra generated by that event is trivial, so conditioning on it gives you the expectation of $Y$. Note also that because $Y \in \{0,1\}$, $1_{Y = 1} = Y$. So $E[Y \mid \sigma(1_{X = x})] = E[Y]$. This is not the way to define $E[Y \mid X = x]$. – Elle Najt Sep 10 '20 at 21:53
  • I'm confused too. Could you please write the explicit formula? This problem really makes me re-think all I know (just a little bit) about probability. I've just posted a related question here. – Akira Sep 10 '20 at 21:55
  • @LAD The explicit formula is in bullet 2. If you really want to understand 5. you'll need to learn measure theoretic probability theory, in particular the notion of conditioning against a sigma algebra. – Elle Najt Sep 10 '20 at 21:57
  • Could you please elaborate on what "This is not the way to define $E[Y | X = x]$" means? – Akira Sep 10 '20 at 21:57
  • I mean that you can't define $E[Y \mid X = x]$ by conditioning against the event $1_{X = x}$. You need to use the entire random variable $X$, and either go the disintegration route, or condition against the $\sigma$-algebra generated by $X$. The reason is the Borel-Kolmogorov paradox. – Elle Najt Sep 10 '20 at 21:58
  • If you just want to learn learning theory, just assume that $(X,Y)$ is as in the first bullet, so $Y \sim P_X$, where $P$ is some family of probability distributions with parameter $X$, and pretend that $Y$ conditioned on $X = x$ means that $Y$ has distribution $P_x$. The latter remarks explain how you can justify this. – Elle Najt Sep 10 '20 at 22:01
  • @LAD In case it helps: if you fix a sub-$\sigma$-algebra $F$ of the $\sigma$-algebra $G$, and $Y \in L^2(G)$, then $E[Y \mid F]$ is the orthogonal projection of $Y$ onto the subspace of $L^2(G)$ consisting of the $F$-measurable functions, say $L^2(F)$. When $A$ is a measure zero event, $L^2(\sigma(A))$ is just the subspace of constant functions, and the orthogonal projection ($E[1 \cdot Y]\, 1$) is calculating the mean. In more intuitive terms, if all you know is whether or not a measure zero event happened, your best guess for $Y$ is just its expectation. – Elle Najt Sep 10 '20 at 22:26
2

I think it's a great question. Here is one answer, or at least a partial answer. Suppose that $f$ is a joint PDF - PMF for $X$ and $Y$, so that $$f(x, y) \Delta x \approx P(X \in [x, x+\Delta x] \text{ and } Y = y).$$ Then the expression $P(Y = 1 \mid X = x)$ can be defined to mean $\frac{f(x, 1)}{f(x,0) + f(x,1)}$. Why is this a reasonable definition? Intuitively, because if $\Delta x$ is a small positive number then $P(Y = 1 \mid X = x)$ should be approximately equal to \begin{align} P(Y = 1 \mid X \in [x,x+ \Delta x]) &= \frac{P(Y = 1, X \in [x,x+ \Delta x])}{P(X \in [x,x+ \Delta x])} \\ &\approx \frac{f(x,1) \Delta x}{f(x,0) \Delta x + f(x,1) \Delta x} \\ &= \frac{f(x,1)}{f(x,0) + f(x,1)}. \end{align} I'm not fully satisfied with this explanation, though.
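
To make the ratio definition concrete, here is a small numeric check under an assumed joint distribution (my own choice, purely illustrative): $X \sim N(0,1)$ and $P(Y = 1 \mid X = x) = \sigma(x) = 1/(1+e^{-x})$, so $f(x,1) = \varphi(x)\,\sigma(x)$, $f(x,0) = \varphi(x)\,(1-\sigma(x))$, and the ratio $\frac{f(x,1)}{f(x,0)+f(x,1)}$ collapses to $\sigma(x)$. The window width $\Delta x$, the point $x_0$, and the sample size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000_000
dx = 0.05          # window width Delta x
x0 = 0.7           # point at which we "condition"

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

X = rng.normal(size=n)                            # X ~ N(0, 1)
Y = (rng.uniform(size=n) < sigma(X)).astype(int)  # P(Y = 1 | X = x) = sigma(x)

in_window = (X >= x0) & (X < x0 + dx)
window_estimate = Y[in_window].mean()  # ~ P(Y = 1 | X in [x0, x0 + dx])
ratio_definition = sigma(x0)           # f(x0, 1) / (f(x0, 0) + f(x0, 1))

print(f"ratio definition : {ratio_definition:.4f}")
print(f"window estimate  : {window_estimate:.4f}")
```

The two printed numbers should agree up to Monte Carlo error, which is exactly the $\Delta x \to 0$ heuristic in the display above.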

littleO
  • 51,938
  • If $X$ follows a normal distribution, then $\eta(x) = 0$ for all $x \in \mathbb R$, and then the optimal classifier is $g^* = 0$, which confuses me. One direction is to define $\eta(x)$ through the PDF-PMF, but I don't know how to modify the proof accordingly. – Akira Sep 10 '20 at 16:41
0

I'm not sure I understand your question, so please let me know if I haven't answered it: I believe you have a misunderstanding about $\eta$. It is the probability that $Y=1$ given the value of $X$, so it is in general not $0$, even in the example you gave.

Building on your example: let $Y$ be distributed as Bernoulli with parameter $p$ and independent of $X$; then $\eta(x) = p$, not $0$.
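
A quick sanity check of this example (the value $p = 0.3$ and a standard normal $X$ are arbitrary choices, for illustration only): conditioning on a narrow window around any $x$ recovers roughly $p$, even though $P(X = x) = 0$ at every point.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
p = 0.3

X = rng.normal(size=n)          # X ~ N(0, 1), so P(X = x) = 0 for every x
Y = rng.binomial(1, p, size=n)  # Bernoulli(p), independent of X

for x0 in (-2.0, 0.0, 1.5):
    near_x0 = np.abs(X - x0) < 0.05
    print(f"x = {x0:+.1f}: empirical P(Y = 1 | X near x) = {Y[near_x0].mean():.3f}")
```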

That is a great book by the way. Lots of interesting problems in there.

dmh
  • 2,958
  • In my understanding, the conditional probability $\eta(x)=\mathbb{P}\{Y=1 \mid X=x\}$ is defined only when $\mathbb P\{X=x\} > 0$. See here. If $X$ follows a normal distribution, then $\mathbb P\{X=x\} = 0$ for all $x \in \mathbb R$. – Akira Sep 10 '20 at 06:10
  • See https://math.stackexchange.com/questions/110112/probability-conditional-on-a-zero-probability-event; it is related to the Radon–Nikodym derivative. – dmh Sep 10 '20 at 12:27