
Suppose an urn contains unknown but non-random numbers of red and green marbles, and I take a random sample of a known and non-random size. Observing the numbers of red and green marbles in the sample, I need to hazard my best guess as to the proportion (not the total number) of red marbles in the urn.

The number of distinct marbles I observe is likely to be larger if the sample is taken without replacement than with replacement. Therefore, there should be less uncertainty in my estimate if I sample without replacement.

THEREFORE the standard deviation of the random number of red marbles in my sample is smaller if I sample without replacement than with.

Clearly the above is not a logically rigorous argument. (Of course, one can establish the same result by easy standard arguments, but that will be off-topic here.)
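For what it's worth, here is a quick numerical check of the marble claim with made-up numbers (an urn of 100 marbles, 60 of them red, and a sample of 25), assuming scipy is available; sampling without replacement is hypergeometric, sampling with replacement is binomial.

```python
# Compare the standard deviation of the number of red marbles in the sample
# under the two sampling schemes (illustrative urn: 100 marbles, 60 red; n = 25).
from scipy.stats import hypergeom, binom

N_total, N_red, n_sample = 100, 60, 25

sd_without = hypergeom(N_total, N_red, n_sample).std()   # without replacement
sd_with = binom(n_sample, N_red / N_total).std()         # with replacement

print(sd_without, sd_with)   # ~2.13 vs ~2.45: smaller without replacement
```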

Now suppose I take a sample of $n$ independent measurements $X_1,\dots,X_n$ from a population that is normally distributed with unknown mean $\mu$ and variance $\sigma^2$. Let $\overline{X} = (X_1+\cdots+X_n)/n$ and $S^2 = \sum_{i=1}^n (X_i - \overline{X})^2/(n-1)$ be respectively the sample mean and the sample variance. Then $$ \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} $$ is normally distributed with mean $0$ and variance $1$, so the endpoints of a confidence interval for $\mu$ are $$ \overline{X} \pm A \frac{\sigma}{\sqrt{n}} $$ where $A$ is a suitable percentage point of the standard normal distribution (provided, unrealistically, that $\sigma$ is known).

And $$ \frac{\overline{X} - \mu}{S/\sqrt{n}} $$ has a Student's t-distribution with $n-1$ degrees of freedom, so $$ \overline{X} \pm B_n \frac{S}{\sqrt{n}} $$ where $B_n$ is the corresponding percentage point of Student's distribution, are the endpoints of a corresponding confidence interval if (more realistically) $\sigma$ is unknown.

There is more uncertainty if $\sigma$ is unknown than if $\sigma$ is known, THEREFORE $B_n > A$.

If $\sigma$ is unknown, then the amount of uncertainty decreases as the sample size $n$ increases, THEREFORE $B_n$ decreases as $n$ increases. (Of course, the fact that $B_2 > B_3 > B_4 > \cdots > A$ can be proved by standard arguments involving fiddling with integrals.)
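As a numerical illustration (not a proof) of both THEREFOREs, one can tabulate the percentage points for a two-sided 95% interval; the 95% level is an arbitrary choice, and scipy is assumed to be available.

```python
# A = standard-normal percentage point; B_n = Student's t percentage point
# with n - 1 degrees of freedom, for a two-sided 95% interval.
from scipy.stats import norm, t

A = norm.ppf(0.975)                  # ~1.960
for n in (2, 3, 5, 10, 30, 100):
    B_n = t.ppf(0.975, df=n - 1)
    print(n, round(B_n, 3))          # 12.706, 4.303, 2.776, 2.262, 2.045, 1.984
print("A =", round(A, 3))            # each B_n exceeds A, and B_n decreases in n
```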

The logical leaps corresponding to the THEREFOREs above have a seeming intellectual compellingness about them. It would be extraordinarily paradoxical if the conclusions following them failed to hold despite the reasoning preceding them. But these are not logically rigorous arguments.

So my question is whether there could be some theorem of logic that says things like those following the THEREFOREs are true when things like those preceding them are true.

Andrej Bauer
Michael Hardy
  • I don't see where the whole discussion of marbles sampled with or without replacement fits with the rest of the question. – Thierry Zell Dec 15 '10 at 21:59
  • Do you have a mathematical meaning of "uncertainty" in mind? – Daniel Litt Dec 16 '10 at 02:11
  • @Thierry: Is your comment supposed to be a question? – Michael Hardy Dec 16 '10 at 02:40
  • @Daniel: No. Do you? – Michael Hardy Dec 16 '10 at 02:41
  • @Daniel: If you have some precisely mathematically defined meaning of "uncertainty" in mind, that might have the effect of reducing my question to a well-defined math problem. And maybe that would amount to a complete answer to my question, even if a lot of work still had to be done to solve the math problem. Or maybe it wouldn't. – Michael Hardy Dec 16 '10 at 02:45
  • @Michael: well, the obvious question is then the following: am I missing something or are the marbles irrelevant? With the follow up: if I am missing something, what is it? If the marbles are irrelevant, why are they here? – Thierry Zell Dec 16 '10 at 03:36
  • The inference from the things stated before "THEREFORE" in the situation with the marbles to the conclusion stated after it is the same sort of thing as the inference corresponding to the "THEREFORE" involving the normal distribution. – Michael Hardy Dec 16 '10 at 04:32
  • What does this have to do with logic? – Felipe Voloch Dec 16 '10 at 16:05
  • @Felipe: Any satisfying answer that encompasses both of the examples would have to be simple; it would have to be something about what inferences can be validly drawn. Hence logic. – Michael Hardy Dec 16 '10 at 16:48
  • @Thierry: To follow up on my earlier answer: The marbles may not be relevant to the example that follows them. They're a separate example of the same phenomenon. – Michael Hardy Dec 16 '10 at 16:50
  • @Michael: On a site such as this, I would interpret the logic tag to mean that the question had something to do with Mathematical Logic (the subdiscipline), not that some statement was a logical consequence of another. – Felipe Voloch Dec 16 '10 at 19:45
  • @Felipe: So would I. You see that in my examples I have not shown that the thing that follows "THEREFORE" is a logical consequence of the thing that preceded it. A good answer would propose some sort of mode of reasoning of broad applicability rather than merely a theorem about probability distributions or something like that. – Michael Hardy Dec 16 '10 at 22:26
  • As a logician I would just like to remark that when you capitalize words, and even make them boldface, it looks like YOU ARE SHOUTING OUT LOUD. Also, the logic tag is unwarranted, because it would indicate that the question has something to do with formal logic, but in reality this just seems to be a discussion among statisticians and probability theorists. – Andrej Bauer Jan 20 '20 at 07:58
  • @AndrejBauer : Perhaps in certain moods I might admit that the connection with formal logic is not as obvious as is the connection of some other questions with formal logic. – Michael Hardy Jan 20 '20 at 15:25
  • Well, I am subscribed to the logic tag and this question sticks out. Don't get me wrong, I upvoted the question because as a logician I like to think about math that's hard to make precise, but it just isn't about logic. It's really about "precise ordinary math". – Andrej Bauer Jan 20 '20 at 20:22
  • @AndrejBauer : No ... You're missing part of this. – Michael Hardy Jan 20 '20 at 20:34

2 Answers


Let's recall how all these confidence intervals work in general. You have some real parameters $t,s$ that define the distribution. You have two unbiased estimators $T$ and $S$ of these parameters that depend on the (vector) observation $X$. You want to create a confidence interval for $t$ from $T$ and the known $s$ in one case, and from $T$ and $S$ in the other case. We will also assume that, for every $s$, the distribution of $T$ is symmetric around $t$ with density decreasing in both directions away from $t$, and that the distribution of $T-t$ depends on $s$ but not on $t$. The first case is then simple: we take $T$ and find the least $a=a(s)$ such that $P(|T-t|>a(s))<\delta$, and say that $[T-a(s),T+a(s)]$ is the $\delta$-confidence interval. What we mean by that is that, regardless of what the actual value of $t$ is, the probability that this interval won't cover it is less than $\delta$. Let's also assume that $s>0$ and that $a(s)$ is increasing in $s$. Now let us ask: is it possible that $[T-a(S),T+a(S)]$ is a $\delta'$-confidence interval with some $\delta'<\delta$ (which, in your terminology, would mean that determining $s$ from the observation beats knowing it)?

The answer is, of course, "Yes": even though $S$ is an unbiased estimator, it is perfectly possible that $S$ overestimates more often than it underestimates and that $a$ is more sensitive to overestimates than to underestimates. To create a particular example, take the uniform distribution on the interval $[t-s^2,t+s^2]$ and the observations $X_1,X_2,X_3$. Take $T=X_1$ (I know, this is a bit stupid, but how do you know whether your estimators are not equally stupid in general?). Then $a(s)=(1-\delta)s^2$. Now take $S=c\sqrt{|X_2-X_3|}$, where $c$ is chosen so that $S$ is unbiased. It remains to compute the probability that the true value is in the interval. $S$ is independent of $T$ and equals $ys$ with probability $p(y)\,dy$ for some positive continuous density $p$ on $[0,c\sqrt{2})$. So we get $\int p(y)\min(1,(1-\delta)y^2)\,dy=(1-\delta)\int p(y)y^2\,dy>(1-\delta)$, because $\int p(y)y\,dy=1$ already ($S$ is unbiased) and hence $\int p(y)y^2\,dy>1$ by Jensen's inequality, provided that $1-\delta$ is small enough (which is ridiculous too, but how do you tell the difference?). So your $B_n$ may be larger than $A$ not for some deep probabilistic reason, but for the mundane reason that $\sqrt{s}$ is concave, not convex like $s^2$ in my example.

Now, this doesn't show that we can do better with less information. It merely shows that the proposed way to compare effectiveness is fundamentally flawed. But how else can you compare the confidence intervals?
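Here is a rough Monte Carlo sketch of the construction above, with the illustrative choices $t=0$, $s=1$, $\delta=0.9$ (so the nominal coverage $1-\delta$ is only $10\%$, the "ridiculous" regime); the constant $c$ is calibrated empirically rather than computed in closed form, and numpy is assumed to be available.

```python
# Coverage of [T - a(s), T + a(s)] (s known) vs [T - a(S), T + a(S)] (s estimated)
# in the uniform example: X_i uniform on [t - s^2, t + s^2], T = X_1,
# a(s) = (1 - delta) * s^2, S = c * sqrt(|X_2 - X_3|) with E[S] = s.
import numpy as np

rng = np.random.default_rng(0)
t_true, s, delta = 0.0, 1.0, 0.9
N = 10**6

def draw(n):
    return rng.uniform(t_true - s**2, t_true + s**2, size=n)

# Calibrate c so that S is unbiased for s (analytically c = 15 / (8*sqrt(2)) here).
c = s / np.sqrt(np.abs(draw(N) - draw(N))).mean()

X1, X2, X3 = draw(N), draw(N), draw(N)
S = c * np.sqrt(np.abs(X2 - X3))

a_known = (1 - delta) * s**2    # radius with s known
a_est = (1 - delta) * S**2      # radius with s estimated

print(np.mean(np.abs(X1 - t_true) <= a_known))   # ~0.100 (= 1 - delta)
print(np.mean(np.abs(X1 - t_true) <= a_est))     # ~0.117 (> 1 - delta: the interval
                                                 # with estimated s covers more often)
```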

fedja
  • For $S$ to be unbiased, $c$ must depend on $s$ (in fact, $c=sc_0$ for an absolute constant $c_0$). Hence $S$ is not an estimator of $s$ based on the sample $X$. But maybe I am missing something? – Did Dec 16 '10 at 06:50
  • Check again: $X_2-X_3$ is proportional to $s^2$ (stretch the interval, and the difference will stretch the same number of times), so the square root is proportional to $s$, meaning that if you divide it by $s$, the distribution won't depend on $s$. Am I missing something? Note that $S$ is an estimator for the parameter $s$, not for the length $s^2$! – fedja Dec 16 '10 at 07:28
  • fedja: you are right. Sorry about the noise. – Did Dec 16 '10 at 12:24
  • Didier: No need to apologize. It is always pleasant when somebody reads your posts or papers attentively enough to start noticing real or imaginary errors (an imaginary error a reader notices is just a signal that something may not be written clearly enough or emphasized, and once a signal carries a message, it cannot be noise) :). – fedja Dec 16 '10 at 13:24
  • @fedja: I'm slightly rushed today, but I think within the next couple of hours I'll have digested your answer somewhat............ – Michael Hardy Dec 16 '10 at 15:32
  • I've looked at this a little bit more, and so far it's not clear to me why you're seeking an unbiased estimator of the square root of the scale parameter rather than an unbiased estimator of the scale parameter itself. And do you insist on unbiasedness? That is inessential. I could have based a confidence interval on $\sum_{i=1}^n (X_i - \overline{X})^2/n$ rather than using $n-1$, and the interval would have been the same (instead of the standard Student's distribution, I'd have used a slightly rescaled version). I was merely following conventional usage in choosing $n-1$ instead of $n$. – Michael Hardy Dec 17 '10 at 01:46
  • OK, at the end of a long day of administering exams I've glanced for a third time at what you're asserting, and I see two things: (1) I need to ponder your proof, not just your assertion; and (2) your example is of course one of those cases where the data contain relevant information not used in forming the confidence interval, so that if we have (for example) a 90% coverage rate, nonetheless sometimes the data themselves tell us that the case we've got is one of the other 10%. (E.g. if $X_2$ and $X_3$ are close together and $X_1$ is remote from them.) – Michael Hardy Dec 17 '10 at 03:53
  • ....and (3) I'm not yet sure whether your use of the over-rated concept of unbiasedness (over-rated, at least, by the broader public, if not by statisticians) is essential to your argument. But now I'm going to crash for the night and look at this again tomorrow. – Michael Hardy Dec 17 '10 at 03:55
  • P.S.: There is a paper by Robert Buehler published about 50 years ago, that I think some people may have construed as meaning the usual confidence intervals based on the t-distribution may actually fail to take into account some of the relevant information in the data. I ought to look at that and figure out what it's actually saying. – Michael Hardy Dec 17 '10 at 03:56
  • P.P.S.: Did you intend your final question as a rhetorical question suggesting that there's some specific other way I should be thinking about this? – Michael Hardy Dec 17 '10 at 03:58
  • Oh, my! That's a whole series of comments! Not sure I'll answer everything... Unbiasedness: this is one way I know to show that you don't always over/underestimate, in which case everything is trivial (of course, if you consistently underestimate, you'll have to use a larger function, and if you consistently overestimate, you can certainly use a smaller one). I mean, the estimator should relate to the true value in some way. And be careful with $n-1$: replacing it by $n$ is in your favor (you get an underestimate) but replacing it by $n-1000$ may not work. – fedja Dec 17 '10 at 06:01
  • Square root: And why not? Why do you estimate $\sigma^2$ instead of $\sigma$, which is the "true scale parameter" for the Gaussians? I'll tell you why: the concavity is then in your favor, because the square root of your estimator is an underestimate for $\sigma$. So why is that not the real reason that you have to multiply it by a larger number? The questions are not rhetorical. If you can answer them, it'll clarify a few things. We both agree that the problem is vague so far. I formalized what you wrote, but not what you meant to mean. – fedja Dec 17 '10 at 06:10
  • You can't just use a larger or smaller function to remedy bias unless the amount by which it is to be made larger or smaller is observable, i.e. either can be computed based on the data or derived mathematically, as opposed to depending on the unobservable things you're trying to estimate. Example: Suppose $X_1,\dots,X_n$ are i.i.d. Bernoulli trials, i.e. independent and each equals 0 with probability $p$ and 1 with probability $1-p$. Then there can be NO unbiased estimator of $\sqrt{p}$, since only polynomial functions of $p$ can be estimated without bias in this case. – Michael Hardy Dec 17 '10 at 15:30
  • I'm surprised you defend unbiasedness on the grounds that an estimator has to be related to the parameter it's estimating. You write quite astutely at points, but that part is just clumsy. See my paper "An Illuminating Counterexample" (the title is a pun), American Mathematical Monthly, Vol. 110, No. 3 (Mar., 2003), pp. 234--238. The phenomenon described there, that unbiasedness can be a very bad thing, has long been widely known; my example has more visual strikingness (IMO) than earlier ones. – Michael Hardy Dec 17 '10 at 15:40
  • In the example with which I started this thread, as I pointed out, unbiasedness has NO effect on the bottom line. I could have used the maximum-likelihood estimator of $\sigma^2$, which is the biased estimator $\sum_{i=1}^n (X_i - \overline{X})^2/n$ and the resulting confidence interval would have been exactly equal to the one that you get by using the conventional unbiased estimator. (BTW, this is one case where the biased estimator (slightly) outperforms the unbiased one if you judge by the mean-squared-error criterion.) – Michael Hardy Dec 17 '10 at 15:43
  • You are mistaken as to the reason for starting with an estimator of the square of the scale parameter. The reason is that in this case that is demonstrably more efficient than using (e.g.) mean absolute deviation or other alternatives. (At this point some people could jump in to say that only slight deviations from normality upset that conclusion. (But only slightly.)) – Michael Hardy Dec 17 '10 at 15:46
  • Well, so it becomes not "show that there is a general effect for all reasonable estimators" but merely "there is an effect for some chosen estimators, where the choice is determined by some unknown 'efficiency' criterion". OK, state this criterion for a general distribution with a parameter, then, and I'll see if I can modify my example. About "exactly equal": it may be true for $n$ and $n-1$ by some coincidence, but certainly not for $n$ and $n/2$, so $n-1000$ will change the picture a bit. – fedja Dec 17 '10 at 16:39
  • Oh, yes, about the phrase "reason to choose": I use it in the sense "the bias in the choice of examples determined solely by the conditions and conclusions of the conjecture made". I mean, $239$ is the best number of all for many reasons (for me, at least). Now I conjecture that $n^2>10n$ for all $n$, and give $n=239$ as supporting evidence for my conjecture. You'll say, "You just obviously chose too big a number!", and I'll reply, "No, my choice was dictated not by its size at all but by ...". Linguistics, linguistics, linguistics... – fedja Dec 17 '10 at 16:46
  • It's not just "some coincidence". $n/2$ or $n/2000$ will also give you the exactly same result. It's trivial to show that. – Michael Hardy Dec 17 '10 at 16:59
  • The same result in terms of the interval length. But you compare $A$ not with that, but with the factor $B_n$ multiplying the estimator, which is certainly twice as small with $n/2$ in the denominator. I mean, we just speak different languages all the time (which is normal, of course). – fedja Dec 17 '10 at 17:04
  • Well, I think maybe this afternoon or tomorrow I may be able to carefully sort through the rest of the details of your answer. I'm guessing possibly what's going on may depend on the fact that your interval was constructed in a way that doesn't make use of all the relevant information in the data. The interval should be based on the pair $(\min\{X_1,X_2,X_3\}, \max\{X_1,X_2,X_3\})$, since the conditional distribution of $X_1,X_2,X_3$ given the max and the min does not depend on $s$ and $t$. Beyond that, there is actually a bit more to the notion of making use of all of the..... – Michael Hardy Dec 17 '10 at 21:28
  • .....relevant information, and that part is not as widely understood, but can be illustrated by an example similar to yours: If there were just two observations $X_1,X_2$, and if $s$ were known to be equal to 1, then the interval $(\min\{X_1,X_2\},\max\{X_1,X_2\})$ would be a 50% confidence interval for $t$, but if the min and the max were very close together, you wouldn't be at all confident that $t$ is in that interval, whereas if they were $1.999$ units apart, you'd be a lot more than 50% confident. I don't know who came up with that example. – Michael Hardy Dec 17 '10 at 21:31
  • .........still busy grading tests; I'll come up for air at some point and post something further here. – Michael Hardy Dec 20 '10 at 21:08

$\newcommand\si{\sigma}$There is no THEREFORE in general. Indeed, suppose we observe the values of the pair $(X,S)$ of independent random variables, where $P(X=\mu+\si)=P(X=\mu-\si)=1/4$, $P(X=\mu)=1/2$, $S:=\sqrt{S^2}$, $P(S^2=\si^2/2)=P(S^2=3\si^2/2)=1/2$, $\mu$ is an unknown real parameter, and $\si$ is a possibly unknown positive real parameter. If $\si$ is known, our confidence interval (CI) (for $\mu$) is symmetric about $X$ and based on the pivot $\frac{X-\mu}\si$ (by definition, a pivot is any random variable whose distribution does not depend on the parameters). If $\si$ is unknown, our CI is again symmetric about $X$ but now based on the pivot $\frac{X-\mu}S$. This setting is quite similar to your normal-sampling setting, with my pair $(X,S)$ in place of your pair $(\bar X,S)$. In particular, I also have $ES^2=\si^2$. (If desired, one may arbitrarily closely approximate the discrete distributions of $X$ and $S$ by continuous ones.)

Now, the CI with $\si$ known will always be better than the CI with $\si$ unknown if and only if $U:=\big(\frac{X-\mu}\si\big)^2$ is stochastically less than $V:=\big(\frac{X-\mu}S\big)^2$, that is, if $$P(U\ge u)\overset{\text{(?)}}\le P(V\ge u) \tag{1}$$ for all real $u$. However, $$P(U\ge1)=\tfrac12>\tfrac12\times\tfrac12 \\ =P\big((X-\mu)^2=\si^2\big)\,P(S^2=\si^2/2) \\ =P\big((X-\mu)^2=\si^2,S^2=\si^2/2\big) \\ =P(V\ge1).$$ So, (1) fails to hold for $u=1$.

However, $U$ is better than $V$ on average: $$EV=E\Big(U\frac{\si^2}{S^2}\Big)=EU\,E\Big(\frac{\si^2}{S^2}\Big) >EU\,\frac{\si^2}{ES^2}=EU, $$ where the inequality is an instance of Jensen's inequality.
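As a sanity check, the probabilities and expectations above can be computed by brute force over the finite supports; a minimal sketch taking $\mu=0$ and $\sigma=1$ (the values do not depend on this choice):

```python
# Verify P(U >= 1) > P(V >= 1) (so (1) fails at u = 1) and E[U] < E[V].
from itertools import product

mu, sigma = 0.0, 1.0
X_vals = [(mu + sigma, 0.25), (mu - sigma, 0.25), (mu, 0.5)]   # (value, probability)
S2_vals = [(sigma**2 / 2, 0.5), (3 * sigma**2 / 2, 0.5)]       # (value, probability)

P_U = sum(p for x, p in X_vals if ((x - mu) / sigma) ** 2 >= 1)
P_V = sum(px * ps for (x, px), (s2, ps) in product(X_vals, S2_vals)
          if (x - mu) ** 2 / s2 >= 1)
EU = sum(p * ((x - mu) / sigma) ** 2 for x, p in X_vals)
EV = sum(px * ps * (x - mu) ** 2 / s2
         for (x, px), (s2, ps) in product(X_vals, S2_vals))

print(P_U, P_V)   # 0.5 > 0.25
print(EU, EV)     # 0.5 < 0.666...
```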

Iosif Pinelis