
Under appropriate regularity conditions, it is well known that Maximum Likelihood Estimation (MLE) produces asymptotically efficient estimators, in the sense that their asymptotic covariance is the inverse of the Fisher information: $$ \sqrt n(\hat\theta_n-\theta)\overset{d}{\to}\mathcal N\bigl(0,I^{-1}(\theta)\bigr) \quad\text{as } n\to\infty, $$ where $\hat\theta_n$ denotes the MLE.

With the Method of Moments (MoM) this is not necessarily true. There are cases where MLE and MoM produce the same estimator; in general, however, MoM estimators come with no guarantee of asymptotic efficiency.

My question is: why? Intuitively, MLE requires specifying the exact distribution from which the observed data are drawn. On the other hand, MoM only requires specifying the first $m$ moments of the data's distribution (with $\theta\in\Bbb R^m$). So I suspect this lack of specificity is why we have no guarantee of asymptotic efficiency. Is this intuition correct? Can someone spell this out with a more compelling theoretical argument?
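For concreteness, here is a small Monte Carlo sketch (an editorial illustration, not part of the original question) comparing the two estimators of the Gamma shape parameter; the true values, the sample size, and the use of `scipy.stats.gamma.fit` are my own illustrative choices.

```python
# Illustrative Monte Carlo comparison of MLE vs. method of moments (MoM)
# for the shape parameter of a Gamma(alpha, beta) model.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, beta = 3.0, 2.0        # hypothetical true shape and scale
n, n_rep = 500, 2000          # sample size and number of replications

mle_shape, mom_shape = [], []
for _ in range(n_rep):
    x = rng.gamma(shape=alpha, scale=beta, size=n)
    a_hat, _, _ = stats.gamma.fit(x, floc=0)   # MLE, location fixed at 0
    mle_shape.append(a_hat)
    m, v = x.mean(), x.var()
    mom_shape.append(m**2 / v)                 # MoM: match mean and variance

print("empirical variance of the MLE of alpha:", np.var(mle_shape))
print("empirical variance of the MoM estimate:", np.var(mom_shape))
# The MoM variance is typically visibly larger, which is exactly the
# efficiency gap the question asks about.
```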

  • Moments are not necessarily sufficient statistics, while the likelihood function always is sufficient? For instance, method of moments for the "German tank problem" https://mathoverflow.net/questions/14964/estimate-population-size-based-on-repeated-observation leads to estimation based on the mean, which is not sufficient in this case. – kjetil b halvorsen Aug 01 '23 at 17:50
  • @kjetilbhalvorsen: The likelihood function is not even a statistic. A technically correct version of what I think you're saying is that if there is a sufficient statistic, a unique MLE will be a function of it. – Christian Remling Aug 01 '23 at 21:32
  • I don't think MLEs are in general sufficient. I seem to recall that at some time I knew a counterexample. Probably there is one in the book of Romano and Siegel titled Counterexamples in Probability and Statistics. – Michael Hardy Aug 02 '23 at 00:12
  • @ChristianRemling : Why would you describe the likelihood function as not a statistic? If you observe the data and you know the family of distributions, you can find the likelihood function without knowing the values of the population parameters. – Michael Hardy Aug 02 '23 at 00:13
  • I have heard that for families of distributions for which MLEs may lack a closed form or otherwise be problematic, one can use a method-of-moments estimate as a first approximation to the MLE and then take a single Newton step toward a solution of the likelihood equations, and the result is asymptotically as efficient as the MLE. (But sometimes "asymptotically efficient" isn't worth much. Such as when you have to work with a small sample.) – Michael Hardy Aug 02 '23 at 00:16
  • @MichaelHardy: MLE 's are not necessarily sufficient, but they are always functions of a sufficient statistic. – kjetil b halvorsen Aug 02 '23 at 00:19
  • @MichaelHardy: A statistic is, by definition, a random variable that is a function of the random sample. The likelihood function depends on the parameter. (If it doesn't, then any statistic is sufficient by Neyman's criterion.) – Christian Remling Aug 02 '23 at 00:41
  • @ChristianRemling : You are confused. WHICH FUNCTION is the likelihood function does NOT depend on the value of the parameter. It does depend on which parametrized family of probability distributions is considered, but it does not depend on the value of the population parameter. – Michael Hardy Aug 02 '23 at 18:38
  • @Remling : What is "Neyman's criterion"? Criteria I know for sufficiency of a statistic include Fisher's definition and Fisher's factorization theorem. The former says $T(X_1,\ldots,X_n)$ is sufficient for a particular family of distributions iff the conditional probability distribution of $(X_1,\ldots,X_n)$ given $T(X_1,\ldots,X_n)$ is the same for all probability distributions in the family. The latter says $T$ is sufficient iff$,\ldots\qquad$ – Michael Hardy Aug 02 '23 at 19:06
  • The latter says $T$ is sufficient if the joint probability density function $f_{X_1,\ldots,X_n}(x_1,\ldots,x_n)$ can be factored as a function of $T(x_1,\ldots,x_n)$ which may be different for different members of the family of distributions times another function that is the same for all members of the family of distributions. $\qquad$ – Michael Hardy Aug 02 '23 at 19:08
  • For example, for the family of Poisson distributions, where $X_1,\ldots,X_n$ are i.i.d. observations, the sum $X_1+\cdots+X_n$ is sufficient by the factorization criterion because $$ \frac{\lambda^{x_1} e^{-\lambda}}{x_1!} \cdots \frac{\lambda^{x_n} e^{-\lambda}}{x_n!} = \underbrace{e^{-n\lambda} \lambda^{x_1+\cdots+x_n}}_{\text{depends on }\lambda} \cdot \underbrace{\frac1{x_1!} \cdots \frac1{x_n!}}_{\text{free of }\lambda}. $$ The first factor depends on the $n$-tuple only through the sum, and also depends on $\lambda.$ The second does not depend on $\lambda.$ – Michael Hardy Aug 02 '23 at 19:17
  • @ChristianRemling : $\qquad\uparrow\qquad$ – Michael Hardy Aug 02 '23 at 19:18
  • @ChristianRemling : $\ldots,$ and, an example of Fisher's definition: If $X_1,\ldots,X_n\sim\text{i.i.d. }\operatorname{Poisson}(\lambda)$ then $$ \Pr(X_1=x_1,\ \ldots,\ X_n=x_n \mid X_1+\cdots+X_n=s) = \frac{(x_1+\cdots+x_n)!}{x_1!\cdots x_n!}. $$ Since that last expression does not depend on $\lambda,$ the sum of the observations is sufficient for this family of distributions. $\qquad$ – Michael Hardy Aug 02 '23 at 19:28
  • @ChristianRemling : So I wonder why anyone would cite something called "Neyman's criterion" rather than going back to the source and citing one of these. – Michael Hardy Aug 02 '23 at 19:29
  • @ChristianRemling: Why do you think so? See https://stats.stackexchange.com/a/405777/11887 – kjetil b halvorsen Aug 19 '23 at 01:00
  • @kjetilbhalvorsen Your original comment seems to be the prevailing reason for the performance differences between the two estimation methods. The data reduction portion of MoM's estimation (raw sample$\to$moments) is generally not sufficient and incurs information loss. On the other hand, the data reduction of MLE (raw sample$\to$likelihood function) induces a minimal sufficient partition of the sample space and thus does not incur a loss of information. – Aaron Hendrickson Nov 08 '23 at 15:41
  • @MichaelHardy: An obvious example where the MLE is not sufficient is i.i.d. Cauchy observations (with $n$ larger than 1). – kjetil b halvorsen Nov 08 '23 at 17:19
  • @ChristianRemling: See https://stats.stackexchange.com/questions/374833/likelihood-function-is-minimal-sufficient/405777#405777 for minimal sufficiency of the likelihood function – kjetil b halvorsen Nov 08 '23 at 17:22
  • @kjetilbhalvorsen : Do you mean the MLE for both location and scale in the Cauchy family? Or just for one of those? There's no such thing as an MLE until you've specified a particular family of distributions, e.g. that set of Cauchy distributions with a particular scale parameter (so that location is the only difference between two of them) or those with a particular location parameter (so that scale is the only difference between two of them) or all Cauchy distributions (so that both location and scale are to be estimated). I don't think I know what the MLE is in any of those cases. – Michael Hardy Nov 08 '23 at 17:33
  • @kjetilbhalvorsen : Although the MLE for any of those families would be expressed by not more than two scalars, whereas a minimal sufficient statistic requires $n$ scalars (the sample size), so that's enough to show that your point is correct. – Michael Hardy Nov 08 '23 at 17:37
  • @MichaelHardy: Both cases should give an example ... the MLE must be found numerically. Some examples here https://stats.stackexchange.com/questions/98971/mle-of-the-location-parameter-in-a-cauchy-distribution, https://stats.stackexchange.com/questions/373526/consistent-unbiased-estimator-for-the-location-parameter-of-mathcalcauchy, and search ... – kjetil b halvorsen Nov 08 '23 at 17:39
  • @kjetilbhalvorsen : I agree that that is an obvious example. I didn't think of it only because I don't know anything about the MLE of the family of Cauchy distributions. – Michael Hardy Nov 09 '23 at 18:12
  • Above I should have written that if $X_1,\ldots,X_n\sim\text{i.i.d. }\operatorname{Poisson}(\lambda)$ then $$ \Pr(X_1=x_1,\ \ldots,\ X_n=x_n \mid X_1+\cdots+X_n=s) = \frac{x_1!\cdots x_n!}{(x_1+\cdots+x_n)!}. $$ (I had the reciprocal of that last quantity, which is obviously wrong.) – Michael Hardy Nov 17 '23 at 05:48
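As a numerical companion to the Poisson sufficiency argument in the comments above, here is a short editorial sketch (the sample values are arbitrary choices) checking that the conditional probability of the sample given its sum is free of $\lambda$:

```python
# Check numerically that, for i.i.d. Poisson observations, the conditional
# probability of the sample given its sum does not depend on lambda,
# i.e. the sum is a sufficient statistic.
import numpy as np
from scipy import stats

x = np.array([2, 0, 3, 1])     # an arbitrary observed sample (illustrative)
n, s = len(x), x.sum()

def cond_prob(lam):
    """P(X_1=x_1, ..., X_n=x_n | X_1+...+X_n=s) under i.i.d. Poisson(lam)."""
    joint = np.prod(stats.poisson.pmf(x, lam))   # P(X = x)
    p_sum = stats.poisson.pmf(s, n * lam)        # sum of n i.i.d. Poisson(lam) is Poisson(n*lam)
    return joint / p_sum

for lam in (0.5, 1.0, 4.0):
    print(lam, cond_prob(lam))
# All printed values coincide, so the conditional law is free of lambda.
```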

2 Answers


The comment by kjetil b halvorsen seems to make a good point. In view of the ensuing discussion, it may be useful to detail, clarify, and complement some of the points raised; this is done below.

The likelihood function is usually defined as the map $\Theta\ni\theta\mapsto L_x(\theta):= f_\theta(x)$, where $\Theta$ is the parameter space, $x$ is a realization (value) of the random sample $X$ taken from the distribution with density $f_{\theta_0}$, $\theta_0\in\Theta$ is the "true" value of the parameter $\theta$, and $(f_\theta)_{\theta\in\Theta}$ is a family of probability densities (referred to as the statistical model).

So, for each realization $x$ of $X$ we have its own likelihood function $L_x$. We may then consider the function $x\mapsto \mathcal L(x):=L_x$, and we may even want to refer to this function $\mathcal L$ as the likelihood function as well. The function $\mathcal L$ will be measurable with respect to the cylindrical $\sigma$-algebra over $\mathbb R^\Theta$. So, the random function $L_X:=\mathcal L(X):=\mathcal L\circ X$ is a statistic with values in $\mathbb R^\Theta$. Trivially, the statistic $L_X$ is sufficient, since $f_\theta(x)=L_x(\theta)$ for all $\theta$ and $x$. (However, to focus on the essential ideas, let us not be concerned with measurability matters in what follows in this answer.)

So, trivially, the "maximum likelihood estimator (MLE)" $$\operatorname*{argmax}_{\theta\in\Theta}L_X(\theta)$$ with values in the set of all subsets of $\Theta$ is a function of the sufficient statistic $L_X$. This fact is hardly of any significance, though -- because the (entire) sample $X$ is of course always sufficient, and any statistic is, by definition, a function of the sufficient statistic $X$.

What is important is that, by the factorization criterion, the MLE is a function of any sufficient statistic, including minimal sufficient statistics.

(However, the MLE by itself of course does not have to be sufficient. E.g., if $\Theta=(0,\infty)$, $X=(X_1,\dots,X_n)$, $n\ge2$, and $X_1,\dots,X_n$ are i.i.d. normal random variables each with mean $\theta$ and variance $\theta^2$, then the almost surely unique MLE of $\theta$ is \begin{equation} \hat\theta:=\sqrt{\overline X^2/4+\overline{X^2}}-\overline X/2, \end{equation} where $\overline X:=\frac1n\,\sum_1^n X_i$ and $\overline{X^2}:=\frac1n\,\sum_1^n X_i^2$. However, here $(\overline X,\overline{X^2})$ is a minimal sufficient statistic -- which is not a function of $\hat\theta$, and therefore the MLE $\hat\theta$ is not sufficient.)
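As a quick sanity check of the closed form above, here is an editorial sketch (the true value and sample size are arbitrary choices) comparing $\hat\theta$ with a direct numerical maximization of the log-likelihood:

```python
# Verify the closed-form MLE for the N(theta, theta^2) model against a
# numerical maximization of the log-likelihood.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)
theta0, n = 2.0, 200                          # illustrative true value and sample size
x = rng.normal(loc=theta0, scale=theta0, size=n)

xbar, x2bar = x.mean(), np.mean(x**2)
theta_closed = np.sqrt(xbar**2 / 4 + x2bar) - xbar / 2   # formula from the answer

def neg_loglik(t):
    return -np.sum(stats.norm.logpdf(x, loc=t, scale=t))

res = optimize.minimize_scalar(neg_loglik, bounds=(1e-6, 100.0), method="bounded")
print(theta_closed, res.x)                    # the two values agree to numerical precision
```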


On the other hand, estimators other than the MLE (including method-of-moment estimators) do not have to be functions of a minimal sufficient statistic, and are therefore "less likely" to have good statistical properties. One way to see this is that, by the Rao–Blackwell theorem, if $S(X)$ is any estimator of $q(\theta)$ for some function $q$ and $T(X)$ is any sufficient statistic, then (i) $S_T(X):=E_\theta(S(X)|T(X))$ is a statistic (as it does not depend on $\theta$); (ii) $S_T(X)$ is a function of the sufficient statistic $T(X)$ (even if $T(X)$ is a minimal sufficient statistic); (iii) the bias of $S_T(X)$ for $q(\theta)$ is the same as the bias of $S(X)$ for $q(\theta)$, for all values of $\theta$; (iv) the variance of $S_T(X)$ is no greater than the variance of $S(X)$, for all values of $\theta$ (and the latter property generalizes to any convex loss function).

So, we can take any estimator that is not a function of a minimal sufficient statistic $T(X)$ and improve it by the Rao–Blackwell conditioning on $T(X)$ described above -- whereas the MLE cannot be improved this way, since the MLE is already a function of any (minimal) sufficient statistic and hence conditioning on any sufficient statistic does not change the MLE.
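To make the Rao–Blackwell step concrete, here is a small editorial simulation (parameter values are arbitrary) for the Poisson model: start from the crude unbiased estimator $S(X)=X_1$ of $\lambda$ and condition on the sufficient statistic $T(X)=\sum_i X_i$, which gives $E(X_1\mid T)=\overline X$.

```python
# Rao-Blackwell in action: conditioning the crude estimator X_1 on the
# sufficient statistic sum(X) keeps the (zero) bias but shrinks the variance
# from about lambda to about lambda/n.
import numpy as np

rng = np.random.default_rng(2)
lam, n, n_rep = 3.0, 50, 20000                # illustrative values

samples = rng.poisson(lam, size=(n_rep, n))
crude = samples[:, 0]                         # S(X) = X_1
rao_blackwellized = samples.mean(axis=1)      # E(S(X) | sum X_i) = X-bar

print("means:    ", crude.mean(), rao_blackwellized.mean())   # both near lambda
print("variances:", crude.var(), rao_blackwellized.var())     # ~lambda vs ~lambda/n
```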


Finally, about this:

Intuitively, MLE requires specifying the exact distribution from which the observed data are drawn. On the other hand, MoM only requires specifying the first $m$ moments of the data's distribution (with $\theta\in\Bbb R^m$). So I suspect this lack of specificity is why we have no guarantee of asymptotic efficiency. Is this intuition correct?

The answer to this is no. Indeed, if $\theta$ is an $m$-dimensional parameter and the method of moments is applicable, then the knowledge of $m$ moments uniquely determines $\theta$, so that you have complete specificity. E.g., if $f_\theta$ is the density of the gamma distribution with parameter $\theta:=(\alpha,\beta)\in\Theta=(0,\infty)\times(0,\infty)$, then the mean $\mu_1(\theta)=\alpha\beta$ and the variance $\mu_2(\theta)=\alpha\beta^2$ uniquely determine $\theta=(\alpha,\beta)$. Another way to look at this is that, in a parametric model, knowing the value of the parameter $\theta$, you fully know the density $f_\theta$ and thus you fully know the corresponding distribution.
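A minimal sketch of this point (the helper `gamma_from_moments` is just an illustrative name): inverting the two moment equations recovers $(\alpha,\beta)$ exactly, so those two moments pin down the whole Gamma distribution.

```python
# Invert the moment equations mean = alpha*beta, variance = alpha*beta^2.
def gamma_from_moments(mean, var):
    """Solve alpha*beta = mean and alpha*beta^2 = var for (alpha, beta)."""
    beta = var / mean
    alpha = mean / beta            # equals mean**2 / var
    return alpha, beta

alpha, beta = 3.0, 2.0                          # illustrative true values
mean, var = alpha * beta, alpha * beta**2       # population mean and variance
print(gamma_from_moments(mean, var))            # -> (3.0, 2.0)
```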

Iosif Pinelis
  • Is the fact that the MLE is always a function of a minimal sufficient statistic a direct consequence of the likelihood function itself being minimal sufficient? – Aaron Hendrickson Oct 31 '23 at 17:05
  • @AaronHendrickson : In general, the likelihood function is not minimal sufficient. E.g., consider the iid sample of size $n\ge2$ from the Poisson distribution with parameter $\theta$. – Iosif Pinelis Oct 31 '23 at 21:52
  • @IosifPinelis : I wonder whether the truth value of your latest comment depends on what you consider to be the likelihood function. For the example you mention, I might write $$ L(\lambda) \propto e^{-n\lambda} \lambda^{x_1+\cdots + x_n} $$ and say that's the likelihood function. But I imagine some (you?) might write $\text{“} {=} \text{”}$ rather than $\text{“} {\propto} \text{”}$ and include the reciprocal of the product of factorials, in which case the likelihood function would give some information not relevant to estimating $\lambda. \qquad$ – Michael Hardy Nov 10 '23 at 14:30
  • People who write about this don't always say which of those two conventions they have in mind. – Michael Hardy Nov 10 '23 at 14:39
  • @MichaelHardy : As was explained in my answer, in the context of sufficiency, by "the likelihood function" I mean the statistic $L_X$. This statistic may indeed contain "some information not relevant to estimating" the parameter, in the precise sense that $L_X$ may not be minimal sufficient -- which is what I said in my latter comment. – Iosif Pinelis Nov 10 '23 at 14:45
  • I didn't mean to say that your answer omits that. Rather, I was responding to your comment about what Aaron Hendrickson said. – Michael Hardy Nov 10 '23 at 14:47
  • $$ \operatorname{argmax}\limits_{\theta\in\Theta} L_X(\theta) \qquad \operatorname*{argmax}_{\theta\in\Theta} L_X(\theta) $$ I changed the first form above, coded as `\operatorname{argmax}\limits_{\theta\in\Theta} L_X(\theta)`, to the second, coded as `\operatorname*{argmax}_{\theta\in\Theta} L_X(\theta)`. I'm guessing the use of `\limits` was intended to affect the position of the subscript, but that didn't work. – Michael Hardy Nov 10 '23 at 15:06
  • `\limits` seems to be intended to cause the positioning of subscripts and superscripts in an "inline" setting to behave the way they otherwise would in a "displayed" setting. Since this instance of the use of `\operatorname{argmax}` was already in a "displayed" setting, `\limits` would have no effect. – Michael Hardy Nov 10 '23 at 15:21
  • @MichaelHardy : The comment was one under, and concerning, the answer. Do you think that in every comment under the answer I have to repeat the relevant definitions given in the answer? – Iosif Pinelis Nov 10 '23 at 15:24
  • @MichaelHardy : Have you just downvoted my answer? If so, why? – Iosif Pinelis Nov 10 '23 at 15:25
  • I upvoted your answer. – Michael Hardy Nov 10 '23 at 15:25
  • https://mathoverflow.net/questions/140481/what-is-a-likelihood-kernel If we take the "likelihood kernel" to be what I defined it to be in the posting linked here, we then have the question of whether the likelihood kernel is a minimal sufficient statistic. – Michael Hardy Nov 10 '23 at 15:27
  • @MichaelHardy : If the likelihood is defined, as apparently was done in your comment, up to a factor not depending on the parameter, then you do get a minimal sufficient statistic, according to the "useful characterization of minimal sufficiency". – Iosif Pinelis Nov 10 '23 at 15:42
  • It seems the way to think about this is from the perspective of how the sample space is partitioned. If we define the data reduction $\mathbf X\to L(\theta|\mathbf X)$ with the equivalence relation specifying that samples $\mathbf X$ that produce proportional likelihoods are equivalent then we do indeed get a minimal sufficient partition of the sample space. So while the likelihood is not a statistic in the usual sense, it combined with this equivalence relation serves to induce the same unique partitioning that a minimal sufficient statistic would. – Aaron Hendrickson Nov 10 '23 at 18:06

A "down-to-earth" observation to see what goes wrong with the method of moments is this:

When considering applying the method of moments to $(X_1,\ldots,X_n)$, you may as well apply it to the transformed data points $(g(X_1),\ldots,g(X_n))$ for any, say, smooth bijection $g$. So which $g$ is the right choice? It is easy to see in particular examples (e.g., $X_i\sim N(\theta,1)$) that choosing a wrong $g$ can make the asymptotic variance much worse.
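Here is an editorial simulation of that point, assuming $X_i\sim N(\theta,1)$ and the (arguably wrong) choice $g(x)=x^3$, for which $E\,g(X_i)=\theta^3+3\theta$; the sample size and number of replications are arbitrary.

```python
# Method of moments for N(theta, 1): raw data vs. the transformed data g(X) = X^3.
# Both estimators are consistent, but the transformed one has a larger spread.
import numpy as np
from scipy import optimize

rng = np.random.default_rng(3)
theta, n, n_rep = 1.0, 200, 5000

def mom_from_cubes(x):
    # E[X^3] = theta^3 + 3*theta for X ~ N(theta, 1); the map t -> t^3 + 3t is
    # strictly increasing, so the moment equation has a unique root.
    m = np.mean(x**3)
    return optimize.brentq(lambda t: t**3 + 3*t - m, -1e3, 1e3)

est_raw, est_cube = [], []
for _ in range(n_rep):
    x = rng.normal(theta, 1.0, size=n)
    est_raw.append(x.mean())            # MoM with g(x) = x (equal to the MLE here)
    est_cube.append(mom_from_cubes(x))  # MoM with g(x) = x^3

print("variance, g(x) = x  :", np.var(est_raw))    # about 1/n
print("variance, g(x) = x^3:", np.var(est_cube))   # noticeably larger
```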

The answer to "what is the right function $g$?" leads to the theory of minimal sufficient statistics and of the MLE, which provide estimators that do not depend on a particular choice of transformation $g$. For instance, if $X_i\sim f_\theta$ for some density $f_\theta$, the log-likelihood for $(X_1,\ldots,X_n)$ is $\theta\mapsto\sum_i \log f_\theta(X_i)$, and the log-likelihood for $(Y_1,\ldots,Y_n)=(g(X_1),\ldots,g(X_n))$ (which have density $F_\theta(y)=f_\theta(g^{-1}(y)) \left|\det(\nabla g^{-1}(y))\right|$) is $$ \theta\mapsto \sum_i \log F_\theta(Y_i) = \sum_i \log f_\theta(X_i) + \sum_i \log \left|\det(\nabla g^{-1}(Y_i))\right|. $$ The second term does not depend on $\theta$, so maximizing either log-likelihood with respect to $\theta$ gives the same $\hat\theta$.
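And a short editorial check of the invariance claim, assuming $X_i\sim N(\theta,1)$ and the bijection $g(x)=e^x$: the two log-likelihoods differ by a constant that does not involve $\theta$.

```python
# The log-likelihoods built from X and from Y = exp(X) differ only by a
# theta-free constant, so they have the same maximizer.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = rng.normal(1.0, 1.0, size=100)
y = np.exp(x)                                        # transformed data

def loglik_x(t):
    return np.sum(stats.norm.logpdf(x, loc=t, scale=1.0))

def loglik_y(t):
    # density of Y: f_theta(log y) * |d(log y)/dy| = f_theta(log y) / y
    return np.sum(stats.norm.logpdf(np.log(y), loc=t, scale=1.0) - np.log(y))

for t in (0.5, 1.0, 2.0):
    print(t, loglik_x(t) - loglik_y(t))              # the same constant for every theta
```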

jlewk