Looking for an understandable discussion of creating Maximum Entropy classifiers

Question

Texts, articles, and papers on Maximum Entropy Classifiers tend to come in two varieties: the more popular "upper level", and the more technical.

The popular variety is good at explaining the Maximum Entropy concept and why such classifiers are considered to be better than Naive Bayes classifiers (and also why they are harder to calculate).

However, the discussions on how the coefficients are determined are much more heavy going. Or they have been when I've tried to read them - perhaps I need some sleep and no customer interruptions :-)

Are there any intermediate-level descriptions of how maximum entropy coefficients are determined, especially with the gradient / quasi-Newton methods (rather than the iterative methods)? I.e., how are these methods used to determine the coefficients that best fit the training data? I've looked at code, and I think I have a conceptual problem understanding the connection between the classifier code itself and the gradient code (L-BFGS-B, CG, etc.).

@mbq Probably true, but it is definitely also on topic for StackOverflow, and the proposed Machine Learning; and Natural Language Processing & Computation Linguistics sites. A question can be on topic for multiple sites. — winwaed, Dec 08 '11 at 16:38

score 2 · Accepted Answer · answered Jan 07 '12 at 08:56

There's quite an extensive discussion in Jaynes, E. T., 2003, Probability Theory: The Logic of Science, Chapter 11 which is probably worth slogging through.

More generally, there is often a separation in stats/ML between the model being proposed and the optimisation method that is then used to fit it. In the case of simple models, the optimisation principles are often also simple. However, it is sometimes the case that a seemingly simple model requires complex optimisation procedures in order to be efficient, and I believe that is the case here. Of course, many different approaches can be taken to solve a single problem, and the most efficient one may depend on other factors (sample size, dimension, rank deficiency of the data, etc.). A recent book, Optimization for Machine Learning, collects a series of papers on the subject, which together provide fairly comprehensive coverage of the area.
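To make that separation concrete, here is a minimal sketch in Python, assuming NumPy and SciPy are available. It treats the maximum entropy classifier as multinomial logistic regression with an L2 penalty; the function names (`neg_log_likelihood`, `fit_maxent`, `predict`, `make_toy_data`), the penalty weight and the toy data are all my own illustrative choices, not taken from any particular library's classifier code. The "classifier code" only has to supply two things, the value of the objective and its gradient with respect to the coefficients. The "gradient code" (here SciPy's L-BFGS-B) is a generic routine that knows nothing about classification and just searches for the coefficients that minimise whatever function it is handed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp


def neg_log_likelihood(w_flat, X, y, n_classes, l2=1e-3):
    """Model side: return (objective, gradient) for the current coefficients."""
    n_samples, n_features = X.shape
    W = w_flat.reshape(n_classes, n_features)

    scores = X @ W.T                                    # (n_samples, n_classes)
    log_probs = scores - logsumexp(scores, axis=1, keepdims=True)
    probs = np.exp(log_probs)

    # Objective: negative log-likelihood of the training labels + L2 penalty.
    nll = -log_probs[np.arange(n_samples), y].sum() + 0.5 * l2 * np.sum(W ** 2)

    # Gradient: feature counts expected under the model minus observed counts.
    onehot = np.zeros_like(probs)
    onehot[np.arange(n_samples), y] = 1.0
    grad = (probs - onehot).T @ X + l2 * W              # (n_classes, n_features)
    return nll, grad.ravel()


def fit_maxent(X, y, n_classes, l2=1e-3):
    """Optimiser side: hand the objective/gradient pair to a generic routine."""
    w0 = np.zeros(n_classes * X.shape[1])
    result = minimize(
        neg_log_likelihood, w0,
        args=(X, y, n_classes, l2),
        jac=True,               # the function returns (value, gradient)
        method="L-BFGS-B",      # "CG" would accept the same inputs
    )
    return result.x.reshape(n_classes, X.shape[1]), result


def predict(W, X):
    """Pick the class with the highest score for each row of X."""
    return np.argmax(X @ W.T, axis=1)


def make_toy_data(seed=0, n_per_class=100):
    """Hypothetical data: two Gaussian blobs plus a constant bias feature."""
    rng = np.random.default_rng(seed)
    a = rng.normal(loc=-2.0, scale=1.0, size=(n_per_class, 2))
    b = rng.normal(loc=2.0, scale=1.0, size=(n_per_class, 2))
    X = np.vstack([a, b])
    X = np.hstack([X, np.ones((X.shape[0], 1))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X, y = make_toy_data()
W, result = fit_maxent(X, y, n_classes=2)
accuracy = np.mean(predict(W, X) == y)
```

Swapping `method="L-BFGS-B"` for `method="CG"` changes how the search proceeds but not the model: both only ever call `neg_log_likelihood` with trial coefficients and use the returned value and gradient to choose the next trial. Real implementations typically differ in the details (sparse features, other regularisers, stopping rules), so treat this as an illustration of the structure rather than a reference implementation.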

Thanks I'll follow up on those books. With a quick glance I see the probability book is getting good reviews. — winwaed, Jan 07 '12 at 14:16
