Maximum Entropy Density Estimation and Modeling Geographic Distributions of Species (thesis)
Abstract:
The maximum entropy (maxent) approach, formally equivalent to maximum
likelihood, is a widely used density-estimation method. When input
datasets are small, maxent is likely to overfit. Overfitting can be
eliminated by various smoothing techniques, such as regularization and
constraint relaxation, but theory explaining their properties is often
missing or needs to be derived for each case separately. In this
dissertation, we propose a unified treatment for a large and general
class of smoothing techniques. We provide fully general guarantees on
their statistical performance and propose optimization algorithms with
complete convergence proofs. As special cases, we can easily derive
performance guarantees for many known regularization types including
L1 and L2-squared regularization. Furthermore, our general approach
enables us to derive entirely new regularization functions with
superior statistical guarantees. The new regularization functions use
information about the structure of the feature space, incorporate
information about sample selection bias, and combine information
across several related density-estimation tasks. We propose algorithms
solving a large and general subclass of generalized maxent problems,
including all those discussed in this dissertation, and prove their
convergence. Our convergence proofs generalize techniques based on
information geometry and Bregman divergences as well as those based
more directly on compactness.

As an application of maxent, we discuss an important problem in
ecology and conservation: the problem of modeling geographic
distributions of species. Here, small sample sizes hinder accurate
modeling of rare and endangered species. Generalized maxent offers
several advantages over previous techniques. In particular,
generalized maxent addresses the problem in a statistically sound
manner and allows principled extensions to situations when data
collection is biased or when we have access to data on many related
species. The utility of our unified approach is demonstrated in
comprehensive experiments on large real-world datasets. We find that
generalized maxent is among the best-performing species-distribution
modeling techniques. Our experiments also show that the contributions
of this dissertation, i.e., regularization strategies, bias-removal
approaches, and multiple-estimation techniques, all significantly
improve the predictive performance of maxent.
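
To make the link between regularization and constraint relaxation concrete, the following is a minimal sketch of L1-regularized maxent on a small finite domain. It is not the algorithm proposed in the dissertation: plain proximal gradient descent (soft thresholding) is swapped in for illustration. The domain, the two features, the sample, and the value of beta are all hypothetical. The sketch relies on the standard duality fact that minimizing the log loss plus beta times the L1 norm of the weights yields a Gibbs distribution whose feature expectations lie within beta of the empirical averages.

```python
import math

def gibbs(lam, feats):
    """Gibbs distribution q(x) proportional to exp(sum_j lam[j] * f_j(x))."""
    scores = [sum(l * f for l, f in zip(lam, fx)) for fx in feats]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [w / z for w in weights]

def expectations(q, feats):
    """Expected value of each feature under the distribution q."""
    n_feats = len(feats[0])
    return [sum(p * fx[j] for p, fx in zip(q, feats)) for j in range(n_feats)]

def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    return math.copysign(max(abs(v) - t, 0.0), v)

def fit_l1_maxent(feats, sample_idx, beta, step=1.0, iters=2000):
    """Minimize (average log loss) + beta * ||lam||_1 by proximal gradient.

    The gradient of the log loss with respect to lam[j] is
    E_q[f_j] - (empirical average of f_j).
    """
    n_feats = len(feats[0])
    emp = [sum(feats[i][j] for i in sample_idx) / len(sample_idx)
           for j in range(n_feats)]
    lam = [0.0] * n_feats
    for _ in range(iters):
        model = expectations(gibbs(lam, feats), feats)
        lam = [soft_threshold(l - step * (m - e), step * beta)
               for l, m, e in zip(lam, model, emp)]
    return lam, gibbs(lam, feats), emp

# Hypothetical toy problem: domain {0, ..., 9} with two features in [0, 1].
domain = list(range(10))
feats = [(x / 9.0, 1.0 if x % 2 == 0 else 0.0) for x in domain]
samples = [1, 2, 2, 3, 3, 3, 4, 5]  # indices into the domain (assumed data)
beta = 0.05                          # assumed regularization strength

lam, q, emp = fit_l1_maxent(feats, samples, beta)
model = expectations(q, feats)
for j in range(len(lam)):
    print(f"feature {j}: weight={lam[j]:.4f} "
          f"empirical={emp[j]:.4f} model={model[j]:.4f}")
```

In this sketch the fitted feature expectations need not match the empirical averages exactly; they are only required to fall within beta of them, which is how the regularized problem avoids fitting a small sample too closely. The step size of 1.0 is safe here because both features take values in [0, 1], so the log loss has a bounded curvature; other feature scalings would need a different step.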