Nonmonotonic Multicontext Models

Report ID: TR-486-95
Authors: Ristad, Eric Sven / Thomas, Robert G.
Date: February 1995
Pages: 25
Download Formats: PostScript
Abstract:

We introduce three new techniques for statistical language models: multicontextual modeling, nonmonotonic contexts, and the divergence heuristic. Together these techniques yield language models that have few states, even fewer parameters, and low message entropies. For example, our techniques achieve a message entropy of 1.97 bits/char on the Brown corpus using only 94,352 parameters. By modestly increasing the number of model parameters in a principled manner, our techniques further reduce the message entropy of the Brown corpus to 1.92 bits/char. In contrast, the character quadgram model requires more than 236 times as many parameters to achieve a message entropy of only 2.59 bits/char. Given the logarithmic nature of codelengths, a savings of 0.62 bits/char is quite significant.
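
As a rough sanity check on these figures, the arithmetic can be reproduced in a few lines of Python. The alphabet size and corpus length below are illustrative assumptions, not values taken from the report; the point is that a table-based quadgram model scales with the fourth power of the alphabet, and that even a fraction of a bit per character compounds into a substantial reduction in total compressed size.

    # Back-of-the-envelope check of the abstract's figures.
    # Assumptions (not from the report): a character alphabet of about
    # 70 symbols, and a Brown corpus of roughly 6 million characters.

    multicontext_params = 94_352   # parameters, per the abstract
    quadgram_params = 70 ** 4      # ~24M table entries for |alphabet| = 70
    ratio = quadgram_params / multicontext_params
    print(f"quadgram/multicontext parameter ratio: {ratio:.0f}x")
    # ~254x, consistent with the abstract's "more than 236 times"

    # Entropy savings: 2.59 - 1.97 = 0.62 bits/char.
    savings_bits_per_char = 2.59 - 1.97
    corpus_chars = 6_000_000       # assumed corpus size
    saved_bytes = savings_bits_per_char * corpus_chars / 8
    print(f"savings over the corpus: {saved_bytes / 1e6:.2f} MB")
    # ~0.47 MB saved; equivalently, output ~24% smaller overall:
    print(f"compressed size ratio: {1.97 / 2.59:.2f}")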