Here are two quotations from the opposing candidates in the 2004 Presidential Debates:
I proposed a constitutional amendment. The reason I did so was because I was worried that activist judges are actually defining the definition of marriage, and the surest way to protect marriage between a man and woman is to amend the Constitution.
I believe marriage is between a man and a woman. But I also believe that because we are the United States of America, we're a country with a great, unbelievable Constitution, with rights that we afford people, that you can't discriminate in the workplace.
In this assignment, your task will be to write a computer program that can (usually) predict whether a quote such as these are more likely to have been said by George W. Bush or John F. Kerry. There are many reasonable approaches to this problem. In this assignment, we will consider one of these: We will begin by building a statistical model of the kind of language used by each of the candidates. Then, given a new piece of text, we will determine who said it by measuring how well it fits each of the statistical models, and selecting the one that gives the best fit.
Markov models. The probabilistic language models we will be using are called Markov models. Markov models are at the heart of modern speech recognition systems, and are used for a broad range of natural language processing tasks, as well as many other problems relevant to artificial intelligence. They are widely used to model all sorts of dynamical processes in engineering, mathematics, finance and many other areas.
A Markov model defines a probability distribution over sequences of symbols. Said differently, a Markov model defines a probabilistic mechanism for randomly generating sequences over some alphabet of symbols. In a zero-th order Markov model, the symbols in the sequence are generated with a fixed probability that does not in any way depend on other symbols in the sequence. For instance, the alphabet might consist of just the letters a and b, and the zero-th order Markov model might specify that each symbol in the sequence is a with probability 2/3 and b with probability 1/3. Thus, such a zero-th order Markov model might generate a sequence like the following:
a b a b a a a b b b a a a a b b b a a b a a a a a b a a a a b a a a b b
In a first order Markov model, the probability of the current symbol that is being generated can depend on preceding symbol. For instance, consider a sequence in which every occurrence of b is always followed by a, while an a is followed with equal probability by either a or b. For instance, such a model might generate a sequence like the following.
b a b a b a b a a b a b a b a b a a b a a b a a a a a a b a b a a a a b
In this example, the sequence is likely to have about the same proportion of a's and b's as in the preceding example. However, the model is first order rather than zero-th order because whether or not the current symbol is a or b significantly affects the probability of what symbol will be generated next.
This idea can be generalized to a k-th order Markov model in which the probability of the current symbol can depend on the preceding k symbols.
Markov models are frequently used to build a probabilistic model or statistical estimate of a natural language. In modeling ordinary English text, a zero-th order model is clearly inadequate for most purposes, capturing only the frequencies of each letter. On the other hand, a first order Markov model can capture the notion, for instance, that a "q" is nearly always followed by "u". Higher order Markov models can capture the tendency of the letters "proba" to be followed by "b" or "t". (In other language applications, entire words are regarded as the "letters" of the "alphabet" so that the Markov model provides a model of the sequences of words rather than letters that are likely to be observed. We will not consider such a framework in this assignment, but you are welcome to explore it on your own.)
Language modeling. The first part of the assignment is to implement a class called Markov.java that can build a k-th order Markov model from text stored in a given file. (Eventually, you will build two objects using this class, one for Bush and one for Kerry.) Constructing a k-th order Markov model from text sequences is mostly a matter of counting. For each sequence p of length k (let us call it the context), we need to estimate the probability of p being followed by each letter c in our alphabet. Given a text sample, this probability can be estimated by the number of times that p is followed by c, divided by the number of times that p appears as a context at all. That is, we can estimate the probability of c following p by N(pc)/N(p_), where N(pc) denotes the number of times we observe the concatenated sequence pc, and N(p_) denotes the number of times that we observe p followed by any other letter in the alphabet.
Unfortunately, this estimate will be problematic if some of the counts are zero, as is certain to happen on real data. Therefore, instead, we will use a different "smoothed" estimate, namely (N(pc)+1) / (N(p_) + S), where S is the size of our alphabet. This form of smoothing, called Laplace smoothing, has theoretical justification coming from probability theory, and ensures that zero counts will not be a problem.
For instance, if the data consists the two sequences "aabacb" and "abaabcaac", and we are using a second-order Markov model (k=2) and the three-letter alphabet {a,b,c}, then we can estimate the probability that "aa" is followed by "b" by (2 + 1)/(3 + 3) = 1/2 since N(aa_) = 3, N(aab) = 2 and S=3. Similarly, the estimated probability of "a" following "aa" is 1/6, and the estimated probability of "c" following "aa" is 1/3.
So, in constructing a Markov model, your first job is to write code that will compute the appropriate counts N(.) by scanning the text file and counting how many times all sequences of the required lengths appear in the file. These counts should be stored appropriately for later use.
Determining the likelihood of a new sequence. Next, we will want to use the Markov model that you constructed from data to compute a measure of how well that model fits a new text sequence. Later, we will use this measure to determine which of two models (one for Bush and one for Kerry) better fits a test sequence. To compute a measure of fit, we will compute the probability of the model generating the new sequence, a quantity usually called the likelihood of the sequence under the model. For each symbol c in the sequence, we can compute the probability of observing c under the model, given its k-letter context p. In particular, this is just the Laplace-smoothed estimate given above. To compute the likelihood, or probability of the entire sequence, we can multiply these probabilities together, for all symbols in the sequence. Although correct mathematically, on a real computer, this procedure will lead to a product that is mathematically positive but indistinguishable from zero on a real computer. Therefore, rather than multiplying probabilities, you will need to work with log probabilities. That is, for each symbol c in the sequence, you will need to compute the log of its probability, and you will then add (rather than multiply) these log probabilities to arrive at the final log likelihood of the sequence. Keep in mind that all of these log probabilities will be negative (since probabilities are never more than one).
Your second job is to compute, for a given string, and for each position i in the string, an estimate of the log probability of the character at position i, given its k-character context. The log likelihood of this entire sequence can then be computed as the sum of these log probabilities.
For instance, continuing the example above, having constructed the second order (k=2) Markov model, suppose we are given a new string "aaca". We can compute the log probability of each of the four symbols in this string as follows (where the first column gives the context of each of the symbols in the second column):
|| a log((2+1)/(2+3)) = -0.51082562
|a a log((1+1)/(2+3)) = -0.91629073
aa c log((1+1)/(3+3)) = -1.09861229
ac a log((0+1)/(1+3)) = -1.38629436
TOTAL log likelihood = -3.91202301
AVERAGE log likelihood = -0.97800575
A few things to notice here: First, to handle the beginning of the string where characters have a context that is shorter than k, we pretend that the string was preceded by an infinitely long sequence of special characters not appearing elsewhere in the text, say the symbol "|". This means, for instance, that the first character in each string has a length-k context consisting entirely of this symbol. (See the checklist for more on this issue.) Second, in computing the counts N(p_) for contexts p, we only count places where p was a context. This is why, in the example above, N(ac_) = 1 rather than 2.
Choosing the best model. Now that we have a method of measuring the fit of a model to a particular string, we can choose the best model to be the one that maximizes the likelihood of the string. In other words, we can simply measure the likelihood of the string under each model and choose the one that is greatest.
Write a program BestModel.java that takes three command line arguments: the value of the order parameter k, and two files containing excerpts from the debates for the two candidates (or other text that you might choose to experiment with). Your program should build models for each of the candidates, and then read in test text sequences, printing out classifications in the form given below.
Input format. Each of the text files has the following format:
| tag1
Here is the first text block. The text
might go on and on.
Paragraphs are okay too.
| tag2
Here is the second text block.
...
Each file contains a sequence of text blocks separated by lines of the form "| tag " where the tag is an arbitrary sequence of letters intended only as a name for each of the text blocks. Files of this form can be read using the MarkovIn.java code that we are providing. Note that each file consists of multiple blocks of text; even so, you should build a single model for each file based on counting letter patterns observed in all of the text blocks.
Output format. For each text block read in from standard input, your program should output the tag of the text block and its average likelihood under each model (that is, the likelihood as computed above, divided by the length of the sequence -- this is simply to prevent unusually long sequences from seeming more significant than they are). Also, print the difference of these two numbers; the sign of the difference will indicate which model is more likely, and the absolute value of the difference can be interpreted as a measure of confidence in this prediction.
To get an idea of why the algorithm is making its predictions, you should also find and print the ten positions in the test sequence where the difference in the log probabilities of the two models are greatest. That is, if the log probability of character i (given its context) under model j is l[j][i], then you should find the 10 indices i for which |l[0][i] - l[1][i]| is greatest. For each of these, print i's context, character i, l[0][i], l[1][i] and l[0][i]-l[1][i]. For instance, you might get output like this:
% java BestModel 2 bush1+2.txt kerry1+2.txt < bush3.txt
bush3-1: -2.120 -2.193 +0.073
spr -2.730 -4.970 +2.240
eek -5.624 -3.461 -2.163
eek -5.624 -3.461 -2.163
ari -6.016 -3.903 -2.113
siv -3.850 -5.903 +2.052
siv -3.850 -5.903 +2.052
siv -3.850 -5.903 +2.052
ban -2.356 -4.182 +1.826
sig -2.752 -4.516 +1.765
fea -2.657 -4.407 +1.750
bush3-2: -2.146 -2.155 +0.009
Of -2.375 -4.344 +1.969
My -1.396 -2.996 +1.599
sal -4.963 -3.422 -1.541
sal -4.963 -3.422 -1.541
Of -2.868 -4.394 +1.527
My -1.846 -3.308 +1.462
nen -2.281 -3.705 +1.424
k I -5.389 -4.007 -1.382
dis -1.955 -3.296 +1.341
dis -1.955 -3.296 +1.341
...
The first line indicates that on the text block called "bush3-1", the average likelihood under the model built using bush1+2.txt was -2.120, and -2.193 under the kerry1+2.txt model. The difference of these numbers is +0.073, which, being positive, indicates an overall prediction that this quote was uttered by Bush. The next line indicates that the most significant difference in log probabilities between the two models came in predicting "r" under context "sp" where the Bush log probability was -2.730 and the Kerry log probability was -4.970, the difference being +2.240.
When printing characters from the text, you should convert all white space to ordinary spaces.
Data files. We are providing data files from the three presidential debates. You should train using the first two debates and test on the third. For amusement, you might also see how your Bush/Kerry models predict when attempting to classify quotes from John Edwards and Dick Cheney taken from the vice presidential debate.
Analysis. Analyze your approach to this problem giving estimates of its time and space requirements. Also, critically explore the effectiveness of this technique for this task.
Submission. Your main program should be named BestModel.java. Submit all of the other files needed by your program (e.g., Markov.java) except for MarkovIn.java, which we will provide. A program report must also be submitted electronically. (See the checklist for details.)
Extra credit. For extra credit, you can apply this program to other related tasks (for instance, distinguishing two authors, distinguishing German text from English, etc.) Submit any data files that you create (or, if they are too big for whiteboard, tell us how we can get them). Write a description of how well it performs.
This assignment was developed by Rob Schapire.