Here are two quotations from the opposing candidates in the 2004 Presidential Debates:
I proposed a constitutional amendment. The reason I did so was because I was worried that activist judges are actually defining the definition of marriage, and the surest way to protect marriage between a man and woman is to amend the Constitution.
I believe marriage is between a man and a woman. But I also believe that because we are the United States of America, we're a country with a great, unbelievable Constitution, with rights that we afford people, that you can't discriminate in the workplace.
In this assignment, your task will be to write a computer program that can (usually) predict whether a quote such as these are more likely to have been said by George W. Bush or John F. Kerry. There are many reasonable approaches to this problem. In this assignment, we will consider one of these: We will begin by building a statistical model of the kind of language used by each of the candidates. Then, given a new piece of text, we will determine who said it by measuring how well it fits each of the statistical models, and selecting the one that gives the best fit.
Markov models. The probabilistic language models we will be using are called Markov models. Markov models are at the heart of modern speech recognition systems, and are used for a broad range of natural language processing tasks, as well as many other problems relevant to artificial intelligence. They are widely used to model all sorts of dynamical processes in engineering, mathematics, finance and many other areas.
A Markov model defines a probability distribution over sequences of symbols. Said differently, a Markov model defines a probabilistic mechanism for randomly generating sequences over some alphabet of symbols. In a zero-th order Markov model, the symbols in the sequence are generated with a fixed probability that does not in any way depend on other symbols in the sequence. For instance, the alphabet might consist of just the letters a and b, and the zero-th order Markov model might specify that each symbol in the sequence is a with probability 2/3 and b with probability 1/3. Thus, such a zero-th order Markov model might generate a sequence like the following:
a b a b a a a b b b a a a a b b b a a b a a a a a b a a a a b a a a b b
In a first order Markov model, the probability of the current symbol that is being generated can depend on preceding symbol. For instance, consider a sequence in which every occurrence of b is always followed by a, while an a is followed with equal probability by either a or b. Such a model might generate a sequence like the following.
b a b a b a b a a b a b a b a b a a b a a b a a a a a a b a b a a a a b
In this example, the sequence is likely to have about the same proportion of a's and b's as in the preceding example. However, the model is first order rather than zero-th order because whether or not the current symbol is a or b significantly affects the probability of what symbol will be generated next.
This idea can be generalized to a k-th order Markov model in which the probability of the current symbol can depend on the preceding k symbols.
Markov models are frequently used to build a probabilistic model or statistical estimate of a natural language. In modeling ordinary English text, a zero-th order model is clearly inadequate for most purposes, capturing only the frequencies of each letter. On the other hand, a first order Markov model can capture the notion, for instance, that a q is nearly always followed by u. Higher order Markov models can capture the tendency of the letters proba to be followed by b or t. (In other language applications, entire words are regarded as the "letters" of the "alphabet" so that the Markov model provides a model of the sequences of words rather than letters that are likely to be observed. We will not consider such a framework in this assignment, but you are welcome to explore it on your own.)
Language modeling. The first part of the assignment is to implement a class called MarkovModel.java that can build a k-th order Markov model from text stored in a given file. (Eventually, you will build two objects using this class, one for Bush and one for Kerry.) Constructing a k-th order Markov model from text sequences is mostly a matter of counting. For each sequence p of length k (let us call it the context), we need to estimate the probability of p being followed by each letter c in our alphabet. Given a text sample, this probability can be estimated by the number of times that p is followed by c, divided by the number of times that p appears as a context at all. That is, we can estimate the probability of c following p by N(p·c) / N(p), where N(p·c) denotes the number of times we observe the concatenated sequence p·c, and N(p) denotes the number of times that we observe p.
Unfortunately, this estimate will be problematic if some of the counts are zero, as is certain to happen on real data. Therefore, instead, we will use a different "smoothed" estimate, namely (N(p·c) + 1) / (N(p) + S), where S is the size of our alphabet. This form of smoothing, called Laplace smoothing, has theoretical justification coming from probability theory, and ensures that zero counts will not be a problem.
For example, if the input text is "aabcabaacaac", and we are using a second-order Markov model (k = 2) and the three-letter alphabet is {a, b, c}, then we can estimate the probability that "aa" is followed by 'c'; by (2 + 1) / (3 + 3) = 1/2 since N(aa) = 3, N(aac) = 2 and S = 3. Similarly, the probabilities that "aa" is followed by 'a' and 'b' is 1/6 and 1/3, respectively. Note that the probabilities sum to 1.
One thing to notice here: To handle the beginning and end of the string, we treat the string as circular. Thus, N(caa) = 2 instead of 1.
So, in constructing a Markov model, your first job is to write code that will compute the appropriate counts N(·) by scanning the text file and counting how many times all sequences of the required lengths appear in the file. These counts should be stored appropriately for later use. You should also compute and record the alphabet size S. Organize your program by creating a data type MarkovModel so that the first line below computes the number of occurrences of each substring from s of size k and k+1; the second line prints it to standard output; and remaining lines compute and print Laplace smoothed probability estimates.
For example, when k = 2 and s = "aabcabaacaac", this code fragment should print the following.MarkovModel model = new MarkovModel(k, s); System.out.println(model); System.out.printf("%.4f\n", model.laplace("aac")); System.out.printf("%.4f\n", model.laplace("aaa")); System.out.printf("%.4f\n", model.laplace("aab"));
S = 3 "aa" 3 "ab" 2 "ac" 2 "ba" 1 "bc" 1 "ca" 3 "aab" 1 "aac" 2 "aba" 1 "abc" 1 "aca" 2 "baa" 1 "bca" 1 "caa" 2 "cab" 1 0.5000 0.1667 0.3333
Determining the likelihood of a new sequence. Next, we will want to use the Markov model that you constructed from data to compute a measure of how well that model fits a new text sequence. Later, we will use this measure to determine which of two models (one for Bush and one for Kerry) better fits a test sequence. To compute a measure of fit, we will compute the probability of the model generating the new sequence, a quantity usually called the likelihood of the sequence under the model. For each symbol c in the sequence, we can compute the probability of observing c under the model, given its k-letter context p. In particular, this is just the Laplace-smoothed estimate given above. To compute the likelihood, or probability of the entire sequence, we can multiply these probabilities together, for all symbols in the sequence. Although correct mathematically, the resulting product might be indistinguishable from zero using floating point arithmetic. Therefore, rather than multiplying probabilities, you will need to work with log probabilities. That is, for each symbol c in the sequence, you will need to compute the log of its probability, and you will then add (rather than multiply) these log probabilities to arrive at the final log likelihood of the sequence. Keep in mind that all of these log probabilities will be negative (since probabilities are never more than one).
Your second job is to compute, for a given string, and for each position i in the string, an estimate of the log probability of the character at position i, given its k-character context. The log likelihood of this entire sequence can then be computed as the sum of these log probabilities.
For instance, continuing the example above, having constructed the second order (k = 2) Markov model, suppose we are given a new string "aabca". We can compute the log probability of each of the four symbols in this string as follows (where the first column gives the context of each of the symbols in the second column):
context c log probability ------------------------------------------------ "aa" 'b' log((1 + 1) / (3 + 3)) = -1.0986 "ab" 'c' log((1 + 1) / (2 + 3)) = -0.9163 "bc" 'a' log((1 + 1) / (1 + 3)) = -0.6931 "ca" 'a' log((2 + 1) / (3 + 3)) = -0.6931 "aa" 'a' log((0 + 1) / (3 + 3)) = -1.7918 TOTAL log likelihood = -5.1930 AVERAGE log likelihood = -1.0386
One thing to notice here: To handle the beginning and end of the string, we again treat the string as circular. Thus, in "aabca", the character following the context "ca" is 'a'.
Choosing the best model. Now that we have a method of measuring the fit of a model to a particular string, we can choose the best model to be the one that maximizes the likelihood of the string. In other words, we can simply measure the likelihood of the string under each model (Bush or Kerry) and choose the one that is greatest.
Write a program BestModel.java as follows: its first argument is the order parameter k; its next two arguments are the names of the files containing excerpts from the two candidates (or other text that you might choose to experiment with). The remaining arguments are the names of the files containing text sequences you wish to classify. Your program should build models for each of the two candidates, read in each text sequence, and classify it using the output format described below.
Input format. Each file is a sequence of Unicode characters. We provide data files from the three presidential debates. You should train using the first two debates and test on the third. For amusement, you might also see how your Bush/Kerry models predict when attempting to classify quotes from John Edwards and Dick Cheney taken from the vice presidential debate. Read the files using the In.java library that we provide.
In in = new In("filename.txt"); String s = in.readAll();
Output format. For each text sequence, your program should output the filename of the text block and its average likelihood under each model (that is, the likelihood as computed above, divided by the length of the sequence -- this is simply to prevent unusually long sequences from seeming more significant than they are). Also, print the difference of these two numbers; the sign of the difference will indicate which model is more likely, and the absolute value of the difference can be interpreted as a measure of confidence in this prediction.
To get an idea of why the algorithm is making its predictions, you should also find and print the ten positions in the test sequence where the difference in the log probabilities of the two models are greatest. That is, if the log probability of character i (given its context) under model j is lj[i], then you should find the 10 indices i for which |l1[i] - l2[i]| is greatest. For each of these, print i's context, character i, l1[i], l2[i] and l1[i] - l2[i]. For instance, you might get output like this:
%java BestModel 2 bush1+2.txt kerry1+2.txt bush3-00.txt bush3-01.txt bush3-00.txt -2.1131 -2.1895 +0.0764 "spr" -2.715 -4.963 +2.247 "eek" -5.617 -3.457 -2.160 "eek" -5.617 -3.457 -2.160 "ari" -6.011 -3.901 -2.110 "siv" -3.843 -5.900 +2.057 "siv" -3.843 -5.900 +2.057 "siv" -3.843 -5.900 +2.057 "ban" -2.338 -4.174 +1.836 "sig" -2.744 -4.514 +1.769 "fea" -2.645 -4.401 +1.756 bush3-01.txt -2.1467 -2.1607 +0.0140 "Of " -2.351 -4.331 +1.979 " Of" -2.543 -4.382 +1.839 " My" -1.674 -3.308 +1.634 "My " -1.376 -2.983 +1.607 "sal" -4.956 -3.418 -1.538 "sal" -4.956 -3.418 -1.538 "nen" -2.275 -3.703 +1.428 " Go" -2.970 -4.357 +1.386 "k I" -5.380 -4.003 -1.377 "dis" -1.946 -3.292 +1.346
The first line indicates that for the file bush3-00.txt, the average likelihood under the model built using bush1+2.txt was -2.1131, and -2.1895 under the kerry1+2.txt model. The difference of these numbers is +0.0764, which, being positive, indicates an overall prediction that this quote was uttered by Bush. The next line indicates that the most significant difference in log probabilities between the two models came in predicting 'r' under context "sp" where the Bush log probability was -2.715 and the Kerry log probability was -4.963, the difference being +2.247.
When printing characters from the text, you should convert all white space to ordinary spaces.
Analysis. Analyze your approach to this problem giving estimates of its time and space requirements. Also, critically explore the effectiveness of this technique for this task.
Submission. Your main program should be named BestModel.java. You should also have a data type MarkovModel.java that implements the interface described above. Also submit any other files needed by your program (excluding those in stdlib.jar and adt.jar). Finally, submit a readme.txt file and answer the questions.
Extra credit I. For extra credit, you can apply this program to other related tasks (for instance, distinguishing two authors, distinguishing a van Gogh painting from a fake, distinguising spam from ham). Write a description of what you did and submit any data files that you use.
Extra credit II. For extra credit, be the first student to email Kevin a version of the 2008 Presidential and Vice Presidential debates in the input format of our other debate files. You can download the transcripts from www.debates.org.
The extra credit has been claimed.
This assignment was developed by Rob Schapire (and slightly adapted by Kevin Wayne).