COS 126 Markov Model of Natural Language
Programming Assignment. Due: 11:55pm
Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.
Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today's Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications, including the GeneMark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamic properties, and Google's PageRank algorithm for Web search. For this assignment, you will consider a whimsical variant: generating stylized pseudo-random text.
Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these counts as probabilities. For example, if the input text is "agggcagcgggcg", then the Markov model of order 0 predicts that each letter is 'a' with probability 2/13, 'c' with probability 3/13, and 'g' with probability 8/13. The following sequence of letters is a typical example generated from this model.
a g g c g a g g g a g c g g c a g g g g ...
An order 0 model assumes that each letter is chosen independently. This does not coincide with the statistical properties of English text, since there is a high correlation among successive letters in an English word or sentence. For example, the letters 'h' and 'r' are much more likely to follow 't' than either 'c' or 'x'.
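Fitting an order 0 model amounts to counting letters. The following is a minimal sketch of that step (the class and method names here are ours for illustration, not part of the assignment API):

```java
import java.util.Map;
import java.util.TreeMap;

// fit an order 0 Markov model by counting letter occurrences
public class OrderZero {
    static Map<Character, Integer> counts(String text) {
        Map<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray())
            freq.merge(c, 1, Integer::sum);   // count occurrences of each letter
        return freq;
    }

    public static void main(String[] args) {
        String text = "agggcagcgggcg";
        // each letter's probability is its count divided by the text length
        counts(text).forEach((c, n) ->
            System.out.println(c + " with probability " + n + "/" + text.length()));
    }
}
```

Running this on "agggcagcgggcg" reproduces the probabilities 2/13, 3/13, and 8/13 quoted above.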
We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters (k-gram). For example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.
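Tabulating the statistics for an order k model means recording, for each k-gram, how often each letter follows it. A sketch of that bookkeeping (the input string and all names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// map each k-gram in the text to the counts of the letters that follow it
public class KGramCounts {
    static Map<String, Map<Character, Integer>> build(String text, int k) {
        Map<String, Map<Character, Integer>> freq = new HashMap<>();
        for (int i = 0; i + k < text.length(); i++) {
            String gram = text.substring(i, i + k);   // the k preceding letters
            char next = text.charAt(i + k);           // the letter that follows
            freq.computeIfAbsent(gram, g -> new HashMap<>()).merge(next, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // in this made-up input, "ga" is followed by 'g' four times and 'a' once,
        // so an order 2 model predicts 'g' with probability 4/5 after "ga"
        System.out.println(build("gagggagaggcgagaaa", 2).get("ga"));
    }
}
```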
A brute force solution. Claude Shannon proposed a brute force scheme to generate text according to a Markov model of order 1.
To construct [an order 1 model] for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage.
Your task is to write an efficient Java program to automate this laborious task. Shannon's brute-force approach yields a statistically correct result, but, as Shannon observed, it is prohibitively slow when the size N of the input text is large. An alternate approach is to create a "Markov chain" and simulate a trajectory through it. This approach is described below.
Markov chain. For the purpose of this assignment, a Markov chain consists of a set of states, one distinguished state called the start state, and a set of transitions from one state to another. The figure below illustrates a Markov chain with 5 states and 14 transitions. Notice that there is no state with zero transitions leaving it; if there were such a state, the model could get "stuck" in it.
The order 3 Markov model of "bbbabbabbbbaba" (see below)
Simulating a Markov chain. To simulate a trajectory through a Markov chain, begin at the start state (the one with an incoming arrow from nowhere). At each step select one of the leaving arcs uniformly at random, and move to the neighboring state. For example, if the Markov chain is in state bab, then it will transition to state abb with probability 3/4 and to state aba with probability 1/4. The following are the first ten steps of a possible trajectory beginning from bbb:
bbb -> bbb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> abb
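One simulation step can be sketched as a uniformly random pick from the list of observed successors (the class name and hardcoded list below are illustrative, not part of the required API):

```java
import java.util.List;
import java.util.Random;

// one simulation step: pick an outgoing transition uniformly at random
public class Step {
    static String next(List<String> successors, Random rng) {
        return successors.get(rng.nextInt(successors.size()));
    }

    public static void main(String[] args) {
        // in the order 3 model of "bbbabbabbbbaba", state bab is followed
        // by abb three times and aba once, hence the 3/4 and 1/4 above
        List<String> fromBab = List.of("abb", "abb", "abb", "aba");
        System.out.println(next(fromBab, new Random()));
    }
}
```

Storing each successor once per occurrence (with duplicates) is what makes a uniform pick reproduce the desired probabilities.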
Random queue data type. Create a data type RandomQueue to represent the list of neighbors and support choosing one uniformly at random. The data type must be generic and implement the following API:
public class RandomQueue<Item>  //  generic random queue
------------------------------------------------------------------------------------------
       RandomQueue()            //  create an empty queue
  void add(Item item)           //  add the given item to the queue
  Item sample()                 //  return a random item from the queue
String toString()               //  return a string representation of this queue
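One possible implementation, backed by an ArrayList; the assignment leaves the underlying representation up to you, and a resizing array of your own would work equally well:

```java
import java.util.ArrayList;
import java.util.Random;

// generic random queue: add items, then sample one uniformly at random
public class RandomQueue<Item> {
    private final ArrayList<Item> items = new ArrayList<>();
    private final Random rng = new Random();

    public void add(Item item) {
        items.add(item);
    }

    // return (but do not remove) an item chosen uniformly at random
    public Item sample() {
        if (items.isEmpty()) throw new RuntimeException("sample from empty queue");
        return items.get(rng.nextInt(items.size()));
    }

    public String toString() {
        return items.toString();
    }

    public static void main(String[] args) {
        RandomQueue<String> q = new RandomQueue<>();
        q.add("abb"); q.add("abb"); q.add("aba");
        System.out.println(q.sample());   // abb with probability 2/3
    }
}
```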
Markov chain data type. Create a data type MarkovChain to represent a Markov chain of strings. The data type must implement the following API:
public class MarkovChain  //  Markov chain
------------------------------------------------------------------------------------------
       MarkovChain()                        //  create an empty Markov chain
  void addTransition(String v, String w)    //  add a transition from state v to state w
String next(String v)                       //  pick a transition leaving state v, uniformly at random, and return the resulting state
String toString()                           //  return a string representation of this Markov chain

Your implementation of MarkovChain will be similar to GraphLite.java, except that (i) the Markov chain is directed and (ii) it allows parallel arcs (more than one edge between two vertices). Accordingly, use a symbol table of random string queues (key = String, value = RandomQueue<String>) instead of a symbol table of sets (key = String, value = SET<String> of neighboring vertices).
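The following sketch shows the shape of this data type; it substitutes java.util classes for ST and RandomQueue so the example is self-contained, and is named MarkovChainSketch to distinguish it from the class you must submit:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// sketch of the MarkovChain API using a symbol table of successor lists
public class MarkovChainSketch {
    private final Map<String, ArrayList<String>> adj = new HashMap<>();
    private final Random rng = new Random();

    public void addTransition(String v, String w) {
        adj.computeIfAbsent(v, s -> new ArrayList<>()).add(w);  // parallel arcs are kept
    }

    // pick a transition leaving state v uniformly at random
    public String next(String v) {
        ArrayList<String> out = adj.get(v);
        return out.get(rng.nextInt(out.size()));
    }

    public String toString() {
        return adj.toString();
    }

    public static void main(String[] args) {
        MarkovChainSketch mc = new MarkovChainSketch();
        mc.addTransition("bab", "abb");
        mc.addTransition("bab", "abb");
        mc.addTransition("bab", "abb");
        mc.addTransition("bab", "aba");
        System.out.println(mc.next("bab"));   // abb with probability 3/4
    }
}
```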
Text generation. Implement a client program TextGenerator.java that takes two command-line arguments k and M, reads the text from standard input, builds the Markov chain associated with the order k Markov model, and prints out M pseudo-random characters by simulating a trajectory.
For the trajectory above with k = 3 and M = 12, the program prints the start state bbb, followed by the last character of each subsequent state, one character per transition:
bbb -> bbb -> bba -> bab -> abb -> bbb -> bbb -> bba -> bab -> abb
bbb     b      a      b      b      b      b      a      b      b
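The generation loop can be sketched as follows; the input is hardcoded here instead of read from standard input, and the text is treated as circular (a common convention) so that every state is guaranteed at least one outgoing transition:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// sketch of the TextGenerator logic: build the order k chain, then walk it,
// emitting the start state followed by one character per transition
public class TextGeneratorSketch {
    static String generate(String text, int k, int M, Random rng) {
        Map<String, ArrayList<String>> chain = new HashMap<>();
        String circular = text + text.substring(0, k);   // wrap around the end
        for (int i = 0; i < text.length(); i++) {
            String v = circular.substring(i, i + k);
            String w = circular.substring(i + 1, i + k + 1);
            chain.computeIfAbsent(v, s -> new ArrayList<>()).add(w);
        }
        String state = text.substring(0, k);             // start state
        StringBuilder out = new StringBuilder(state);
        while (out.length() < M) {
            ArrayList<String> succ = chain.get(state);
            state = succ.get(rng.nextInt(succ.size()));  // random transition
            out.append(state.charAt(k - 1));             // emit one new character
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(generate("bbbabbabbbbaba", 3, 12, new Random()));
    }
}
```

Your submitted TextGenerator.java should instead take k and M from the command line and read the text with StdIn, per the API above.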
Amusement. Once you get the program working, you will find that the random text generated looks more and more like the original text as you increase the order, as illustrated in the examples below. There are many more apparent pearls of wisdom in the full texts on the Web. As you can see, there are limitless opportunities for amusement here. Try your model on some of your own text, or find something interesting on the net. You need not limit yourself to the English language.
Deliverables. Submit MarkovChain.java, RandomQueue.java, TextGenerator.java, readme.txt, and any other supporting files that your program needs besides ST.java, StdIn.java, and StdOut.java. Also, include two of the most entertaining language-modeling fragments that you discover in your readme.txt.
This assignment was developed by Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.