Design and implement functions that build an order K Markov model from a piece of input text. Markov models are popular tools in speech recognition, handwriting recognition, information retrieval, and data compression. For this assignment, you will consider a different application: generating stylized pseudo-random text.
Markov models. An order 0 Markov model predicts that each character in the alphabet occurs with a fixed probability, independent of previous characters. Given an input text, you compute the Markov model of order 0 by counting up the number of occurrences of each letter in the input, and use these as the frequencies. For example, if the input text is "agggcagcgggcg", then the order 0 Markov model predicts that each character is a with probability 2/13, c with probability 3/13, and g with probability 8/13.
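Computing order 0 frequencies takes only a few lines. Here is a sketch; the class name and the use of a TreeMap are conveniences for illustration, not requirements of the assignment:

```java
import java.util.TreeMap;

public class OrderZero {
    // Count how often each character occurs. The order 0 Markov model
    // uses these counts, divided by the text length, as probabilities.
    static TreeMap<Character, Integer> frequencies(String text) {
        TreeMap<Character, Integer> freq = new TreeMap<>();
        for (char c : text.toCharArray())
            freq.merge(c, 1, Integer::sum);   // add 1 to this character's count
        return freq;
    }

    public static void main(String[] args) {
        // the example text: a occurs 2 times, c 3 times, g 8 times
        System.out.println(frequencies("agggcagcgggcg"));
    }
}
```

Dividing each count by 13 (the text length) gives the probabilities 2/13, 3/13, and 8/13 from the example.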
Characters in English text are not independent. If the previous 5 characters in a string are orith, then the next letter is almost surely m. An order K Markov model predicts that each character occurs with a fixed probability, but that probability can depend on the previous K characters. Given an input text, you compute a Markov model of order K by counting up the number of occurrences of each letter that follows each sequence of K letters. For example, if the text has 100 occurrences of th, with 50 occurrences of the, 25 occurrences of thi, 20 occurrences of tha, and 5 occurrences of tho, the order 2 Markov model predicts that the next character following th is e with probability 1/2, i with probability 1/4, a with probability 1/5, and o with probability 1/20.
Part 0: Markov model. Create a data type Markov to represent a K-character substring. Ultimately, it will have a method random that returns a random character according to the Markov model. For now, just make it store the substring and an integer that counts the number of times the substring appears. You will need a constructor, a method to increment the frequency count, and the usual toString method for output:

public Markov(String substring)
public void add()
public String toString()

Part 1: frequency counts. Implement a client program FrequencyCounter.java that reads the order parameter K of the Markov model from the command line and a text string from standard input, and uses a symbol table to insert each K-character substring (key) from the text. For example, if K is 2 and the input string is "agggcagcgggcg", then your client program should create Markov objects for each of the 5 distinct keys, and call the add method 12 times in total: ag gg gg gc ca ag gc cg gg gg gc cg. Maintain an integer count of the number of occurrences of each key. Use your symbol table's methods to print out the number of distinct keys and the number of times each key appears in the text. For the example above, your program should output:
5 distinct keys
2 ag
1 ca
2 cg
3 gc
4 gg
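One possible shape for the Part 0 data type is sketched below. Starting the count at zero, so that add() is called once per occurrence (12 times in the example), is an assumption; initializing it to one in the constructor is an equally valid design, as long as the client calls add consistently.

```java
// Sketch of the Part 0 Markov data type: a K-character substring
// together with the number of times it has been seen.
public class Markov {
    private final String substring;  // the K-character key
    private int count;               // occurrences seen so far (starts at 0)

    public Markov(String substring) {
        this.substring = substring;
    }

    // Record one more occurrence of the substring.
    public void add() {
        count++;
    }

    // "count key" format, matching the FrequencyCounter sample output.
    public String toString() {
        return count + " " + substring;
    }

    public static void main(String[] args) {
        Markov m = new Markov("gg");
        for (int i = 0; i < 4; i++) m.add();  // gg occurs 4 times in the example
        System.out.println(m);
    }
}
```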
Part 2: language model. To generate random text, given a K-character key, your data type Markov must know all of the letters that follow the K-character key. This operation is at the crux of the matter, as you will need it to generate random characters in accordance with the Markov model. Modify your Markov data type so that in addition to the frequency counts, it records the breakdown depending on the next letter. Create your own linked list to keep track of the list of suffix characters along with their frequencies. Modify the toString method so that it prints out the list of suffixes, along with the substring and the frequency count. Include the following method to insert a suffix character.
public void add(char c)
Modify FrequencyCounter.java to create a program LanguageModeler.java that inserts keys into the symbol table (if necessary) and calls add(char c) to record the appropriate suffix characters in the Markov model. It should produce the following output on the example input.
To make things look pretty, convert all whitespace characters into spaces or underscores. Note that since the last cg substring doesn't have a "next" character, we don't include it in the model.

5 distinct keys
2 ag: 1 c 1 g
1 ca: 1 g
1 cg: 1 g
3 gc: 1 a 2 g
4 gg: 2 c 2 g
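The suffix list can be sketched with a hand-rolled linked list like the one below. Keeping the list in alphabetical order, so that toString matches the sample output, is a convenience choice in this sketch, not a requirement; the node and field names are illustrative.

```java
// Sketch of the Part 2 Markov data type: total count plus a linked
// list of suffix characters with their individual frequencies.
public class Markov {
    private class Node {
        char suffix;   // a character seen immediately after the substring
        int freq;      // how many times that character followed it
        Node next;
        Node(char c) { suffix = c; freq = 1; }
    }

    private final String substring;
    private int count;       // total occurrences of the substring
    private Node suffixes;   // head of the suffix list, kept alphabetical

    public Markov(String substring) {
        this.substring = substring;
    }

    // Record one occurrence of the substring followed by character c.
    public void add(char c) {
        count++;
        Node prev = null;
        for (Node n = suffixes; n != null; prev = n, n = n.next) {
            if (n.suffix == c) { n.freq++; return; }  // existing suffix
            if (n.suffix > c) break;                  // insertion point found
        }
        Node node = new Node(c);                      // first time seeing c
        if (prev == null) { node.next = suffixes; suffixes = node; }
        else             { node.next = prev.next; prev.next = node; }
    }

    // "count key: freq1 c1 freq2 c2 ..." as in the LanguageModeler output.
    public String toString() {
        StringBuilder sb = new StringBuilder(count + " " + substring + ":");
        for (Node n = suffixes; n != null; n = n.next)
            sb.append(" ").append(n.freq).append(" ").append(n.suffix);
        return sb.toString();
    }

    public static void main(String[] args) {
        Markov ag = new Markov("ag");
        ag.add('g');   // from "agg..."
        ag.add('c');   // from "...agc..."
        System.out.println(ag);
    }
}
```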
Part 3: pseudo-random text generation. Add a method random to Markov that returns a pseudo-random character according to the language model. Be sure to get the probabilities right, as we will be checking this.
Now, write a program TextGenerator.java that takes a command-line input K, a command-line input M, reads in a text string from standard input, and prints out M characters according to the order K Markov model. Start by printing the first K characters of the original text. Then, repeatedly generate successive pseudo-random characters. Using the example above, if the Markov object m represents the substring "gg", then m.random() should return c or g, each with probability 1/2. After you generate a character, move over one character position, always using the last K characters generated to determine the probabilities for the next. For example, if your program chooses c in the example above, then the next Markov object would represent the substring "gc", and according to the Markov model, the next character should be a with probability 1/3 and g with probability 2/3. Continue the process until you have output M characters. If the language model contains fewer than 100 K-tuples (prefixes), then print the language model (as in LanguageModeler.java) before you output the M randomly generated characters.
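The heart of random is a weighted choice: draw a uniform integer in [0, count) and walk the suffix frequencies until it is spent, so each suffix is returned with probability freq/count. The sketch below replaces the linked list with parallel arrays purely for brevity; the class name, the arrays, and the shared Random instance are all illustrative assumptions.

```java
import java.util.Random;

// Sketch of the Part 3 random() method: return a suffix character
// with probability (its frequency) / (total count).
public class MarkovRandom {
    private static final Random RNG = new Random();
    private final char[] suffixes;  // stand-in for the suffix linked list
    private final int[] freqs;      // frequency of each suffix
    private final int count;        // sum of the frequencies

    public MarkovRandom(char[] suffixes, int[] freqs) {
        this.suffixes = suffixes;
        this.freqs = freqs;
        int c = 0;
        for (int f : freqs) c += f;
        this.count = c;
    }

    public char random() {
        int r = RNG.nextInt(count);             // uniform in [0, count)
        for (int i = 0; i < suffixes.length; i++) {
            if (r < freqs[i]) return suffixes[i];
            r -= freqs[i];                      // skip past this suffix's share
        }
        throw new IllegalStateException();      // unreachable: freqs sum to count
    }

    public static void main(String[] args) {
        // the "gg" entry from the example: 2 c's and 2 g's observed,
        // so random() returns c or g, each with probability 1/2
        MarkovRandom gg = new MarkovRandom(new char[]{'c', 'g'}, new int[]{2, 2});
        for (int i = 0; i < 10; i++) System.out.print(gg.random());
        System.out.println();
    }
}
```

In TextGenerator, each generated character is appended to the output and the key for the next draw is simply the last K characters generated so far.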
Deliverables. Submit your code as usual, along with one or two of the most amusing language-modeling examples that you come up with. Describe your data structure and algorithms in detail, and analyze the most important performance characteristics of your program in the readme.txt file.
Amusement. Once you get the program working, you will find that the random text starts to sound more and more like the original text as you increase the order, as illustrated in the examples below. There are many more apparent pearls of wisdom in the full texts on the web. As you can see, there are limitless opportunities for amusement here. Try your model on some of your own text, or find something interesting on the net. You may wish to write small filters to clean up white space for pure text, but be careful, as the special characters sometimes drive the model, as illustrated in the Shakespeare example.
This assignment was developed by Bob Sedgewick and Kevin Wayne, and is based on a classic idea of Claude Shannon from the 1940s.