Programming Assignment Checklist: Markov Model of Natural Language


Pair programming. On this assignment, just like the last one, you are encouraged (not required) to work with a partner provided you practice pair programming (same rules as last assignment).

Frequently Asked Questions

What are the main goals of this assignment? Use a symbol table, learn about natural language processing, and learn about Markov models.

What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a University setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.

Do I have to implement the API as specified? Yes, or you risk losing a substantial number of points.

How do I read in characters from standard input? Use StdIn.readAll().

Given a string s, is there an efficient way to extract a substring? Use s.substring(i, i+k) to get the k-character substring starting at i. In Java, this operation takes constant time, regardless of the size of k.

How do I emulate the behavior of a circular string? There are a number of approaches. One way is to concatenate the first k characters to the end of the input text.

How do I convert a char to an int? A char is a 16-bit (unsigned) integer. Java will automatically promote a char to an int if you use it as an index into an array.

Should my program generate a different output each time I run it? Yes.

My random text ends in the middle of a sentence. Is that OK? Yes, that's to be expected. It's a good idea to add a System.out.println(); statement after the last character you print out.

For which values of k should my program work? It should work for all well-defined values of k from and including 0 to and including the length of the input text. Naturally, as k gets larger, your program will use more memory and take longer.

I get an OutOfMemoryException. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the 64MB default.

java -Xmx100m TextGenerator 7 1000 < input.txt
The 100m means 100MB, and you should adjust this number depending on the size of the input text.

Testing

Thoroughly test your MarkovModel. We provide a main as a start to your testing.

public static void main(String[] args) {
    MarkovModel model = new MarkovModel("i am sam. sam i am", 3);

    System.out.println( "Frequency of blank after 'sam' is " 
                        + model.freq("sam", ' ') );
    System.out.println( "Frequency of period after 'sam' is " 
                        + model.freq("sam", ' ') );
    System.out.println( "Frequency of 'mi ' is " + model.freq("mi ") );
    System.out.println( "Frequency of 'sam' is " + model.freq("sam") );
    System.out.println( );

    String text = "now is the time. now is the time to eat. " 
                  + "now is the time to live.";
    model = new MarkovModel(text, 7);

    System.out.println( "Frequency of blank after 'now is '" 
                        + model.freq("now is ", ' ') );
    System.out.println( "Frequency of t after 'now is '" 
                        + model.freq("now is ", 't') );
    System.out.println( "Frequency of 'now is ' is " 
                        + model.freq("now is ") );
}
If your method is working properly, you will get the following output:
Frequency of blank after 'sam' is 1
Frequency of period after 'sam' is 1
Frequency of 'mi ' is 1
Frequency of 'sam' is 2

Frequency of blank after 'now is ' 0
Frequency of t after 'now is ' 3
Frequency of 'now is ' is 3

An order-0 Markov model generates a random sequence of letters where each letter appears with probability proportional to its frequency in the input text. For input17.txt there are 9 g's, 7 a's, and 1 c. So we want the probability of generating a "g" to be 9/17 = 0.53..., an "a" to be 7/17 = 0.41..., and a "c" to be be 1/17 = 0.06... In a sequence of 100 characters, we'd therefore expect on average about 53 g's, 41 a's, and 6 c's.

java TextGenerator 0 100 < input17.txt

gaaagaacagcagacgacggaagaaggaggaaaaggaggggaggggggaggaggaagggagaaaggagacagcggaggggacgggaggagaggaggagag

For input17.txt, the probability of the next character being "a" after "ga" is 0.20 while the probability of "g" is 0.80. If you run the following command 10 times, on average we'd expect to see "gag" 8 times and "gaa" 2 times.

java TextGenerator 2 3 < input17.txt

gag

Possible Progress Steps

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download the directory markov. It contains a number of sample text files. It also contains ST.java, StdRandom.java, along with this assignment's readme.txt template.

  2. Review the material in the textbook on symbol tables as well as the IntegerSort.java example in 4.2 of the booksite, creative exercise 20.

  3. Create an instance variable for your symbol table where the key is a string and the value is an integer array. Each entry of the integer array is a count of how many times one character appears after the key in the text.

  4. Write your constructor to create the circular version of the text string. Then initialize and populate your symbol table, using the symbol table methods contains(), get() and put().

  5. Using only your symbol table and no other instance variables, write the freq() methods.

  6. Using the main() provided above test this part of your MarkovModel data type before continuing. You may wish to add more testing of the constructor and freq() methods.

  7. Write the rand() method. To generate a random character with probability proportional to its frequency you may call or refer to StdRandom.discrete() on p. 226 of the textbook.

  8. It may not be obvious from your final results if rand() is not working as you intended, so be sure to test it thoroughly. Then test your complete MarkovModel data type before continuing.

  9. Write the client program TextGenerator.java. Create an instance of the MarkovModel class with the string read from StdIn. Print out the first kgram. Then write a loop to generate the remaining character, one character at a time using the methods from MarkovModel.

Enrichment