Programming Assignment Checklist: Markov Model of Natural Language

Frequently Asked Questions (general)

What are the main goals of this assignment? Use a symbol table, learn about natural language processing, and learn about Markov models.

What preparation do I need before beginning this assignment? Review the beginning of Section 4.4 (pages 608–617) on using symbol tables in client code. For reference, here is a partial API of the ST data type. Also review the material on the textbook on parameterized data type and generics (pages 566–570). The Key type can be any comparable type, such as String or Integer.

What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a University setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.

Do I have to implement the prescribed API? Yes, or you will lose a substantial number of points.

Frequently Asked Questions (MarkovModel)

My program won't compile because the compiler can't find the symbol "ST." How can I fix that? The ST data type is not part of Java or stdlib.jar. You need to download ST.java—it's part of markov.zip—and put it in the same directory as MarkovModel.java.

Can I use java.util.TreeMap or java.util.HashMap for my symbol table instead of ST? No, please use ST.

I want to refer to the magic number 128 in more than one method. Any advice? You can define a global constant, as follows:

private static final int ASCII = 128;
This code should appear inside the class block, but before any instance variables or methods. By convention, constant variable names should be ALL_UPPERCASE. The static modifier means that every instance method will refer to the same variable (as opposed to instance variables, for which there is one variable per object); the final modifier means that you can't change its value, once initialized.

How do I extract a substring of a given string? The method call s.substring(i, i+k) returns the the k-character substring of s, starting at i. Note that it includes the left endpoint but excludes the right endpoint.

How do I emulate the behavior of a circular string? There are a number of approaches. One way is to append the first k characters of the input text to the input text.

How do I convert a char to an int? A char is a 16-bit (unsigned) integer. Java will automatically promote a char to an int if you use it as an index into an array.

How do I specify the newline character? Use either the character literal '\n' or the string literal "\n".

How do I use StdRandom.discrete()? There are two overloaded methods named StdRandom.discrete().

Frequently Asked Questions (TextGenerator)

Should my program generate a different output each time I run it? Yes.

How can I read in the input text from standard input? Use StdIn.readAll(). Do not remove whitespace.

My random text ends in the middle of a sentence. Is that OK? Yes, that's to be expected.

After executing the program, the commmand prompt appears on the same line as the random text. Is that OK? Yes. It's because the random text does not end with a newline character. If you want to add a call to StdOut.println() at the end of your program, that's fine—we promise not to deduct.

For which values of k should my program work? It should work for all well defined values of k, from 0 up to, and including, the length of the input text. As k gets larger, your program will use more memory and take longer.

My program is slow when T is large because I am concatenating the T characters, one at a time, to form string of length T. What else can I do? Do you need to form the entire string? Why not print the characters, one at a time, as you generate them?

I get an OutOfMemoryException. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the default.

% java-introcs -Xmx500m TextGenerator 7 1000 < input.txt
The 500m means 500MB, and you should adjust this number depending on the size of the input text.

Testing (MarkovModel)

Thoroughly test your MarkovModel. We provide a main() as a start to your testing.

    public static void main(String[] args) {
        String text1 = "banana";
        MarkovModel model1 = new MarkovModel(text1, 2);
        StdOut.println("freq(\"an\", 'a')    = " + model1.freq("an", 'a'));
        StdOut.println("freq(\"na\", 'b')    = " + model1.freq("na", 'b'));
        StdOut.println("freq(\"na\", 'a')    = " + model1.freq("na", 'a'));
        StdOut.println("freq(\"na\")         = " + model1.freq("na"));
        StdOut.println();

        String text3 = "one fish two fish red fish blue fish"; 
        MarkovModel model3 = new MarkovModel(text3, 4);
        StdOut.println("freq(\"ish \", 'r') = " + model3.freq("ish ", 'r'));
        StdOut.println("freq(\"ish \", 'x') = " + model3.freq("ish ", 'x'));
        StdOut.println("freq(\"ish \")      = " + model3.freq("ish "));
        StdOut.println("freq(\"tuna\")      = " + model3.freq("tuna"));
    }
If your method is working properly, you will get the following output:
% java-introcs MarkovModel
freq("an", 'a')    = 2
freq("na", 'b')    = 1
freq("na", 'a')    = 0
freq("na")         = 2

freq("ish ", 'r') = 1
freq("ish ", 'x') = 0
freq("ish ")      = 3
freq("tuna")      = 0

To test random(), write a loop that calls random() repeatedly, and count how many times a particular character is returned. For example, with model1 above, random("na") should return 'b' about one-half of the time; with model2 above, random("fish") should return 'o' about one-quarter of the time.

Testing (TextGenerator)

Be sure to test TextGenerator with different values of k.

Possible Progress Steps (MarkovModel)

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Download and unzip markov.zip. Make sure that ST.java and the testing .txt files are in the same directory as the .java files you are writing.

  2. Review the material in the textbook on symbol tables as well as IntegerSort.java.

  3. Create one or more instance variables to support the two freq() methods. One strategy is to maintain two symbol tables—one for the one-argument freq() method and one for the two-argument freq() method.

  4. Write the constructor to create the circular version of the input text. Then initialize and populate your symbol tables, using the symbol-table methods contains(), get(), and put(). This will be a substantial amount of code, relative to the other methods in this class.

    Do not save the original text (or the circular text) as an instance variable because no instance method will need this information after the symbol table is initialized.

  5. Test: In the main(), try creating some MarkovModel objects. For example:
            String text1 = "banana";
            MarkovModel model1 = new MarkovModel(text1, 2);
            ...
            String text2 = "gagggagaggcgagaaa";
            MarkovModel model2 = new MarkovModel(text2, 2);
    

  6. Write the toString() method. Use the enhanced for loop to access each key–value pair in your symbol table:
            // st2 is the second symbol table (corresponding to the two-argument freq() method)
            String result = "";
            for (String key : st2) {
                result += key + ": ";
         
                // get the character frequency array
                int[] frequency = st2.get(key);
    
                // for each non-zero entry, append the character and the frequency
                ...          
    
                // append a newline character
                result += "\n";
             }
    

  7. Test the constructor and toString() method: This can help you debug small test cases. In the main(), print some MarkovModel objects. For example:
            String text1 = "banana";
            MarkovModel model1 = new MarkovModel(text1, 2);
            StdOut.println(model1);
            ...
            String text2 = "gagggagaggcgagaaa";
            MarkovModel model2 = new MarkovModel(text2, 2);
            StdOut.println(model2);
    
    Sample output for model1:
    ab: a 1 
    an: a 2 
    ba: n 1 
    na: b 1 n 1 
    
    Sample output for model2:
    aa: a 1 g 1 
    ag: a 3 g 2 
    cg: a 1 
    ga: a 1 g 4 
    gc: g 1 
    gg: a 1 c 1 g 1 
    

  8. Write the order() method. This should be a one-liner.

  9. Using the symbol-table instance variables, write the two freq() methods.

  10. Use the main() provided above to test your code before continuing. You should add more testing of the constructor, order(), and freq() methods.

  11. Write the random() method. To generate a random character with probability proportional to its frequency you may call either of the two StdRandom.discrete() methods.

  12. It may not be obvious from your final results whether random() is working as intended, so be sure to test it thoroughly. Next, test your complete MarkovModel data type before continuing.

Possible Progress Steps (TextGenerator)

These are purely suggestions for how you might make progress. You do not have to follow these steps.

  1. Read in k and T from the command line; read the input text from standard input.

  2. Create a MarkovModel object of order k using the input text.

  3. To generate a trajectory of length T, use the first k characters in the input text as the initial k-gram and print the initial k-gram. Then, repeatedly generate and print a new random character according to the Markov model and update the k-gram to store the last k characters printed.

  4. Make sure to test your program on large inputs (we provide several), large outputs, and different values of k.

Enrichment