COS 126 Markov Model of Natural Language |
Programming Assignment Checklist Assignment Page |
|
What are the main goals of this assignment? Use a symbol table, learn about natural language processing, and learn about Markov models.
What preparation do I need before beginning this assignment? Review the beginning of Section 4.4 (pages 608–617) on using symbol tables in client code. For reference, here is a partial API of the ST data type. Also review the material on the textbook on parameterized data type and generics (pages 566–570). The Key type can be any comparable type, such as String or Integer.
What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a University setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.
Do I have to implement the prescribed API? Yes, or you will lose a substantial number of points.
|
My program won't compile because the compiler can't find the symbol "ST." How can I fix that? The ST data type is not part of Java or stdlib.jar. You need to download ST.java—it's part of markov.zip—and put it in the same directory as MarkovModel.java.
Can I use java.util.TreeMap or java.util.HashMap for my symbol table instead of ST? No, please use ST.
I want to refer to the magic number 128 in more than one method. Any advice? You can define a global constant, as follows:
This code should appear inside the class block, but before any instance variables or methods. By convention, constant variable names should be ALL_UPPERCASE. The static modifier means that every instance method will refer to the same variable (as opposed to instance variables, for which there is one variable per object); the final modifier means that you can't change its value, once initialized.private static final int ASCII = 128;
How do I extract a substring of a given string? The method call s.substring(i, i+k) returns the the k-character substring of s, starting at i. Note that it includes the left endpoint but excludes the right endpoint.
How do I emulate the behavior of a circular string? There are a number of approaches. One way is to append the first k characters of the input text to the input text.
How do I convert a char to an int? A char is a 16-bit (unsigned) integer. Java will automatically promote a char to an int if you use it as an index into an array.
How do I specify the newline character? Use either the character literal '\n' or the string literal "\n".
How do I use StdRandom.discrete()? There are two overloaded methods named StdRandom.discrete().
|
Should my program generate a different output each time I run it? Yes.
How can I read in the input text from standard input? Use StdIn.readAll(). Do not remove whitespace.
My random text ends in the middle of a sentence. Is that OK? Yes, that's to be expected.
After executing the program, the command prompt appears on the same line as the random text. Is that OK? Yes. It's because the random text does not end with a newline character. If you want to add a call to StdOut.println() at the end of your program, that's fine—we promise not to deduct.
For which values of k should my program work? It should work for all well defined values of k, from 0 up to, and including, the length of the input text. As k gets larger, your program will use more memory and take longer.
My program is slow when T is large because I am concatenating the T characters, one at a time, to form string of length T. What else can I do? Do you need to form the entire string? Why not print the characters, one at a time, as you generate them?
I get an OutOfMemoryException. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the default.
The 500m means 500MB, and you should adjust this number depending on the size of the input text.% java-introcs -Xmx500m TextGenerator 7 1000 < input.txt
|
Thoroughly test your MarkovModel. We provide a main() as a start to your testing.
If your method is working properly, you will get the following output:public static void main(String[] args) { String text1 = "banana"; MarkovModel model1 = new MarkovModel(text1, 2); StdOut.println("freq(\"an\", 'a') = " + model1.freq("an", 'a')); StdOut.println("freq(\"na\", 'b') = " + model1.freq("na", 'b')); StdOut.println("freq(\"na\", 'a') = " + model1.freq("na", 'a')); StdOut.println("freq(\"na\") = " + model1.freq("na")); StdOut.println(); String text3 = "one fish two fish red fish blue fish"; MarkovModel model3 = new MarkovModel(text3, 4); StdOut.println("freq(\"ish \", 'r') = " + model3.freq("ish ", 'r')); StdOut.println("freq(\"ish \", 'x') = " + model3.freq("ish ", 'x')); StdOut.println("freq(\"ish \") = " + model3.freq("ish ")); StdOut.println("freq(\"tuna\") = " + model3.freq("tuna")); }
% java-introcs MarkovModel freq("an", 'a') = 2 freq("na", 'b') = 1 freq("na", 'a') = 0 freq("na") = 2 freq("ish ", 'r') = 1 freq("ish ", 'x') = 0 freq("ish ") = 3 freq("tuna") = 0
To test random(), write a loop that calls random() repeatedly, and count how many times a particular character is returned. For example, with model1 above, random("na") should return 'b' about one-half of the time; with model2 above, random("fish") should return 'o' about one-quarter of the time.
|
Be sure to test TextGenerator with different values of k.
% java-introcs TextGenerator 0 100 < input17.txt gaaagaacagcagacgacggaagaaggaggaaaaggaggggaggggggaggaggaagggagaaaggagacagcggaggggacgggaggagaggaggagag
% java-introcs TextGenerator 2 3 < input17.txt gag
|
These are purely suggestions for how you might make progress. You do not have to follow these steps.
Do not save the original text (or the circular text) as an instance variable because no instance method will need this information after the symbol table is initialized.
String text1 = "banana"; MarkovModel model1 = new MarkovModel(text1, 2); ... String text2 = "gagggagaggcgagaaa"; MarkovModel model2 = new MarkovModel(text2, 2);
// st2 is the second symbol table (corresponding to the two-argument freq() method) String result = ""; for (String key : st2) { result += key + ": "; // get the character frequency array int[] frequency = st2.get(key); // for each non-zero entry, append the character and the frequency ... // append a newline character result += "\n"; }
Sample output for model1:String text1 = "banana"; MarkovModel model1 = new MarkovModel(text1, 2); StdOut.println(model1); ... String text2 = "gagggagaggcgagaaa"; MarkovModel model2 = new MarkovModel(text2, 2); StdOut.println(model2);
ab: a 1 an: a 2 ba: n 1 na: b 1 n 1Sample output for model2:
aa: a 1 g 1 ag: a 3 g 2 cg: a 1 ga: a 1 g 4 gc: g 1 gg: a 1 c 1 g 1
|
These are purely suggestions for how you might make progress. You do not have to follow these steps.
|