Pair programming.
On this assignment, just like the last one,
you are encouraged (not required) to work with
a partner provided you practice pair programming
(same rules as last assignment).
What are the main goals of this assignment? Use a symbol table, learn about natural language processing, and learn about Markov models.
What is the origin of the Markov text generator? It was first described by Claude Shannon in 1948. The first computer version was apparently written by Don P. Mitchell, adapted by Bruce Ellis, and popularized by A. K. Dewdney in the Computing Recreations section of Scientific American. Brian Kernighan and Rob Pike revived the program in a university setting and described it as an example of design in The Practice of Programming. The program is also described in Jon Bentley's Programming Pearls.
Do I have to implement the prescribed API? Yes, or you will lose a substantial number of points.
How do I read in the input text from standard input? Use StdIn.readAll().
Given a string s, is there an efficient way to extract a substring? Use s.substring(i, i+k) to get the k-character substring starting at i.
How do I emulate the behavior of a circular string? There are a number of approaches. One way is to concatenate the first k characters to the end of the input text.
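As a minimal sketch of the two answers above (substring extraction and the concatenation approach to circularity), the hypothetical helper kgram() below pads the text with its first k characters and then calls substring(). The class and method names are illustrative only and are not part of the prescribed API; it assumes 0 <= k <= text.length() and 0 <= i < text.length().

```java
public class CircularText {
    // Returns the k-character substring starting at index i,
    // treating text as circular (the end wraps around to the start).
    // Assumes 0 <= k <= text.length() and 0 <= i < text.length().
    public static String kgram(String text, int i, int k) {
        // Append the first k characters so that a window starting
        // near the end never runs past the end of the padded string.
        String padded = text + text.substring(0, k);
        return padded.substring(i, i + k);
    }

    public static void main(String[] args) {
        String text = "abcd";
        for (int i = 0; i < text.length(); i++) {
            System.out.println(kgram(text, i, 2));  // ab, bc, cd, da
        }
    }
}
```

In a real solution you would likely build the padded string once rather than on every call; it is rebuilt here only to keep the sketch short.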
How do I convert a char to an int? A char is a 16-bit (unsigned) integer. Java will automatically promote a char to an int if you use it as an index into an array.
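For illustration, the sketch below counts character occurrences by using each char directly as an array index. The class name, the method name, and the array size of 128 (i.e., an ASCII-only input) are assumptions made for this example.

```java
public class CharCounts {
    // Assumption for this sketch: the input contains only ASCII characters.
    private static final int ASCII = 128;

    // Returns an array in which counts[c] is the number of times
    // the character c appears in text.
    public static int[] count(String text) {
        int[] counts = new int[ASCII];
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            counts[c]++;  // c is promoted to int when used as an index
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = count("gagggagaggcgagaaa");
        System.out.println("g: " + counts['g']);
        System.out.println("a: " + counts['a']);
        System.out.println("c: " + counts['c']);
    }
}
```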
How do I use StdRandom.discrete() to pick the next character? The argument to StdRandom.discrete() needs to be a double[] holding the probabilities of each character being next. StdRandom.discrete() will return an integer which is the index of the random pick based on those probabilities. That index is the integer value of the next character.
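The sketch below walks through that idea end to end. To stay self-contained it does not call StdRandom; instead it defines a stand-in discrete() built on java.util.Random that behaves as described above (it returns index i with probability probabilities[i]). The stand-in, the pick() helper, and the 128-entry count array are assumptions for illustration, not the library's code or the required design.

```java
import java.util.Random;

public class NextChar {
    // Stand-in for StdRandom.discrete(), written for this sketch only:
    // returns index i with probability probabilities[i].
    public static int discrete(double[] probabilities, Random rng) {
        double r = rng.nextDouble();   // uniform in [0, 1)
        double cumulative = 0.0;
        int lastPositive = -1;         // fallback if rounding leaves r >= the total
        for (int i = 0; i < probabilities.length; i++) {
            if (probabilities[i] > 0.0) {
                lastPositive = i;
                cumulative += probabilities[i];
                if (r < cumulative) return i;
            }
        }
        return lastPositive;
    }

    // Converts per-character counts into probabilities and picks a character.
    // Assumes at least one count is positive.
    public static char pick(int[] counts, Random rng) {
        int total = 0;
        for (int count : counts) total += count;
        double[] probabilities = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            probabilities[i] = (double) counts[i] / total;
        }
        // The returned index is the integer value of the chosen character.
        return (char) discrete(probabilities, rng);
    }

    public static void main(String[] args) {
        int[] counts = new int[128];
        counts['a'] = 1;
        counts['g'] = 4;
        System.out.println(pick(counts, new Random()));  // 'g' about 4 times in 5
    }
}
```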
My rand() method calls StdRandom.discrete() as recommended in the Possible Progress Steps, but I get the following error message when I run: java.lang.AssertionError. What does that mean? It means that the probabilities don't sum up to 1. Double check how you are computing the values for the array you send to StdRandom.discrete(). The array elements are the probabilities of each possible event, so the sum of the array elements should be 1. (To learn how to use assertions, see pp. 446-447.)
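One possible cause (an assumption; your bug may differ) is integer division when converting counts to probabilities: in Java, 1 / 5 evaluates to 0. The sketch below contrasts the two computations and checks the sum. The class and method names and the 1e-9 tolerance are chosen for this example only.

```java
public class ProbabilityCheck {
    // Buggy version: counts[i] / total is integer division, so every
    // entry smaller than total becomes 0.0.
    public static double[] withIntegerDivision(int[] counts, int total) {
        double[] p = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            p[i] = counts[i] / total;
        }
        return p;
    }

    // Corrected version: the cast forces floating-point division.
    public static double[] withDoubleDivision(int[] counts, int total) {
        double[] p = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            p[i] = (double) counts[i] / total;
        }
        return p;
    }

    // Returns true if the entries sum to 1, up to a small rounding tolerance.
    public static boolean sumsToOne(double[] p) {
        double sum = 0.0;
        for (double x : p) sum += x;
        return Math.abs(sum - 1.0) < 1e-9;
    }

    public static void main(String[] args) {
        int[] counts = { 1, 4 };
        System.out.println(sumsToOne(withIntegerDivision(counts, 5)));  // false
        System.out.println(sumsToOne(withDoubleDivision(counts, 5)));   // true
    }
}
```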
Should my program generate a different output each time I run it? Yes.
My random text ends in the middle of a sentence. Is that OK? Yes, that's to be expected. We recommend using a StdOut.println(); statement to ensure that there is a new line generated after the last character.
For which values of k should my program work? It should work for all well-defined values of k from 0 up to the length of the input text, inclusive. Naturally, as k gets larger, your program will use more memory and take longer.
I get an OutOfMemoryError. How do I tell Java to use more of my computer's memory? Depending on your operating system, you may have to ask the Java Virtual Machine for more main memory beyond the default.
% java -Xmx100m TextGenerator 7 1000 < input.txt
The 100m means 100MB, and you should adjust this number depending on the size of the input text.
What is a StringBuilder object? StringBuilder is part of the standard Java library. It is an object that we use because of its more efficient handling of large strings. Here is a subset of the StringBuilder API with some methods you might find useful for this assignment. You do not need to use all of them.
public class StringBuilder
-------------------------------------------------------------------------------------------------------
StringBuilder(String s)          // create a StringBuilder initialized to the contents of String s
StringBuilder append(char c)     // append the string representation of the char to this sequence
String toString()                // returns a String representing the data in this sequence
String substring(int i, int j)   // returns the length (j-i) substring including position i,
                                 // up to and excluding position j (like String.substring())
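A short sketch of these methods in use follows. The class and helper names are hypothetical, and the example also calls StringBuilder.length(), a standard method not listed in the subset above.

```java
public class BuilderDemo {
    // Starts from an initial string and appends characters one at a time.
    public static String appendChars(String start, String extra) {
        StringBuilder sb = new StringBuilder(start);
        for (int i = 0; i < extra.length(); i++) {
            sb.append(extra.charAt(i));
        }
        return sb.toString();
    }

    // Returns the last k characters currently held in the builder.
    // Assumes 0 <= k <= sb.length().
    public static String lastK(StringBuilder sb, int k) {
        return sb.substring(sb.length() - k, sb.length());
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder("ga");
        sb.append('g');
        sb.append('a');
        System.out.println(sb.toString());  // gaga
        System.out.println(lastK(sb, 3));   // aga
    }
}
```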
Thoroughly test your MarkovModel. We provide a main() as a start to your testing.
public static void main(String[] args) {
    MarkovModel mod1 = new MarkovModel("i am sam. sam i am", 3);
    StdOut.println("freq(\"sam\", ' ') = " + mod1.freq("sam", ' '));
    StdOut.println("freq(\"sam\", '.') = " + mod1.freq("sam", '.'));
    StdOut.println("freq(\"mi \") = " + mod1.freq("mi "));
    StdOut.println("freq(\"sam\") = " + mod1.freq("sam"));
    StdOut.println();
    String text = "now is the time. now is the time to eat. "
                + "now is the time to live.";
    MarkovModel mod2 = new MarkovModel(text, 7);
    StdOut.println("freq(\"now is \", ' ') = " + mod2.freq("now is ", ' '));
    StdOut.println("freq(\"now is \", 't') = " + mod2.freq("now is ", 't'));
    StdOut.println("freq(\"now is \") = " + mod2.freq("now is "));
}
If your method is working properly, you will get the following output:
% java MarkovModel
freq("sam", ' ') = 1
freq("sam", '.') = 1
freq("mi ") = 1
freq("sam") = 2

freq("now is ", ' ') = 0
freq("now is ", 't') = 3
freq("now is ") = 3
Note that this does not test your rand() or gen() methods.
You can use print statements to test rand() to make sure that each non-zero entry of the array passed to StdRandom.discrete() is accurate. (But make sure to remove these print statements from your rand() method before submitting your code.)
To test gen(), use a text that has no repetition (e.g., "abc") and set order k=1. This should only be able to generate text where "a" is followed by "b", "b" is followed by "c", "c" is followed by "a". Then use a text with easy to compute repetition (e.g., "abac"). It should generate "a" followed half the time by "b" and half the time by "c". "b" or "c" should always be followed by "a".
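One way to automate the first part of this check is sketched below: a hypothetical helper that verifies every adjacent pair of characters in the generated text also occurs in the circular source text. It covers only order k=1 and only which transitions are possible, not how often each occurs. The names are illustrative and not part of the prescribed API.

```java
public class GenCheck {
    // Returns true if every adjacent character pair in generated also
    // appears as an adjacent pair in the circular version of source.
    // Assumes source is non-empty; intended for order k = 1 only.
    public static boolean transitionsAllowed(String source, String generated) {
        // Wrap around so the last character is followed by the first.
        String circular = source + source.charAt(0);
        for (int i = 0; i + 1 < generated.length(); i++) {
            String pair = generated.substring(i, i + 2);
            if (!circular.contains(pair)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(transitionsAllowed("abc", "abcabca"));   // true
        System.out.println(transitionsAllowed("abc", "acb"));       // false
        System.out.println(transitionsAllowed("abac", "abacaba"));  // true
    }
}
```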
An order-0 Markov model generates a random sequence of letters where each letter appears with probability proportional to its frequency in the input text. For input17.txt there are 9 g's, 7 a's, and 1 c. So we want the probability of generating a 'g' to be 9/17, an 'a' to be 7/17, and a 'c' to be 1/17. In a sequence of 100 characters, we'd therefore expect on average about 53 g's, 41 a's, and 6 c's.
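The expected counts follow from multiplying each probability by the output length. A small sketch of that arithmetic (the class and method names are hypothetical):

```java
public class ExpectedCounts {
    // Expected number of occurrences of a character that appears
    // count times in an input of total characters, when generating
    // length characters with an order-0 model.
    public static double expected(int count, int total, int length) {
        return (double) count / total * length;
    }

    public static void main(String[] args) {
        System.out.println("g: " + Math.round(expected(9, 17, 100)));  // g: 53
        System.out.println("a: " + Math.round(expected(7, 17, 100)));  // a: 41
        System.out.println("c: " + Math.round(expected(1, 17, 100)));  // c: 6
    }
}
```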
% java TextGenerator 0 100 < input17.txt
gaaagaacagcagacgacggaagaaggaggaaaaggaggggaggggggaggaggaagggagaaaggagacagcggaggggacgggaggagaggaggagag
For input17.txt, the next character after "ga" is 'a' with probability 1/5 and 'g' with probability 4/5. If you run the following command 10 times, you should expect (on average) to see "gag" 8 times and "gaa" 2 times.
% java TextGenerator 2 3 < input17.txt
gag
These are purely suggestions for how you might make progress. You do not have to follow these steps.
Note that backreferences are implemented by most "regular expression" engines, but they are not part of the formal mathematical definition of a regular expression because some expressions involving backreferences cannot be converted to equivalent DFAs.