COS 226 Programming Assignment #4
Program Report
Your program report should answer the following questions. It should
be submitted electronically using whiteboard along with the rest of your
program, in a file called either report.txt (if in plain ASCII text) or
report.pdf (if in PDF). Please do not submit in any other format.
- Explain your overall approach.
- Estimate, using big-Oh notation, the maximum possible number of unique
k-letter substrings in a sequence of length n over an alphabet of size
S. Give the best bound you can think of in terms only of k, n, and S.
You can regard k as a constant.
- Estimate, using big-Oh notation, how much time and space your program
requires to construct a Markov model from a data file of total size N
(possibly containing multiple text blocks), where k is the order
parameter and S is the alphabet size. Also estimate the time and space
required to compute the log likelihood of a new test string of length n
under such a model. You can regard k as a constant, and you can assume
that hash tables require constant time per operation and take up space
proportional to the number of keys they contain. (A minimal sketch of
one possible hash-table-based model appears after this question list.)
- Train your program using the first two debates. Then, using data
from the third debate, create a table showing how often your program
misclassified a Bush quote (i.e., incorrectly attributed the quote to
Kerry) and how often it misclassified a Kerry quote, giving both as a
function of the order parameter k. Use a range of values for k, such
as 0, 1, 2, 3, 5, 7, 10, 20, 50. (A sketch of one possible
classification driver also appears after the question list.)
- Write a brief paragraph exploring the effectiveness of this technique
for this problem. You can do this however you wish, but try to be
creative, thoughtful, perceptive, critical, and objective. Your
observations can be quantitative and systematic, or they can be more
anecdotal. You are welcome (but not required) to do experiments beyond
the one in the previous question. Feel free to submit graphical plots,
if appropriate. You might think about questions like the following:
Which values of k seem to be most effective? Which aspects of this
technique's behavior matched your expectations, and which seemed
surprising? At an intuitive level, how does the method seem to be
working (or not working)? What features of the text seem to have the
greatest influence on its effectiveness? How does this approach compare
to how a human might solve the problem? What ideas do you have for
improving it?
- Any known bugs / limitations?
- List whatever help (if any) you received.
- Describe any serious problems you encountered.
- Any other comments or feedback?
- If you created and explored any other datasets for extra credit, please
tell us about them and describe what you found.
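
The following is a minimal sketch of the kind of hash-table-based,
order-k character model that the time/space question above has in mind.
It is only an illustration: the class and method names (MarkovSketch,
add, logLikelihood), the add-one (Laplace) smoothing, and the choice of
Java are assumptions made here for concreteness, not requirements of
the assignment.

    import java.util.HashMap;

    // Illustrative order-k character model backed by hash tables.
    public class MarkovSketch {
        private final int k;                                  // order parameter
        private final int S;                                  // alphabet size, used for smoothing
        private final HashMap<String, Integer> contextCounts; // occurrences of each k-gram context
        private final HashMap<String, Integer> gramCounts;    // occurrences of each (k+1)-gram

        public MarkovSketch(int k, int alphabetSize) {
            this.k = k;
            this.S = alphabetSize;
            this.contextCounts = new HashMap<String, Integer>();
            this.gramCounts = new HashMap<String, Integer>();
        }

        // Record every length-(k+1) window of a training block: O(N) windows
        // over a file of size N, each a constant-time hash-table update when
        // k is treated as a constant.
        public void add(String text) {
            for (int i = 0; i + k < text.length(); i++) {
                contextCounts.merge(text.substring(i, i + k), 1, Integer::sum);
                gramCounts.merge(text.substring(i, i + k + 1), 1, Integer::sum);
            }
        }

        // Log likelihood of a test string under the model, with add-one
        // smoothing so that unseen (k+1)-grams do not contribute log(0).
        public double logLikelihood(String text) {
            double logL = 0.0;
            for (int i = 0; i + k < text.length(); i++) {
                int c = gramCounts.getOrDefault(text.substring(i, i + k + 1), 0);
                int n = contextCounts.getOrDefault(text.substring(i, i + k), 0);
                logL += Math.log((c + 1.0) / (n + S));
            }
            return logL;
        }
    }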
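
And a sketch of how the classification experiment for the table above
might be driven. Again, everything specific here is assumed for
illustration: the transcript file names, the rough alphabet-size guess
of 128, and the rule that a quote is attributed to whichever speaker's
model assigns it the higher log likelihood. It relies on the
MarkovSketch class sketched above; in practice you would loop over all
quotes from the third debate and over each value of k to fill in the
table.

    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Illustrative driver: train one model per speaker on the first two
    // debates, then attribute a single quote from the third debate.
    public class ClassifySketch {
        public static void main(String[] args) throws Exception {
            int k = Integer.parseInt(args[0]);
            MarkovSketch bush  = new MarkovSketch(k, 128);  // 128 = crude ASCII alphabet guess
            MarkovSketch kerry = new MarkovSketch(k, 128);
            bush.add(new String(Files.readAllBytes(Paths.get("bush1.txt"))));
            bush.add(new String(Files.readAllBytes(Paths.get("bush2.txt"))));
            kerry.add(new String(Files.readAllBytes(Paths.get("kerry1.txt"))));
            kerry.add(new String(Files.readAllBytes(Paths.get("kerry2.txt"))));

            // Attribute the quote to whichever model gives the higher log likelihood.
            String quote = new String(Files.readAllBytes(Paths.get("quote.txt")));
            String guess = bush.logLikelihood(quote) > kerry.logLikelihood(quote)
                         ? "Bush" : "Kerry";
            System.out.println(guess);
        }
    }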