COS 226 Programming Assignment 5

Repeat search

Implement a redundancy detector. Design and implement an algorithm for finding the longest repeated sequence in a long string consisting of lower-case letters and spaces. Assume that the text is typical English-language text; not necessarily random, but also necessarily not degenerate. That is, you can't base your algorithm on the string being purely random, but, on the other hand, you needn't worry about degenerate cases (for example, strings consisting of all blanks).

You will first want to implement a brute-force method to solve the problem, to get some idea of how much computation is required and to have a reliable check for short strings when debugging. The brute force algorithm checks the length of the match at every possible pair of starting positions, so it uses a quadratic number of character comparisons, with the constant of proportionality depending on the match lengths. After getting the brute-force method working, you may wish to proceed by improving that method (remove obvious inefficiencies) or by devising some completely different method. This problem is amenable to attack with a variety of the algorithmic tools that we have covered in class, though it's not at all clear which approach will yield the fastest solution.

You may predeclare a large array to hold the string and simply read it in a character at a time, for example, with the statement while (( a[N++] = getchar()) != EOF) ;). You may use a reasonable amount of extra space to help speed up your algorithm, but, as usual, don't throw space away.

Assume that the string contains only lower-case letters and spaces, with upper- and lower-case letters equivalent and strings of nonletters collapsed to a single space. You may choose to modify the input statement above to perform this function, or you can convert an arbitrary input file oldtext to this format with a filter like cat oldtext | tr -cs A-Za-z '' | tr A-Z a-z | a.out If you haven't used tr before, use man tr to find out about it.

The "inner loop" of your program will involve accessing characters in the string, in most cases comparing them, looking for repeat strings. Keep a count of the number of characters accessed done by your algorithm, and provide an estimate of how this count grows, as a function of the string size. A good algorithm will access only a few characters per character in the string, but your first attempts (after brute-force) might look at dozens or even hundreds of characters per character.

Name your program strmatch. Read the string from standard input and print four integers on standard output, in this order: length of match, position of first substring, position of second substring, and number of characters accessed.

Advice. We have studied some very complex algorithms that might do well on this problem, but there are also some relatively simple ideas that are powerful.

Due: 11:59PM Sunday, March 29.