COS 126: Assignment 7. Text indexing

COS 126

Text indexing

Programming Assignment 7

Due: Wednesday, 11:59pm

Write a program that builds an index from a large piece of text, then does fast searches to find the position of ``query'' strings within the text.

Your program should take two filenames as command-line arguments: the text file (which we refer to as the corpus) and a file containing the queries. Assume that both files contain only lower-case letters, blanks, and newline characters, with queries in the query file delimited by newline characters. This is not much of a restriction, since you can convert any file to this format by using tr filters, for example as follows:

cat oldtext | tr -cs A-Za-z ' ' | tr A-Z a-z > newtext

If you haven't used tr before, use man tr to find out about it.

Your program should read the corpus, store it as a (potentially huge) string, and possibly build an index for it, as discussed below. Then it should read queries one by one, assuming one query per line in the second named file on the command line, and print out the position in the corpus where each query first appears. For example, for the corpus /u/cs126/data/corpus0:

  it was the best of times it was the worst of times it was the age of 
  wisdom it was the age of foolishness it was the epoch of belief 
  it was the epoch of incredulity it was the season of light it was 
  the season of darkness it was the spring of hope it was the winter 
  of despair

and the query file /u/cs126/data/query0, the result of a.out corpus0 query0 should be

   69 wisdom
  176 season
   87 age of foolishness
   -- age of fools
   19 times

indicating that wisdom appears at character position 71, age of fools does not appear, the first occurrence of times is at character position 19, and so forth.

There are numerous different ways to solve this problem, all with different characteristics in terms of ease of implementation, space requirements, and time requirements. Part of your task for this assignment is to assimilate this information to help you decide which method to use and how to apply it to this specific task.

Begin with a brute-force implementation based on Program 3.15 in the book. That is, don't build an index at all: just search through the corpus for each query string. This solution is fine if the corpus is tiny, or if there are not many queries. A complete solution along these lines will be worth 7 points for this program. But this method is much too slow when the corpus is huge and there are a huge number of queries, so to get a 10 on this program, you will need to implement a faster solution.

One method for making the search fast is to do a pointer sort on the corpus (one pointer for each character position), then use binary search. If you wish to take this approach, you can use the functions qsort and bsearch from the standard C library. Your main challenges in this approach are to fully understand Program 3.17; develop a similar program to build an index of pointers that access the keys in sorted order (as in Figure 3.13); and figure out the necessary interface to bsearch to use the index to do the queries. In particular, you need appropriate ``comparison'' functions (different ones!) for the sort and the search.

Another possible approach is to start with Program 12.10. This code basically provides a full solution, but to get it working, you have to make a number of small changes, both because it differs from the problem specified here in a number of details, and because various small pieces of code are missing. Again, the ``comparison'' function has to be carefully considered.

The files corpus1 and query1 in /u/cs126/data comprise a larger test case, and the files corpus2 and query2 comprise a huge test case. If your program handles the huge case, great! If it doesn't, try to understand the reasons why and document them in your README file, but don't spend an inordinate amount of time on new code to handle the huge case, because there aren't that many points available between the 7 for the brute-force solution and the 10 for a perfect solution.

Important note: the ``training wheels are off'' on this assignment---you're starting with a blank screen, not our code (although you do have a lot of code from the book to work with). This means that you must pay particular attention to the process of building up your program from scratch. Consider carefully how your program should be structured and how you are going to implement and debug the various functions before plunging in. You may wish to add code just for your own use in debugging (for example, to print out the contents of any data structure that you build). Get the brute-force solution working first. Not only will it provide a framework for more complicated solutions (like the code we usually provide) but also it will give you a way to compute correct output for comparison when debugging more complicated solutions.

Extra Credit: Modify your program to count the number of occurrences of each query string.