COS 126 Assignment 7 Checklist

Goals

Solve a text database searching problem.
Learn to consider the performance characteristics of different algorithms that solve the same problem.
Learn to use binary search trees.

Checklist

Both the BST method and the qsort method are capable of solving the text-indexing problem for the supplied data in a reasonable amount of time. We encourage you to think of your own method for solving the problem. If it's faster, great. If not, that's OK too, since you will have learned more by experimenting.

Use the submit command: /u/cs126/bin/submit 7 readme index.c Also submit any other auxilliary files that are needed to compile your program, e.g., BST.c (It's not necessary to submit a brute force solution, if you submit a different method.)

readme

name, precept number

Give a high-level description of the basic approach you took to solve the problem. Indicate advantages and disadvantages of your method, and why you chose it.

for each file you submit, indicate a high level description and any problems encountered

let us know how to compile your code (all files necessary should be included)

output of query0 with corpus0, query1 with corpus1, and first and last 15 lines of output from query2 with corpus2 (for query0 you should get the following answer)

indicate how long it takes for query2 to complete - use the Unix /usr/bin/time command, e.g., /usr/bin/time a.out corpus2 query2 > out

executing

Your program should have the following calling syntax: a.out corpus0 query0

For the large files, you should redirect the output to a file using a.out corpus2 query2 > out. Otherwise, the output will be a blur. Also, printing the output directly to the screen will take more time than the actual computation!

reading in input

You should be able to recycle the code from Assignment 6 to read in the two files. (Don't forget that when you write while((c = getc(file)) != EOF) that c must be an int and not a char.) Instead, you may want to consider using the function fgets for reading in the query file.

The large files corpus2 and query2 contain less than 1,200,000 and 400,000 characters respectively.

Convert newline characters to spaces when reading in corpus (otherwise you won't be able to match phrases that span multiple lines). The query strings in the query files are delimited by newline characters.

Make sure you get this part debugged before continuing - as sanity checks you can use printf and strlen to make sure the strings have been read in correctly.

brute.c

If you choose to only turn in the brute force method, do not use any string library functions (e.g., strchr), although you may still want to use them for debugging purposes.

Try your brute force method on corpus2 and query2, but after seeing how slow it is, use Ctrl-c to stop. Otherwise, you will waste hours of shared computing resources.

BST approach

It is reccommended that you start with brute force approach if you are not totally confident with strings.

Can start with the Sedgewick code. Some people find it easier to start from scratch.

The string library function strcmp, strncmp, and strlen may be useful. (Remember that strlen is not free - do not excessively overuse.)

Use lcc -n to check if your program accidentally dereferences any null pointers.

quicksort approach

One obstacle is understanding the library functions bsearch and qsort. This will require advanced knowledge of pointers. If you use this method, you must document and explain how pointers are being used, i.e., it is not sufficient to use trial and error to figure out where to put the &'s and *'s - you must understand why.

Alternatively, you can write your own sorting and binary search routines.

Be sure to find the first occurrence of each query string. You will have to modify the binary search method.

hashing

This is an approach similar to the code and decode method from Assignment 6. Now the keys are much longer, so you will not be able to find a one-to-one mapping between keys and integers (e.g., aaa = 0, aac = 1, ..., ttt = 63).

Write a hash function that converts the first k characters into an integer between 0 and T. (You will have to carefully determine good values for k and the table size T.)

Now, several strings may "hash" to the same integer. So, for each hash value you will need to store a list of corpus strings that match. When you search for a particular query string, you determine its hash value. It now suffices to search the list corresponding to that hash value only. This list will hopefully be very small, and brute force search of this list will be fast.

Be careful about handling query phrases less than k letters long - your code may need some minor modificiations.

other approaches

Too numerous to list. You are encouraged to experiment.

Some may be much faster on certain inputs, some will be much slower, some may be more complicated, some may be simpler. None is perfect. Learn to make tradeoffs in computing and life.

profiling

Feel free to experiment with the lcc -b profiling capability to see how many times each line of your code gets executed (see Lecture Note 23.4). It's easy to use and will enable you to quickly pinpoint where the bottleneck computation is in your algorithm.

extra credit

Find all occurrences of a given query string. Depending on your method, this may require only 5 or so extra lines of code, so give it a shot. No extra credit awarded here if you do a brute force approach.

Here's an easy but "malicious" input with only 1000 words in the corpus and 7000 query phrases: query3, corpus3. Design an improved algorithm that does not grind to halt, even on "pathological" inputs.