Goals
Solve a text database searching problem.
Learn to consider the performance characteristics of different algorithms that solve the same problem.
Learn to use binary search trees.
Checklist
Both the BST method and the qsort method are capable of solving the text-indexing problem for the supplied data in a reasonable amount of time. We encourage you to think of your own method for solving the problem. If it's faster, great. If not, that's OK too, since you will have learned more by experimenting.
Use the submit command:
/u/cs126/bin/submit 7 readme index.c
Also submit any other auxilliary files that are needed to compile your program, e.g.,
BST.c
(It's not necessary to submit a brute force solution, if you submit a different method.)
readme
name, precept number
Give a high-level description of the basic approach you took to solve the problem. Indicate advantages and disadvantages of your method, and why you chose it.
for each file you submit, indicate a high level description and any problems encountered
let us know how to compile your code (all files necessary should be included)
output of
query0
with
corpus0
,
query1
with
corpus1
, and first and last 15 lines of output from
query2
with
corpus2
(for
query0
you should get the following
answer
)
indicate how long it takes for
query2
to complete - use the Unix
/usr/bin/time
command, e.g.,
/usr/bin/time a.out corpus2 query2 > out
executing
Your program should have the following calling syntax:
a.out corpus0 query0
For the large files, you should redirect the output to a file using
a.out corpus2 query2 > out
. Otherwise, the output will be a blur. Also, printing the output directly to the screen will take more time than the actual computation!
reading in input
You should be able to recycle the code from Assignment 6 to read in the two files. (Don't forget that when you write
while((c = getc(file)) != EOF)
that
c
must be an
int
and not a
char
.) Instead, you may want to consider using the function
fgets
for reading in the query file.
The large files
corpus2
and
query2
contain less than 1,200,000 and 400,000 characters respectively.
Convert newline characters to spaces when reading in corpus (otherwise you won't be able to match phrases that span multiple lines). The query strings in the query files are delimited by newline characters.
Make sure you get this part debugged before continuing - as sanity checks you can use
printf
and
strlen
to make sure the strings have been read in correctly.
brute.c
If you choose to only turn in the brute force method, do not use any string library functions (e.g.,
strchr
), although you may still want to use them for debugging purposes.
Try your brute force method on
corpus2
and
query2
, but after seeing how slow it is, use
Ctrl-c
to stop. Otherwise, you will waste hours of shared computing resources.
BST approach
It is reccommended that you start with brute force approach if you are not totally confident with strings.
Can start with the Sedgewick code. Some people find it easier to start from scratch.
The string library function
strcmp
,
strncmp
, and
strlen
may be useful. (Remember that
strlen
is not free - do not excessively overuse.)
Use
lcc -n
to check if your program accidentally dereferences any null pointers.
quicksort approach
One obstacle is understanding the library functions
bsearch
and
qsort
. This will require advanced knowledge of pointers. If you use this method, you must document and explain how pointers are being used, i.e., it is not sufficient to use trial and error to figure out where to put the
&
's and
*
's - you must understand why.
Alternatively, you can write your own sorting and binary search routines.
Be sure to find the
first
occurrence of each query string. You will have to modify the binary search method.
hashing
This is an approach similar to the
code
and
decode
method from Assignment 6. Now the keys are much longer, so you will not be able to find a one-to-one mapping between keys and integers (e.g.,
aaa = 0, aac = 1, ..., ttt = 63
).
Write a
hash
function that converts the first
k
characters into an integer between 0 and
T
. (You will have to carefully determine good values for
k
and the table size
T
.)
Now, several strings may "hash" to the same integer. So, for each hash value you will need to store a list of corpus strings that match. When you search for a particular query string, you determine its hash value. It now suffices to search the list corresponding to that hash value only. This list will hopefully be very small, and brute force search of this list will be fast.
Be careful about handling query phrases less than
k
letters long - your code may need some minor modificiations.
other approaches
Too numerous to list. You are encouraged to experiment.
Some may be much faster on certain inputs, some will be much slower, some may be more complicated, some may be simpler. None is perfect. Learn to make tradeoffs in computing and life.
profiling
Feel free to experiment with the
lcc -b
profiling capability to see how many times each line of your code gets executed (see Lecture Note 23.4). It's easy to use and will enable you to quickly pinpoint where the bottleneck computation is in your algorithm.
extra credit
Find all occurrences of a given query string. Depending on your method, this may require only 5 or so extra lines of code, so give it a shot. No extra credit awarded here if you do a brute force approach.
Here's an easy but "malicious" input with only 1000 words in the corpus and 7000 query phrases:
query3
,
corpus3
. Design an improved algorithm that does not grind to halt, even on "pathological" inputs.