You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.
Problem 1 Consider an inverted file of occurrence lists accessed through a B+ tree on the index terms. Each leaf of the B+ tree holds a sublist of alphabetically consecutive index terms, and, with each term, a pointer to the list of occurrences within documents for that term. Suppose there are 16 million index terms for a collection of 32 million documents of total size 200 gigabytes. We would like each internal node of the B+ tree and each leaf of the B+ tree to fit in one 8-kilobyte page of the file system.
Part a Recall that a B+ tree has a parameter m called the order of the tree, and each internal node of a B+ tree has between m and 2m children (except the root, which has between 2 and 2m). Find a value for the order m of the B+ tree so that one 8 kilobyte page can be assigned to each internal node and leaf, and so that an internal node will fill, but not overflow, its page when it has 2m children. Assume that each index term is represented using 8 bytes. If you need to make additional assumptions, state what assumptions you are making.
Part b For your m of Part a, what is the estimated height of the B+ tree?
Problem 2 In Problem 1, we assumed each index term was represented using 8 bytes. Clearly, many English words are longer than 8 characters, so it cannot be that the full ASCII representation is being used for each index term. Assuming the index terms are English words, what technique or techniques might be used to allow index terms to be represented in 8 bytes? For any technique you suggest, explain the costs of this technique both in preprocessing and when considering the processing of a single-word query.