You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.
Problem 1
In this problem you will compute hubs and authority weights and
pageranks
for the following graph:
|------<------^That is, there are edges from Node 1 to Node 2, from Node 2 to Node 3, from Node 3 to Node 4, from Node 4 to Node 2 and from Node 4 to Node 3.
V |
Node_1 ---> Node_2 ---> Node_3 ---> Node_4-|
^ |
| |
|------------<------------
Both hub and authority weights and pageranks are obtained from
iterative
formulae. You may write a small program to calculate them, use a math
computation package such as Matlab, or do the calculation by hand --
whatever
you find easiest. If possible, give the values after each iteration to
show how the values converge. If you do the calculations
by hand, you need to do enough iterations to see that the values are
converging; however, I do not ask that you do more than 5 iterations.
If you do the calculations by computer, say what convergence criteria
you
are using. Be sure to read the
notes below on the algorithms be for beginning.
Specifically compute:
Notes on the HITS (hubs and authorities) algorithm:
Use the algorithm I gave in class as pseudo-code: for a
the vector of authority values, h the vector of hub
values, and E the adjacency
matrix:
Problem 2
Consider an inverted index containing, for each term, the posting
list (i.e. the list of documents and occurrences within documents) for
that term. The posting lists are accessed
through a B+ tree with the terms serving as search keys. Each leaf of the B+ tree holds a
sublist of alphabetically consecutive terms, and, with each term,
a pointer to the posting list for that term.
Part a. An artificially small example of a B+ tree is shown here (pdf). (Note only part of the tree is shown in detail.) What nodes of the example B+ tree are visited to find the posting list for "dune"?
Part b. Suppose there are 2 million terms for a collection of 32 million documents of total size 200 gigabytes. We would like each internal node of the B+ tree and each leaf of the B+ tree to fit in one 8-kilobyte page of the file system. Recall that a B+ tree has a parameter m called the order of the tree, and each internal node of a B+ tree has between m+1 and 2m+1 children (except the root, which has between 2 and 2m+1). Assume that each term is represented using 16 bytes, and each pointer to a child in the tree or to a posting list is represented using 8 bytes. Find a value for the order m of the B+ tree so that one 8 kilobyte page can be assigned to each internal node and leaf, and so that an internal node will fill, but not overflow, its page when it has 2m+1 children. If you need to make additional assumptions, state what assumptions you are making.
Part c. For your m of Part b, estimate the height of the B+ tree. (Giving a range of heights is fine.) Also estimate the amount of memory needed to store the tree, including leaves but not including the posting lists themselves.
Part d. Estimate the
aggregate size of the posting lists.
Problem 3
Part a. Give the
Elias-delta code for each of these integers: 60, 600, 4100.
How many bits does each encoded integer require?
Part b. Give the Golumb
code with b=256 for the same integers as in Part a. How many bits
does each encoded integer require?
Part c. Is one code
preferable if you are encoding a sequence of gaps, where the gaps have
average size of 600 and range from 60 to 1200? What if the
gaps have average size 600 and range from 6 to 6000?