COS
435,
Spring
2011 - Problem Set 4
Due
at 1:30pm, Wednesday, March 30, 2011
Collaboration and Reference
Policy
You may discuss the
general
methods of solving the problems with other students in the class.
However, each
student must work out the details and write up his or her own solution
to each
problem independently.
Some problems
have
been used in
previous offerings of COS 435. You are NOT allowed to use any solutions
posted
for previous offerings of COS 435 or any solutions produced by anyone
else for
the assigned problems. You may use
other reference materials; you must give citations to all reference
materials
that you use.
Lateness Policy
A late penalty will be applied,
unless there are extraordinary circumstances and/or prior arrangements:
- No penalty if in Prof. LaPaugh's office or inbox by 5pm Wednesday
(3/30/11).
- Penalized 10% of the earned score if submitted by 11:59 pm Wed
(3/30/11).
- Penalized 25% of the earned score if submitted by 5pm
Friday (4/1/11).
- Penalized 50% if submitted later than 5pm Friday
(4/1/11).
Problem 1 (Distributed computing)
Part a.
A collection of Web documents is processed and
one of the results is a set of tuples of the form: (docURL, list
of URLs), one tuple per docURL. The list of URLs is the list of
all URLs that the document identified by docURL links to, i.e. the list
of URLs appear as values of "href = ..." in the document. This is
the "to list" - the list of URLs the document points
to. We want to produce the "from list" for the collection
in the form (docURL, list of URLs), where the list of URLs is the list of all URLs that point to the
document
identified by
docURL.
Use the MapReduce programming model to map the set of tuples (docURL,
"to list" of URLs) to the set of tuples (docURL, "from list" of URLs).
Part b.
A problem with using URLs is that a document may be referred to
by several URLs, e.g. due to mirror sites or shortcuts. What we
really want is a set of tuples (docID, "from list" of docIDs), where
each document has a unique docID. Suppose that the "to list"
tuples are of the form (docID, list of URLs), built by processing
the document by docID but leaving the URLs appearing in the document as
is. Suppose there is also a set of tuples (URL, docID) giving
the unique document ID associated with each URL. Give one or more
successive MapReduce computations to build the "from list" as tuples (docID, "from list" of docIDs) from
the new "to list" and URL-docID mapping. You
may build on Part a or start over.
Notes: Slide 10 of the distributed
query
execution
and index building slides suggests that the results of
Reduce must be in terms of the intermediate key. The intermediate
key and the output key are the same for
the inverted index example, but not in general. The form of the
output is decided by the user writing the Reduce function.
Also, successive MapReduce computations need not have their inputs
restricted to the output of the previous MapReduce computation.
Problem 2 (Compression )
Part a.
Give the variable-byte encoding
for each of these integers: 60, 32774 , 230-1.
How
many bits
does each encoded integer require? How many bytes?
Part b.
Give the
Elias-gamma code for the same integers as in Part a.
How many bits does each encoded integer require? How
many bytes (no fractions of bytes!)?
Part c.
Give the
Elias-delta code for the same integers as in Part a.
How many bits does each encoded integer
require? How many bytes? Note
that Introduction
to
Information
Retrieval does not cover Elias-delta
encoding. Please use the
class notes posted under March 2
for the
definitions of Elias-gamma and Elias-delta encodings.
Part d.
Decode the following sequence of
numbers
encoded with the Elias-delta
encoding:
100111100100101101010