You may discuss the general methods of solving the problems with other
students in the class. However,
each student must work out the details and write up his or her own
solution to each problem
independently.
Some problems have been used in previous offerings of COS 435. You
are NOT allowed to use any solutions posted for previous offerings of
COS 435 or any solutions produced by anyone else for the assigned
problems. You may use other reference materials; you must
give citations to all reference materials that you use.
Problem 1
In the vector model, the document vectors are determined at the time
the documents are initially processed, just as the inverted index
is built at the time the documents are processed. In fact,
the set of vectors for the documents can be concisely stored in an
inverted index.
Part a.
What does the inverted index look like if the only information stored
is that found in the document vectors? Be precise.
Part b.
We discussed many factors that could be considered in ranking documents
with respect to a query. Name two or more properties
of document contents that could be used with our "enhanced
document model"
(slides 27 and 28 of classic
information
retrieval), but not in the
vector model. Explain why. (Clarification:
"containing the query term" and "query term appears in title" are
examples of properties of document contents. Both of these can be
used in the vector model.)
Problem 2
In
class we
discussed ways one might add ranking to the Boolean model
of queries. In this problem, we consider in more detail combining
Boolean queries with a scoring function for ranking. In the
combined model, a document will still only satisfy a query if it
evaluates to "true". Consider only Boolean queries in
which NOT is used only to
negate terms, not longer expressions. That is, (NOT("cat") OR
NOT("dog")) is fine, but NOT("cat" AND "dog")
is
not.
(Aside beyond scope of this course:
DeMorgan's Law tells us we lose no expressive power by doing this.)
Part a:
One way to assign scores to the satisfying documents for a Boolean
query that meets our condition
on the use of NOT is to add up the
frequencies in the documents of all terms that appear in the Boolean
query in un-negated form, even for those terms that also appear in
negated form. What is the result of applying this scoring method
to a collection of documents for the query (("cat" AND "dog")
OR (NOT("cat") AND
NOT("dog")))? (Terms "cat" and "dog"
appear in both negated and un-negated form.) What problem
or problems do you find with the result?
Part b:
Propose a specific
method of scoring documents
that tries to deal with the problem(s) you found in Part a.
Your method should
apply
to any Boolean query, i.e. any
Boolean combination of terms, that meets our
condition on the use of NOT. Assume the
"bag of words" model for documents, but you need not actually make use
of the term frequencies. Specify your scoring technique as a
formula or as an
algorithm with enough detail that
someone else can execute it (do not give implementation details).
Give intuitive justification for your method. Do a small example
of
your choosing to
illustrate your
technique. Note that the intent is not for you to spend hours
working up a complicated algorithm with a complicated justification,
but rather to propose an idea and explain it in enough detail that we
understand what you are proposing.
Problem 3
Consider a
collection of 25,000 documents. Each
document is assigned a non-empty ordered list of at
least 1 and at most
10 keywords by the authors of the document. The
keywords
come
from a fixed list of 100 keywords given to
the authors by the publisher. The order of the keywords in the list of
keywords
for a document reflects the order of importance of the keywords in the
document: the first keyword is the most important keyword, although it
could be
tied in importance with the second keyword, etc. (Imagine
the
ACM
giving authors of all technical papers a
list of 100 keywords related to computer science topics and asking each
author
to choose and order by importance at least 1 and at most 10 keywords
that best
describe the topic of the paper.)
A query on
the collection of documents is an
arbitrarily long list of keywords from among the 100 keywords. Any scoring of documents with respect
to a query should have the following properties:
Note that the
list of keywords for a query is not
ordered by importance.
Part
a
Propose a
100-dimensional vector model for
documents and queries that tries to capture the scoring properties
given above. What are the components wi,d
of a
document
vector d?
Of a query vector? What is
the similarity (or distance) function between document vectors and
query
vectors for your model?
Part
b
Show the
vectors for the following 3 documents
and 2 queries and the score for each document with respect to each
query:
document A
has keyword
list “agents, interfaces”
document B
has keyword
list “network, protocol, interfaces, instrumentation”
document C
has keyword
list “interfaces, proxy, agents”
query 1: interfaces
query 2:
interfaces,
agents
Part
c
How well does
your vector model capture the
scoring properties given at the beginning of this problem description? Be specific.
Part d
Does your vector model scoring distinguish between different documents
that contain the same number of the query keywords, but different query
keywords, for a multi-keyword query? Explain.
Part e
Does your vector model scale to full-text documents and a large
lexicon of terms? Explain. Do you think it is
practical?