You may discuss problems with other students in the class. However, each student must write up his or her own solution to each problem independently. That is, while you may formulate the solutions to problems in collaboration with classmates, you must be able to articulate the solutions on your own.
Problem 1
We have discussed in class that one of the benefits of latent
semantic
indexing is that it reduces the dimension of the document
vectors. A much
simpler way to reduce the dimension is to reduce the number of
terms. Consider the
following algorithm for choosing only K terms from among the full set
of index terms as representative terms for a collection of documents:
Part a. There
is an example of the latent semantic indexing calculation for a matrix
of 12 terms by 9 documents in the paper "Indexing
by latent semantic analysis." by Deerwester, S. et. al. whose
precision-recall results we looked at in class on 2/28/06. The
original matrix
is in Table 2 on page 10. It is presented as two groups of
documents (c1-c5 and m1-4). Execute the new algorithm described
above for K=2 on the same example. (You may do this by hand or
write a short program in the language of your choice. An ascii file of
the 0/1 entries of the 12 rows × 9 columns matrix can be found here.) What are the two terms
used?
Part b. Is the new
algorithms successful on the example of Part a? What are you using as
criteria for success? How does the result
compare
to the rank 2 latent semantic indexing representation presented in the
paper? (You will find the actual final decomposition for rank 2
LSI in the paper's Appendix and find plots of terms and documents in
the 2 dimensions
within the paper. Note that the paper uses different notation
than we used in class: they use X= TSDT to
denote the SVD and we used D=KSP, "D" playing a different role in our
notation. )
Part c. What
characteristics should a collection of document have to have the best
results from using only a subset of the index terms as
described above? Are certain values of K better than others?
Problem 2 What is the computational cost (running-time) of doing a comparison of a query to all documents after latent semantic indexing has been used to express the term-document matrix D in terms of matricies Ks, Ss, and Ps for the rank-s approximation (using our class notation)? Do not include the preprocessing cost to find matrices Ks, Ss, and Ds. You should list each step of the computation to compare a query expressed as a vector of weights for t index terms to all N documents. Your analysis should be in terms of t, N, and the reduced rank s after latent semantic indexing as been applied.
Problem 3 Consider a hybrid query denoted (t1, t2, ... tk, MUST(s1,
s2, ... sm)), where t1, ... tk and s1, ... sm are index terms. The
semantics of this
query is that only documents containing all terms s1, s2, .. sm
are considered matches to the query. These matching documents are
ranked
by considering the index terms t1, ..., tk and using a tf.idf ranking.
Note that if one wishes to use a term both for filtering and for
ranking,
the term must be listed among the si and the tj, respectively.