CS 435 schedule and assignments 2011

Princeton University
Computer Science Dept.

Computer Science 435
Information Retrieval, Discovery, and Delivery

Andrea LaPaugh

Spring 2011


EVOLVING: CHECK BACK FOR UPDATES

WEEK 1
Mon. Jan. 31: Overview of course topics and organization. Models of information.

Class presentation slides (pdf): Introduction
Reading:
- Introduction to Information Retrieval, Preface and Chapter 1.
- Introduction to Information Retrieval, Chapter 19: Introduction, Sections 1 through 4.
Also of interest:

Bush, Vannevar, "As We May Think," Atlantic Monthly, July 1945: The Atlantic html version or ACM page containing link to pdf version (with original illustrations).

Wed. Feb 2: Foundations: classic information retrieval of text. Extending the models.

Class presentation slides (pdf): classic information retrieval
Reading:

Introduction to Information Retrieval, Chapter 6, except 6.1.2, 6.1.3, and 6.4.4.

WEEK 2
Mon. Feb. 7: Ranking, classical

Continuation of material begun Feb 2.

Wed. Feb 9: Ranking, Web

Class presentation slides (pdf): link-based ranking
Problem Set 1 is now available.
Reading:

Introduction to Information Retrieval, Chapter 21.

Also of interest, original papers:

(HITS algorithm) Kleinberg, Jon, Authoritative sources in a hyperlinked environment, Journal of the ACM, Vol. 46, No. 5 (Sept. 1999), pp. 604-632. (Earlier versions appeared in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998 and as IBM Research Report RJ 10076, May 1997.)
(PageRank algorithm) Page, Larry and Sergey Brin, R. Motwani, T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, Stanford Digital Library Technologies Project TR, Jan. 1998. (Early version: L. Page. PageRank: Bringing order to the web. Stanford Digital Libraries Working Paper 1997-0072, Stanford University, 1997.)

WEEK 3
Mon. Feb. 14: Evaluation of retrieval systems

Class presentation slides (pdf): evaluation
Reading:

Introduction to Information Retrieval, Chapter 8

Also of interest:

Text REtrieval Conference (TREC) Proceedings

Wed. Feb. 16: Index structure and use.

Class presentation slides (pdf): index structure and use
Problem Set 1 due today!
Problem Set 2 is now available.
Reading:

Introduction to Information Retrieval, Chapter 2: Sections 3 and 4.
Introduction to Information Retrieval, Chapter 3: Introduction and Section 1.
Introduction to Information Retrieval, Chapter 7 (except 7.1.6).

WEEK 4
Mon. Feb. 21: Index construction

Class presentation slides (pdf): blow-up of B⁺ tree example and index construction
Reading:

Introduction to Information Retrieval, Chapter 2: Sections 1 and 2.
Introduction to Information Retrieval, Chapter 4, except Section 4.

Wed. Feb. 23: Index construction continued; Zipf's law.

Problem Set 2 due today!
Reading:

Introduction to Information Retrieval, Chapter 5: Sections 1 and 2.

Also of interest

Chris Anderson, The Long Tail, Wired, October 2004 (link is to updated version - Dec. 14, 2004).

Thursday Feb. 24: Problem set 3 (pdf) is now available.

WEEK 5
Mon. Feb. 28: Index compression

Project proposal due today.

Wed. Mar. 2: Index compression continued.

Problem Set 3 (pdf) due today!
Summary of compression (pdf)
brief outline of topics for midterm exam (pdf)
Reading:

Introduction to Information Retrieval, Chapter 5: Section 3
(Skip pointers are covered in Section 2.3, assigned earlier.)

Also of interest

A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM Transactions on Information Systems, Vol. 14, No. 4 (Oct. 1996), pgs 349-379.
The Anatomy of a Large-Scale Hypertextual Web Search Engine, (pdf from Stanford publications collection) Brin, Sergey and Page, Lawrence, Proceedings of the Seventh International WWW Conference (WWW 7), 1998.

WEEK 6
Mon. Mar. 7: Distributed computation for index building and query execution.

Class presentation slides (pdf): distributed query execution and index building
Reading:

Introduction to Information Retrieval, Chapter 4: Section 4.
Large-scale Incremental Processing Using Distributed Transactions and Notifications(pdf), Daniel Peng and Frank Dabek, in 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI '10), 2010.

Also of interest

Web Search for a Planet: The Google Cluster Architecture, Luiz Barroso, Jeffrey Dean, and Urs Hölzle, In IEEE Micro, Vol. 23, No. 2, pages 22-28, March, 2003.
"Bigtable: A Distributed Storage System for Structured Data", Fay Chang, et al., In 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), 2006.
MapReduce: simplified data processing on large clusters, Jeffrey Dean and Sanjay Ghemawat, Communications of the ACM, 51(1), Jan. 2008. (Special 50th Anniversary issue: Breakthrough research: a preview of things to come.)
MapReduce: The programming model and practice (pdf of slides), Jerry Zhao, Jelena Pjesivac-Grbovic, SIGMETRICS'09 Tutorial, 2009.
The Apache Hadoop project

Wed. Mar 9: Semi-structured information and XML

Class presentation slides (pdf): semi-structured information
The XML mark-up of Hamlet was done by Jon Bosak and can be found at http://www.cafeconleche.org/examples/shakespeare.
Reading:

Introduction to Information Retrieval, Chapter 10

Also of interest

An XQuery Sandbox example tool that uses XML marked-up Shakespeare plays can be found on the eXist Project Web site. The eXist Project is centered around eXist-db, which is (in their words) "an open source database management system entirely built on XML technology."

Take-home EXAM 1: DISTRIBUTED end of class Wednesday, March 9. DUE 3:00 PM Friday, March 11.

Spring break

WEEK 7
Mon. March 21: canceled

Wed. March 23: Search refinement; using user behavior

Class presentation slides (pdf): search refinement and recommendation methods
Reading:

Introduction to Information Retrieval, Chapter 9
An Analytical Comparison of Approaches to Personalizing PageRank (pdf), Sep Kamvar, Taher Haveliwala and Glen Jeh, Stanford University Technical Report. June, 2003.

Also of interest:

Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions, Gediminas Adomavicius, Alexander Tuzhilin; IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734-749, June 2005.
The Adaptive Web, P. Brusilovsky, A. Kobsa, W. Nejdl, eds., Lecture Notes in Computer Science book series Vol. 4321, Springer, 2007. This book contains several relevant chapters. Chapter 6: Personalized Search on the World Wide Web by A. Micarelli, F. Gasparetti, F. Sciarrone and S. Gauch is of particular interest. The chapters are available as pdf files to members of the Princeton University community by accessing them from the princeton.edu domain.
Google: Bing Is Cheating, Copying Our Search Results by Danny Sullivan, Feb 1, 2011 at 8:45am ET and
Bing Admits Using Customer Search Data, Says Google Pulled ‘Spy-Novelesque Stunt’ by Matt McGee, Feb 1, 2011 at 1:56pm ET, both on Search Engine Land.
Netflix Awards $1 Million Prize and Starts a New Contest by Steve Lohr, New York Times' Bits Blog, Sept. 21 2009. (The new contest was canceled due to privacy concerns.)

Thursday March 24: Problem set 4 is now available.

WEEK 8
Mon. March 28: Clustering

Class presentation slides (pdf): clustering: intro and K-means algorithm
Reading:

Introduction to Information Retrieval, Chapter 16, Introduction and Sections 16.1, 16.2, and 16.4.
Introduction to Information Retrieval, Chapter 17, Introduction and Sections 17.1, 17.2, and 17.6.

Wed. March 30: Clustering continued

Problem Set 4 due today!
Problem Set 5 (pdf) is now available.
Class presentation slides (pdf): clustering: general algorithms
Reading:

Introduction to Information Retrieval, Chapter 17, Sections 17.3, 17.4, and 17.8.

Also of interest:

Introduction to Information Retrieval, Chapter 16, Section 16.3 is recommended if you are going to read research papers on clustering. We will touch on external evaluation criteria very briefly.
Introduction to Information Retrieval, Chapter 17, Sections 17.5 and 17.7 are recommended but not required.

WEEK 9
Project progress reports this week or next - meet with Professor LaPaugh.

Mon. April 4: Latent Semantic Indexing

Class presentation slides (pdf): latent semantic indexing.
Reading:

Introduction to Information Retrieval Chapter 18 (note that section 18.1 is helpful background but not absolutely necessary).

Of interest:

References to Papers on LSI from Telcordia Technologies, where LSI was first developed. Includes link to Deerwester, Dumais et al.

Wed. April 6: Detecting near-duplicate documents

Problem Set 5 (pdf) due today!
Class presentation slides (pdf): near-duplicate document detection
Reading:

Introduction to Information Retrieval, Chapter 19, Section 19.6

Thurs. April 7: Problem Set 6 (pdf) is available.

WEEK 10
Project progress reports this week - meet with Professor LaPaugh unless you met last week.

Mon. April 11: Crawling the Web

Class presentation slides (pdf): web crawling
Reading:

Introduction to Information Retrieval, Chapter 20, Sections 20.1-20.3.

Also of interest:

*Chakrabarti, Soumen, Mining the Web: Discovering Knowledge from Hypertext Data, Chapter 2 and Chapter 8, section 8.3.1
Intelligent Crawling On the World Wide Web with Arbitrary Predicates, Aggarwal, Al-Garawi, and Yu, Tenth International World Wide Web Conference (WWW10), 2001.
Evaluating topic-driven web crawlers, Filippo Menczer, Gautam Pant, Padmini Srinivasan, Miguel E. Ruiz, Proc. Intern. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR Conf.), ACM, 2001, pages 241-249.

Wed. April 13: Focused crawling; Characteristics of the changing Web

Problem Set 6 (pdf) due today!
Class presentation slides (pdf): Web characteristics (focused crawling is covered in the slides posted 4/11)
Also of interest - papers summarized in the presentation on characteristics of the Web:

The Web Changes Everything: Understanding the Dynamics of Web Content, E. Adar, J. Teevan, S.T. Dumais and J. L. Elsas, Intern. Conf. on Web Search and Data Mining (WSDM), ACM, 2009, pgs 282-291.
Recrawl Scheduling Based on Information Longevity, Christopher Olston and Sandeep Pandey, Intern. World Wide Web Conf. (WWW), 2008. (pdf here)
Changes in Webpage Structure over Time (pdf via ftp), M. Dontcheva, S. M. Drucker, D. Salesin, M. F. Cohen, U. Washington CSE Technical Report (TR2007-04-02), April 2007.
What's New on the Web? The Evolution of the Web from a Search Engine Perspective, A. Ntoulas, J. Cho, and C. Olston, Intern. World Wide Web Conf. (WWW), ACM, 2004.
A large-scale study of the evolution of Web pages, D. Fetterly, M. Manasse, M. Najork and J. L. Wiener, Software: Practice and Experience, 34:213–237 (2004) Wiley.
Estimating the Change of Web Pages, Sung Jin Kim and Sang Ho Lee, Intern. Conf. Computational Science (ICCS), Springer, 2007.

WEEK 11
Mon. April 18: canceled

Wed. April 20: Overview exam topics; Non-text retrieval: image retrieval

Class presentation slides (pdf): exam topics
Class presentation slides (pdf): non-text retrieval
Also of interest - papers summarized in the presentation on image retrieval:

Query by image and video content: the QBIC system, Flickner, M., et al., IEEE Computer, IEEE Computer Society, 28(9), pp. 23-32, Sept. 1995.
Image Similarity Search with Compact Data Structures, Qin Lv, Moses Charikar, and Kai Li. 13th Conf. on Information and Knowledge Management (CIKM), ACM, Nov. 2004. (Reports part of Princeton CASS project.)
VisualRank: Applying PageRank to Large-Scale Image Search, Yushi Jing and Shumeet Baluja, IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), p 1877 - 1890, IEEE, 2008.

Also of interest - sites visited in demo

Princeton CASS: Content-Aware Search Systems Demos, click on VARY image

Idee Labs, click on Visual Search Lab

Tiltomo
Google images "similar" option after doing term-based search

WEEK 12
Take-home EXAM 2: DISTRIBUTED end of class Monday, April 25. DUE beginning of class Wednesday, April 27.

Mon. April 25: Issues in searching the modern Web: deep Web

Class presentation slides (pdf): deep Web search
Also of interest - references for today:

Harnessing the Deep Web: Present and Future (pdf), Jayant Madhavan, Loredana Afanasiev, Lyublena Antova, and Alon Halevy, 4th Biennial Conference on Innovative Data Systems Research (CIDR), Jan. 2009.
Searching the deep web, Alex Wright, Communications of the ACM, Vol. 51 No. 10 (Oct. 2008), pages 14-15.
Google's Deep-Web Crawl (pdf), Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Y. Halevy, 34th Intern. Conf. on Very Large Data Bases, VLDB Endowment, Aug. 2008.
Accessing the deep web, Bin He, Mitesh Patel, Zhen Zhang, and Kevin Chen-Chuan Chang, Communications of the ACM, Vol. 50 No. 5 (May 2007), pages 94-101.
Searching for Hidden-Web Databases, Luciano Barbosa and Juliana Freire, Proceedings of the 8th ACM SIGMOD International Workshop on Web and Databases (WebDB), pp. 1-6, ACM 2005. (A more recent, more complicated version of the crawler is described at the 2007 WWW conf.)

Wed. April 27: Wrap-up

Class presentation slides (pdf): themes and future
Also of interest

The Semantic Web, Tim Berners-Lee, James Hendler and Ora Lassila, Scientific American 284(5), May 2001, p. 34-43. (Scientific American is available online through the Princeton University Library.)

In class we saw two illustrations of an OWL ontology; they are from Electronics and Telecommunications Research Institute of Korea ezOWL project, a Semantic Web Ontology Editor.
WWW consortium (W3C) Semantic Web home page
W3C statement of goals for and activities of the Semantic Web initiative
W3C overview of Semantic Web technologies

Supplemental reading: a selection of current research papers on information retrieval, discovery, and delivery

Assessing the scenic route: measuring the value of search trails in web logs, Ryen W. White and Jeff Huang, Proceedings of the 33rd International ACM SIGIR conference on Research and development in information retrieval (SIGIR '10), ACM, July 2010, pp. 587-594.
SourceRank: Relevance and Trust Assessment for Deep Web Sources Based on Inter-Source Agreement (pdf) Raju Balakrishnan and Subbarao Kambhampati, Proceedings of the 20th World Wide Web Conference (WWW '11), April 2011, pp 227-236.
Search Result Diversity for Informational Queries (pdf) Michael J. Welch, Junghoo Cho, Christopher Olston, Proceedings of the 20th World Wide Web Conference (WWW '11), April 2011, pp 237-246.
Inverted Index Compression via Online Document Routing (pdf), Lavee, Lempel, Liberty and Somekh, Proceedings of the 20th World Wide Web Conference (WWW '11), April 2011, pp 287-295.

* on reserve in the Engineering Library
