Computer Science 435
Information Retrieval, Discovery, and Delivery
Andrea LaPaugh
Spring 2015
EVOLVING: CHECK BACK FOR UPDATES
The reading for a class should be completed before class.
WEEK 1
Mon. Feb. 2: Overview of
course topics and organization. Models of information.
- Class presentation slides (pdf): introduction
- Reading:
- Optional Reading:
Wed. Feb 4: Foundations: classic information
retrieval of text.
- Class presentation slides (pdf): classic IR
- Reading:
- Optional Reading:
- Assignment 0: register on Piazza if you haven't already, and add yourself to COS 435.
WEEK 2
Mon. Feb. 9: Extending
the models. Using links.
- Class presentation slides: link analysis for ranking
- Reading
- Also of interest, original papers:
- (HITS algorithm) Kleinberg, Jon, Authoritative
sources in a hyperlinked environment, Journal of the
ACM, Vol. 46, No. 5(Sept. 1999), pp.604-632.
(Earlier versions appeared in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms,
1998 and as IBM Research Report RJ 10076, May 1997.)
- (PageRank algorithm) Page, Larry and Sergey Brin, R.
Motwani, T. Winograd, The PageRank Citation Ranking:
Bringing Order to the Web, Stanford
Digital Library Technologies Project TR, Jan. 1998.
(Early version: L. Page. PageRank: Bringing order to the web.
Stanford Digital Libraries Working Paper 1997-0072, Stanford
University, 1997. )
Wed. Feb. 11: Evaluation
of retrieval systems
- Assignment 1 is now available - due 2/18
- Class presentation slides (pdf): evaluation
- Reading:
- Optional Reading:
- Also of interest:
WEEK 3
Mon. Feb. 16:
Evaluation, cont.; Index structure and use.
Wed. Feb 18: Index structure and use continued
- Assignment 1 (pdf) due today!
- no new notes or required reading
- Recommended reading:
- details on B+ trees in Database Management Systems by
Raghu Ramakrishnan and Johannes Gehrke (Third
Edition, McGraw-Hill, 2003): Chapter 10, Sections 3-6
(pp. 344-356). Book on reserve in Engineering Library.
Thurs. Feb 19: assignment 2 is
now available - due Feb 25.
WEEK 4
Mon. Feb 23: Index construction
Wed. Feb 25: Index compression
Thurs. Feb 26: Assignment 3 (pdf) is now
available - due Mar. 4.
WEEK 5
Mon. Mar 2:
Index compression continued
- Notes on compression lecture - Part 2 (pdf)
- Reading:
- Also of interest
- A. Moffat and J. Zobel, Self-indexing inverted files for fast text retrieval, ACM
Transactions on Information Systems, Vol. 14, No. 4
(Oct. 1996), pp. 349-379.
- The Anatomy of a Large-Scale Hypertextual Web Search Engine (pdf from
Stanford publications collection), Brin, Sergey and Page,
Lawrence, Proceedings of the Seventh International WWW
Conference (WWW 7), 1998.
Wed. Mar 4: Distributed computation for index
building and query execution.
- Assignment 3 due today!
- Class presentation slides (pdf): distributed computing
- Reading:
- Optional reading:
- Also of interest
- Web Search for a Planet: The Google Cluster Architecture,
Luiz Barroso, Jeffrey Dean, and Urs Hölzle, IEEE
Micro, Vol. 23, No. 2, pages 22-28, March 2003.
- Bigtable: A Distributed Storage System for Structured Data,
Fay Chang, et al., 7th USENIX
Symposium on Operating Systems Design and Implementation
(OSDI '06), 2006.
- MapReduce: simplified data processing on large clusters, Jeffrey Dean and Sanjay
Ghemawat, Communications of the ACM, 51(1), Jan. 2008.
(Special 50th Anniversary issue: Breakthrough
research: a preview of things to come.)
- The Apache Hadoop project
WEEK 6
Mon. Mar. 9: Distributed computation for index
building (continuation of Mar. 4); Crawling the Web
- Class presentation slides (pdf): Web crawling
- Reading:
- Also of interest:
- Chakrabarti, Soumen,
Mining the Web: Discovering Knowledge from Hypertext Data, Chapter
2 and Chapter 8, section 8.3.1
Wed Mar. 11: Semi-structured
information and XML
Take-home midterm exam distributed Wednesday March 11, 2015 at the end
of class. Due 4:30 PM sharp Friday March 13, 2015.
Spring break
WEEK 7
Mon. March 23: Using user behavior: search refinement and recommender
systems
- Class presentation slides (pdf): search refinement
and personalization
- Reading:
- Also of interest:
- Personalizing Web Search using Long Term Browsing History, Matthijs
and Radlinski, International Conf. on Web Search and
Data Mining (WSDM), ACM, 2011.
- A Large-scale Evaluation and Analysis of Personalized Search
Strategies (pdf), Song and Wen, Sixteenth Intern. World Wide Web
Conference (WWW2007), 2007.
- Time is of the Essence: Improving Recency Ranking Using Twitter
Data, Anlei Dong et al., Proc. Intern. Conf. on
World Wide Web (WWW), ACM, 2010, pp. 331-340.
- The Adaptive Web, P.
Brusilovsky, A. Kobsa, W. Nejdl, eds., Lecture Notes in Computer
Science book series Vol. 4321, Springer, 2007.
This book contains several relevant chapters. Chapter 6, Personalized Search on the World
Wide Web by A. Micarelli, F. Gasparetti, F. Sciarrone,
and S. Gauch, is of particular interest. The chapters are
available as pdf files to members of the Princeton University
community by accessing them from the princeton.edu domain.
- Google: Bing Is Cheating, Copying Our Search Results by Danny
Sullivan, Feb 1, 2011 at 8:45am ET, and
Bing Admits Using Customer Search Data, Says Google Pulled
‘Spy-Novelesque Stunt’ by Matt McGee, Feb 1, 2011 at 1:56pm ET,
both on Search Engine Land.
Wed. March 25: Recommender
systems: collaborative filtering
- Assignment 4 (pdf) is now available - due Apr. 1.
- Class presentation slides (pdf): recommender systems and search
- Reading:
- Also of interest:
- Matrix Factorization Techniques for Recommender Systems, Koren,
Bell and Volinsky, IEEE Computer, 42(8), August 2009,
pp. 42-49.
- Modeling Relationships at Multiple Scales to Improve Accuracy of
Large Recommender Systems, Bell, Koren and
Volinsky, International Conf. on Knowledge Discovery and Data Mining
(KDD), ACM, 2007.
- Scalable Collaborative Filtering with Jointly Derived
Neighborhood Interpolation Weights, Bell and
Koren, IEEE International Conference on Data Mining, 2007.
- Netflix Awards $1 Million Prize and Starts a New Contest by
Steve Lohr, New York Times'
Bits Blog, Sept. 21, 2009. (The new contest
was canceled due to privacy concerns.)
WEEK 8
Mon. March 30: Collaborative filtering, continued; Latent semantic indexing
Wed. April 1: Clustering
Thurs. April 2: Assignment 5 (pdf) is now
available - due April 8
WEEK 9
Project progress meetings with Professor LaPaugh NEXT week - watch Piazza
for sign-up instructions
Mon. April 6: Clustering continued
Wed. April 8: Detecting near-duplicate documents
Thurs. April 9: Assignment 6 (pdf) is now available - due April 15
WEEK 10
Project progress meetings with Professor LaPaugh THIS week - see Piazza
for instructions.
Mon. April 13: Non-text retrieval: image retrieval
- Class presentation slides (pdf): non-text retrieval
- No required reading
- Also of interest - today's
material drawn from these references:
- Query by image and video content: the QBIC system, Flickner, M., et al., IEEE Computer, IEEE
Computer Society, 28(9), pp. 23-32, Sept. 1995.
- Image Similarity Search with Compact Data Structures,
Qin Lv, Moses Charikar, and Kai Li, 13th Conf. on
Information and Knowledge Management (CIKM), ACM, Nov.
2004. (Reports part of Princeton CASS project.)
- Integrating wavelets with clustering
and indexing for effective content-based image
retrieval, E. Yildizer, A. M. Balci, and T. N. Jarada,
Knowledge-Based Systems, Vol. 31, July 2012, Elsevier, pp. 55-66.
- VisualRank:
Applying PageRank to Large-Scale Image Search, Yushi
Jing and Shumeet Baluja, IEEE
Transactions on Pattern Analysis and Machine Intelligence,
30(11), p 1877 - 1890, IEEE, 2008.
- Also of interest - sites to
visit in demo
- Princeton CASS: Content-Aware Search Systems Demos
(click on VARY image)
Wed. April 15: Deep Web Search
- Assignment 6 (pdf) due today! This is the last problem set.
- Class presentation slides (pdf): deep web search
- No required reading
- Also of interest - today's
material drawn from these references:
- Structured
Data on the Web, Michael J. Cafarella, Alon
Halevy, and Jayant Madhavan, Communications of the ACM (CACM),
Vol 54 (2) February 2011, pp 72-79.
- Harnessing the Deep
Web: Present and Future (pdf), Jayant Madhavan,
Loredana Afanasiev, Lyublena Antova, and Alon
Halevy, 4th Biennial
Conference on Innovative Data Systems Research (CIDR),
Jan. 2009.
- Web-scale
extraction
of structured data, Michael J. Cafarella, Jayant
Madhavan, and Alon Halevy, ACM SIGMOD Record, Vol. 37
(4) December 2008.
- Searching
the deep web, Alex Wright, Communications of the ACM, Vol. 51 No. 10
(Oct. 2008), pages 14-15.
- Google's
Deep-Web Crawl (pdf), Jayant Madhavan, David Ko,
Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Y.
Halevy, 34th Intern. Conf. on Very Large Data
Bases, VLDB Endowment, Aug. 2008.
- Crawling deep web entity pages, Yeye He, Dong Xin, Venkatesh Ganti, Sriram
Rajaraman, and Nirav Shah, Proc. Intern. Conf. on Web Search and Data
Mining (WSDM), ACM, 2013, pp. 355-364.
- Accessing
the deep web, Bin He, Mitesh Patel, Zhen Zhang,
and Kevin Chen-Chuan Chang, Communications of the ACM, Vol. 50 No. 5 (May
2007), pages 94-101.
- Searching for Hidden-Web Databases (pdf), Luciano
Barbosa and Juliana Freire, Proceedings of the 8th ACM SIGMOD International
Workshop on Web and Databases (WebDB), pp. 1-6, ACM,
2005. (A more recent, more complicated version of the
crawler is described at the 2007 WWW conf.)
- Towards web-scale structured web data extraction, Tomas Grigalis, Proc. Intern. Conf. on Web
Search and Data Mining (WSDM), ACM,
2013, pp. 753-757.
WEEK 11
Mon. April 20: Extracting
information from Social Networks
- Class presentation slides (pdf): social networks:
intro and search methods
- Reading
- Also of interest
- #TwitterSearch: A Comparison of Microblog Search and
Web Search, Jaime Teevan, Daniel Ramage and
Meredith Ringel Morris, Proc. of the Intern. Conf. on Web
Search and Data Mining (WSDM), ACM, 2011, pp. 35-44.
- What is Twitter, a Social Network or
a News Media?, Haewoon Kwak, Changhyun Lee, Hosung Park,
and Sue Moon, Proc. Intern. Conf. on World Wide
Web (WWW), ACM, 2010, pp. 591-600.
- We Feel Fine and Searching the Emotional
Web, Sepandar D. Kamvar and Jonathan Harris, Proc.
of the Intern. Conf. on Web Search and Data Mining
(WSDM), ACM, 2011, pp. 117-126.
- We Feel
Fine: An Almanac of Human Emotion by Sep Kamvar and
Jonathan Harris, video on YouTube. See also the book We Feel Fine: An Almanac of
Human Emotion by Sep Kamvar & Jonathan Harris,
Scribner, Dec. 2009.
- Learning
from Bullying Traces in Social Media, Jun-Ming Xu,
Kwang-Sung Jun, Xiaojin Zhu and Amy Bellmore, Proc.
Conf. of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies
(NAACL HLT), Assoc. for Computational Linguistics, 2012,
pp. 656-666.
Wed. April 22: Social
Networks: aggregating information and analyzing structure
- Today's slides with exam
and project info
- Class presentation slides (pdf): aggregation and
structure analysis
- Reading on social network structures (see COS435 General Information for online access to
these books)
- Networks, Crowds, and Markets: Reasoning about a Highly Connected
World (Easley and Kleinberg, Cambridge
University Press, July 19, 2010), Chapter 3, Sections 1-5.
- Mining of Massive Data Sets (Rajaraman, Anand;
Leskovec, Jure; Ullman, Jeffrey D.,
Cambridge University Press, 2011),
Chapter 10, Section 2.
- Recommended reading:
- Also of interest:
- An
Experimental Study of the Small World Problem, Jeffrey
Travers and Stanley Milgram, Sociometry, Vol. 32, No.
4, American Sociological Assoc. (Dec., 1969), pp. 425-443.
- Planetary-scale
views on a large instant-messaging network, Jure
Leskovec and Eric Horvitz,
Proc. Intern. Conf. on World Wide Web (WWW), ACM, 2008,
pp. 915-924.
WEEK 12
Mon. April 27: Privacy
Issues in Information Systems
- Class presentation slides (pdf): privacy
- No required reading
- Also of interest for privacy issues
- Netflix Spilled Your Brokeback Mountain Secret, Lawsuit
Claims, Ryan Singel, Wired,
Dec. 17, 2009.
- A
Face Is Exposed for AOL Searcher No. 4417749, Michael
Barbaro and Tom Zeller Jr., The
New York Times, August 9, 2006
- Engineering Privacy, Sarah Spiekermann and Lorrie Faith Cranor, IEEE Transactions on Software
Engineering 35(1), IEEE, pp. 67-82, Jan./Feb. 2009.
- You Might Also Like: Privacy Risks of
Collaborative Filtering, Calandrino,
J.A., Kilzer, A., Narayanan, A., Felten, E.W., and
Shmatikov, V., IEEE
Sym. on Security and Privacy (SP), 2011, pp. 231-246.
- Personalization and privacy: a survey of privacy
risks and remedies in personalization-based systems, Eran Toch, Yang Wang and Lorrie
Faith Cranor, User Modeling
and User-Adapted Interaction, Vol. 22 (1-2), Springer, 2012.
Wed. April 29: Wrap-up
- Class presentation slides (pdf): closing
- No required reading
Second take-home exam distributed Wednesday
April 29, 2015 at the end of class, due 5:30 PM sharp Friday May 1, 2015
Project Report due 5:00 PM Dean's
Date, Tuesday May 12, 2015
Project Demonstrations between May 13
and May 18
Last revised Tue Jun 2 14:31:36 EDT 2015
Copyright 2010, 2011, 2012, 2013, 2014, 2015 Andrea S. LaPaugh