Computer Science 435
Information Retrieval, Discovery, and Delivery
Andrea LaPaugh
Spring 2016
EVOLVING: CHECK BACK FOR UPDATES
The reading for a class should be completed before class.
WEEK 1
Mon. Feb. 1: Overview of
course topics and organization. Models of information.
Wed. Feb 3: Foundations: classic information
retrieval of text.
- Assignment 0: register on Piazza if you aren't already and add yourself to COS 435.
- Assignment 1, Part 1 - due 2/10
- Class presentation slides (pdf): classic IR, part 2
- Reading:
- Optional Reading:
WEEK 2
Mon. Feb. 8: Extending
the models. Using links.
- Class presentation slides: link analysis for ranking
- Reading
- Also of interest, original papers:
- (HITS algorithm) Kleinberg, Jon, Authoritative
sources in a hyperlinked environment, Journal of the
ACM, Vol. 46, No. 5(Sept. 1999), pp.604-632.
(Earlier versions appeared in Proc. 9th ACM-SIAM Symposium on Discrete Algorithms,
1998 and as IBM Research Report RJ 10076, May 1997.)
- (PageRank algorithm) Page, Larry and Sergey Brin, R.
Motwani, T. Winograd, The PageRank Citation Ranking:
Bringing Order to the Web, Stanford
Digital Library Technologies Project TR, Jan. 1998.
(Early version: L. Page. PageRank: Bringing order to the web.
Stanford Digital Libraries Working Paper 1997-0072, Stanford
University, 1997. )
Wed. Feb. 10: Evaluation
of retrieval systems
- Assignment 1, Part 2 - due Wed. 2/17
- Class presentation slides (pdf): evaluation (final)
- Reading:
- Optional Reading:
- Also of interest:
WEEK 3
Mon. Feb. 15:
Index structure and use.
Wed. Feb 17: Index structure and use continued;
index construction
Assignment 2 available as of Thurs
2/18, due Wed. 2/24.
- Class presentation slides (pdf):
building and storing
the index
- Reading:
- Optional Reading:
- Recommended reading:
- details on B+ trees in Database Management Systems by
Raghu Ramakrishnan and Johannes Gehrke (Third
Edition, McGraw-Hill, 2003): Chapter 10, Sections 3-6
(pp. 344-356). Book on reserve in Engineering Library.
WEEK 4
Mon. Feb 22: Index compression
Wed. Feb 24: Index
compression continued
Assignment 3 available as of Thurs 2/25, due Wed. 3/2.
- class notes: Compression,
Part 2 (pdf)
- Reading:
- Also of interest
- A. Moffat and J. Zobel, Self-indexing inverted files for fast text
retrieval, ACM Transactions on Information Systems, Vol. 14, No. 4
(Oct. 1996), pp. 349-379.
- The Anatomy of
a Large-Scale Hypertextual Web Search Engine, (pdf from
Stanford
publications collection) Brin, Sergey and Page,
Lawrence, Proceedings of the Seventh International WWW
Conference (WWW 7), 1998.
WEEK 5
Mon. Feb. 29: Distributed computation for index
building and query execution.
- Class presentation slides (pdf): distributed computing
- Reading:
- Optional reading:
- Also of interest
- Web
Search for a Planet: The Google Cluster Architecture,
Luiz Barroso, Jeffrey Dean, and Urs Hölzle, In IEEE
Micro, Vol. 23, No. 2, pages 22-28, March, 2003.
- "Bigtable: A Distributed Storage System for Structured Data",
Fay Chang et al., In 7th USENIX Symposium on Operating Systems
Design and Implementation (OSDI '06), 2006.
- MapReduce: simplified data processing on large clusters, Jeffrey Dean
and Sanjay Ghemawat, Communications of the ACM, 51(1), Jan. 2008.
(Special 50th Anniversary issue: Breakthrough research: a preview of
things to come.)
- The Apache Hadoop project
Wed. Mar. 2: Crawling the Web
- Class presentation slides (pdf): Web crawling
- Reading:
- Also of interest:
- Chakrabarti, Soumen,
Mining the Web: Discovering Knowledge from Hypertext Data,
Elsevier (Morgan Kaufmann Division), 2003. Chapter 2 and
Chapter 8, section 8.3.1
WEEK 6
Mon. Mar. 7:
Detecting near-duplicate documents
- Class presentation slides (pdf): near-duplicate documents
- Reading:
- Also of interest:
- Henzinger, M., Finding
Near-Duplicate Web Pages: A Large-Scale Evaluation of
Algorithms,
Conf. on Research and Development in Information
Retrieval (SIGIR), 2006.
- Manku, G. S., Jain, A., Das Sarma, A., Detecting
Near-Duplicates for Web Crawling, Intern. World
Wide Web Conf. (WWW), 2007.
Wed. Mar. 9: finish up near-duplicate documents
- No new slides
- No new reading
IMPORTANT DEADLINES:
- Take-home midterm exam distributed today, March 9, 2016, at the end
of class. Due Friday, March 11, 2016 (specific time due will be
announced shortly, but no earlier than 3pm).
Spring break
WEEK 7
Mon. March 21: Using user behavior: search refinement and recommender
systems
- Class presentation slides (pdf): search
refinement and personalization
- Reading:
- Also of interest:
- Personalizing
Web Search using Long Term Browsing History, Matthijs
and Radlinski, International Conf. on Web Search and
Data Mining
(WSDM), ACM, 2011.
- A Large-scale Evaluation and Analysis of Personalized Search
Strategies (pdf), Song and Wen, Sixteenth Intern. World Wide Web
Conference (WWW2007), 2007.
- Time is of the Essence: Improving Recency Ranking Using Twitter
Data, Anlei Dong et al., Proc. Intern. Conf. on World Wide Web (WWW),
ACM, 2010, pp. 331-340.
- The Adaptive Web, P.
Brusilovsky, A. Kobsa, W. Nejdl, eds., Lecture Notes in Computer
Science book series Vol 4321, Springer, 2007.
This book contains several relevant chapters. Chapter 6: Personalized Search on the World
Wide Web by A. Micarelli, F. Gasparetti, F. Sciarrone
and S. Gauch is of particular interest. The chapters are
available as pdf files to members of the Princeton University
community by accessing them from the princeton.edu domain.
- Google: Bing Is Cheating, Copying Our Search Results by Danny
Sullivan, Feb 1, 2011 at 8:45am ET and Bing Admits Using Customer
Search Data, Says Google Pulled ‘Spy-Novelesque Stunt’ by Matt
McGee, Feb 1, 2011 at 1:56pm ET, both on Search Engine Land.
Wed. March 23: Recommender
systems: collaborative filtering
Assignment 4
available as of Thurs 3/24, due
Wed. 3/30.
- Class presentation slides (pdf): recommender
systems and search
- Reading:
- Also of interest:
- Matrix
Factorization Techniques for Recommender Systems, Koren,
Bell and Volinsky, IEEE Computer, 42(8), August 2009,
pp. 42-49.
- Modeling Relationships at Multiple Scales to Improve Accuracy of
Large Recommender Systems, Bell, Koren and Volinsky, International
Conf. on Knowledge Discovery and Data Mining (KDD), ACM 2007.
- Scalable
Collaborative Filtering with Jointly Derived
Neighborhood Interpolation Weights, Bell and
Koren, IEEE International
Conference on Data Mining, 2007.
- Netflix
Awards $1 Million Prize and Starts a New Contest by
Steve Lohr, New York Times'
Bits Blog, Sept. 21 2009. (The new contest
was canceled due to privacy concerns.)
WEEK 8
Mon. March 28: Collaborative
filtering, continued; Latent semantic indexing
Wed. March 30: Clustering
WEEK 9
Project progress meetings with Professor LaPaugh NEXT week - watch
Piazza for sign-up instructions
Mon. April 4: Clustering continued
Wed. April 6: Social Networks: searching
- Problem Set 6 available as of Thurs 4/7, due April 13! This is the
last problem set.
- Class presentation slides (pdf): social networks, Part 1
- Reading
- Also of interest
- #TwitterSearch: A Comparison of Microblog Search and Web Search,
Jaime Teevan, Daniel Ramage and Meredith Ringel Morris, Proc. of the
Intern. Conf. on Web Search and Data Mining (WSDM), ACM, 2011, pp.
35-44.
- What is Twitter, a Social Network or a News Media?, Haewoon Kwak,
Changhyun Lee, Hosung Park, and Sue Moon, Proc. Intern. Conf. on World
Wide Web (WWW), ACM, 2010, pp. 591-600.
- We Feel Fine and Searching the Emotional Web, Sepandar D. Kamvar and
Jonathan Harris, Proc. of the Intern. Conf. on Web Search and Data
Mining (WSDM), ACM, 2011, pp. 117-126.
- We Feel
Fine: An Almanac of Human Emotion by Sep Kamvar and
Jonathan Harris, video on YouTube. See also the book We Feel Fine: An Almanac of
Human Emotion by Sep Kamvar & Jonathan Harris,
Scribner, Dec. 2009.
- Learning
from Bullying Traces in Social Media, Jun-Ming Xu,
Kwang-Sung Jun, Xiaojin Zhu and Amy Bellmore, Proc.
Conf. of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies
(NAACL HLT), Assoc. for Computational Linguistics, 2012,
pp. 656-666.
WEEK 10
Project progress meetings with Professor LaPaugh THIS week - see
Piazza for sign-up instructions
Mon. April 11
Social Networks: aggregating information and analyzing structure
- Class presentation slides (pdf): aggregation and
structure analysis
- Reading on social network structures (see COS435 General Information for online access to
these books)
- Networks, Crowds, and Markets: Reasoning about a Highly Connected
World (Easley and Kleinberg, Cambridge University Press, July 19,
2010), Chapter 3, Sections 1-5.
- Mining
of Massive Data Sets. (Rajaraman, Anand;
Leskovec,
Jure; Ullman, Jeffrey D, Cambridge University Press. 2011),
Chapter 10, Section 2.
- Recommended reading:
- Also of interest:
- An
Experimental Study of the Small World Problem, Jeffrey
Travers and Stanley Milgram, Sociometry, Vol. 32, No.
4, American Sociological Assoc. (Dec., 1969), pp. 425-443.
- Planetary-scale
views on a large instant-messaging network, Jure
Leskovec and Eric Horvitz, Proc. Intern. Conf. on World Wide
Web (WWW), ACM, 2008, pp. 915-924.
Wed. April 13: Non-text retrieval: image retrieval
- Class presentation slides (pdf): non-text retrieval
- No required reading
- Also of interest - today's
material drawn from these references:
- Query by image and video content: the QBIC system, Flickner, M.,
et al., IEEE Computer, IEEE Computer Society, 28(9), pp. 23-32, Sept.
1995.
- Image
Similarity Search with Compact Data Structures,
Qin Lv, Moses Charikar, and Kai Li. 13th Conf. on
Information and Knowledge Management (CIKM), ACM, Nov.
2004. (Reports part of Princeton CASS project.)
- Integrating wavelets with clustering and indexing for effective
content-based image retrieval, E Yildizer, AM Balci, and TN Jarada,
Knowledge-Based Systems, Vol. 13, July 2012, Elsevier, pp. 55-66.
- VisualRank:
Applying PageRank to Large-Scale Image Search, Yushi
Jing and Shumeet Baluja, IEEE
Transactions on Pattern Analysis and Machine Intelligence,
30(11), p 1877 - 1890, IEEE, 2008.
- Also of interest - sites to
visit in demo
- Princeton CASS: Content-Aware Search Systems Demos
click on VARY image
WEEK 11
Mon. April 18: Semi-structured information and
XML
Wed. April 20: Deep Web Search
- Class presentation slides (pdf): deep web search
- No required reading
- Also of interest - today's
material drawn from these references:
- Structured
Data on the Web, Michael J. Cafarella, Alon
Halevy, and Jayant Madhavan, Communications of the ACM (CACM),
Vol 54 (2) February 2011, pp 72-79.
- Harnessing the Deep
Web: Present and Future (pdf), Jayant Madhavan,
Loredana Afanasiev, Lyublena Antova, and Alon
Halevy, 4th Biennial
Conference on Innovative Data Systems Research (CIDR),
Jan. 2009.
- Web-scale extraction of structured data, Michael J. Cafarella,
Jayant Madhavan, and Alon Halevy, ACM SIGMOD Record, Vol. 37 (4),
December 2008.
- Searching
the deep web, Alex Wright, Communications of the ACM, Vol. 51 No. 10
(Oct. 2008), pages 14-15.
- Google's
Deep-Web Crawl (pdf), Jayant Madhavan, David Ko,
Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, and Alon Y.
Halevy, 34th Intern. Conf. on Very Large Data
Bases, VLDB Endowment, Aug. 2008.
- Crawling deep web entity pages, Yeye He, Dong Xin, Venkatesh Ganti, Sriram
Rajaraman, and Nirav Shah, Proc. Intern. Conf. on Web Search and Data
Mining (WSDM), ACM, 2013, pp.
355-364.
- Accessing
the deep web, Bin He, Mitesh Patel, Zhen Zhang,
and Kevin Chen-Chuan Chang, Communications of the ACM, Vol. 50 No. 5 (May
2007), pages 94-101.
- Searching
for Hidden-Web Databases (pdf), Luciano
Barbosa and Juliana Freire, Proceedings of the 8th ACM SIGMOD International
Workshop on Web and Databases (WebDB), pp. 1-6, ACM
2005. (A more recent, more complicated version of the
crawler is described at the 2007 WWW conf.)
- Towards web-scale structured web data extraction, Tomas Grigalis, Proc. Intern. Conf. on Web
Search and Data Mining (WSDM), ACM,
2013, pp. 753-757.
WEEK 12
Mon. April 25:
Visualizing Information; Privacy Issues in Information Systems
- Class presentation slides (pdf): visualization and
privacy
- No required reading
- Also of interest for visualization - present overview of
this material today:
- Also of interest for privacy issues
- Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims,
Ryan Singel, Wired, Dec. 17, 2009.
- A
Face Is Exposed for AOL Searcher No. 4417749, Michael
Barbaro and Tom Zeller Jr., The
New York Times, August 9, 2006
- Engineering
Privacy, Sarah Spiekermann and Lorrie Faith Cranor, IEEE Transactions on Software
Engineering 35(1), IEEE, pp. 67-82, Jan./Feb. 2009.
- You Might Also Like: Privacy Risks of
Collaborative Filtering, Calandrino,
J.A, Kilzer, A., Narayanan, A., Felten, E.W., and
Shmatikov, V., IEEE
Sym. on Security and Privacy (SP), 2011, pp. 231 -
246.
- Personalization and privacy: a survey of privacy
risks and remedies in personalization-based systems, Eran Toch, Yang Wang and Lorrie
Faith Cranor, User Modeling
and User-Adapted Interaction, Vol. 22 (1-2), Springer,
2012.
Wed. April 27: wrap-up
- Class presentation slides (pdf): closing
- No required reading
Second take-home exam distributed Wednesday, April 27, 2016, at the
end of class, due 3:30 PM sharp Friday, April 29, 2016
last revised Tue May 24 15:44:44 EDT 2016
Copyright 2010, 2011, 2012, 2013, 2014, 2015, 2016 Andrea S. LaPaugh