CS 597C: Scalable Internet Services
Papers and Resources
SEARCH AND
CRAWL
Bharat, K. and Broder,A. 1998. A technique for measuring the relative size and overlap of public Web search engines. Proc. 7th Int. World Wide Web Conf., Brisbane, Australia, pp. 379–388.
Lawrence, S. and Giles, C. L. 1998b. Searching the World Wide Web. Science 280, 98–100.
Lawrence, S. and Giles, C. L. 1999a. Acccessibility of information on the Web. Nature 400, 107–109.
BrightPlanet. The Deep Web: Surfacing Hidden Value,
Chris Sherman and Gary
Price.The Invisible Web. CyberAge
Books, 2001.
Text
Indexing
Witten, I. H., Moffat, A. and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd edn. San Francisco, CA: Morgan Kaufmann. Inverted text indexes, including secondary storage algorithms.
Harman, D., Baeza-Yates, R., Fox, E. and Lee,W. 1992 Inverted files. In Information Retrieval, Data Structures and Algorithms (ed. W. B. Frakes and R. A. Baeza-Yates), pp. 28–43. Englewood Cliffs, NJ: Prentice Hall. Dealing with Boolean queries with multiple terms.
Moffat, A. and Zobel, J. 1996 Self-indexing inverted files for fast text retrieval. ACM Trans. Informat. Syst. 14, 349–379. Index compression methods, as well as algorithms for query evaluation and ranking.
Melnik, S., Raghavan, S.,Yang, B. and Garcia-Molina, H. Building a distributed full-text index for the Web. ACM Trans. Informat. Syst. 19, 217–241, 2001. Distributed index design for Web Documents.
Other
methods for text indexing (not so scalable or in vogue)
Manber, U. and Myers, G. 1990. Suffix arrays: a new method for on-line string searches. Proc. 1st Ann. ACM–SIAM Symp. on Discrete Algorithms, pp. 319–327. Philadelphia, PA: Society for Industrial and Applied Mathematics.
Faloutsos, C. and Christodoulakis, S. 1984. Signature files: an access method for documents and its analytical performance evaluation. ACM Trans. Informat. Syst. 2, 267–288.
Document Preparation and Tokenization
Neville-Manning, C. and Reed, T. 1996. A PostScript to plain text converter. Technical report. (Available from http://www.nzdl.org/html/prescript.html.)
Pstotext: http://research.compaq.com/SRC/virtualpaper/home.html
http://www.w3.org/Library and http://search.cpan.org/dist/HTML-parser HTML parsers
Fox, C. 1992 Lexical analysis and stoplists. In Information Retrieval: Data Structures and Algorithms (ed. W. B. Frakes and R. Baeza-Yates), ch. 7. Englewood Cliffs, NJ: Prentice Hall.
Porter, M. 1980 An algorithm for suffix stripping. Program 14, 130–137. Stemming algorithm for English. Code available at http://www.tartarus.org/~martin/PorterStemmer
Link Connectivity Indexes
Krishna Bharat, Andrei Broder, Monika R. Henzinger,
Puneet Kumar, and Suresh Venkatasubramanian.
The Connectivity Server: Fast Access to
Linkage Information on the Web. Computer Networks,
30(1--7):469--477, 1998.
Sriram Raghavan and Hector
Garcia-Molina. Representing Web Graphs.
Proceedings of the 19th IEEE International Conference on Data Engineering
(ICDE03), pages 405-416, March 2003.
K.H. Randall, R. Stata, R.G. Wickremesinghe, and J. L. Wiener. The Link Database: Fast Access to Graphs of the Web. Research Report 175, Compaq Systems Research Center, Palo Alto, CA, 2001.
PUBLISH-SUBSCRIBE
The Jabber Programmers
Guide: A Comprehensive Snapshot of Jabber
BEEP-APEX Overview: Relationship of Peer-to-Peer Communication to BEEP
BEEP
(Blocks Extensible Exchange Protocol)
On the Design of Application Protocols (Rationale for
BEEP)
Technical WhitePaper: A Framework for Next-Generation Application
Protocols
Mapping
the BEEP Core onto TCP: RFC
HTTP-based
Protocols:
Issues in
Layering Protocols on HTTP
A Response to "Issues in Layering Protocols on HTTP"
HTTP-NG Rationale (Defunct Effort)
Event Notification Scenarios Using HTTP
Event Notification Scenarios (ppt)
RVP: A Presence and Instant Messaging Protocol (IM Protocol
used in MSN Messenger)
GENA: Generalized Event Notification over HTTP (ppt)
KnowNow White Papers:
KnowNow Microserver for Windows: Programmer’s Guide
PubSub on Jabber: Internet Draft
firstRain Open PubSub Specification: Internet Draft
JMS PubSub (contained within this JMS Specification)
Content Based Routing with Elvin4
KnowNow Event Router Programmer's Guide
Design and Evaluation of a Wide-area Event Notification Service
An Efficient Multicast Protocol for Content-Based Publish-Subscribe
Systems
The JEDI Event-Based Infrastructure and its Application to the
Development of the OPSS WFMS
A Hierarchical
Proxy Architecture for Internet-Scale Event Services
Efficient
Matching for Content-based Publish/Subscribe Systems
Filtering Algorithms and Implementations for Very Fast Publish/Subscribe
Systems
Fast Forwarding for Content-Based Networking
Matching Events
in a Content-based Subscription System
Efficient Filtering of XML Documents for Selective Dissemination of
Information
Efficient Filtering of XML Documents with XPath Expressions
Efficient Filtering in Publish-Subscribe Systems using Binary Decision
Diagrams
Mesh-Based
Content Routing using XML
Supporting public availability and accessibility with Elvin: Expriences
and reflections
An Infrastructure for Meta-Auctions (.ps.gz)
Content based
routing as the basis for intra-agent communication
Hermes - A
Notification Service for Digital Libraries
Publish/Subscribe in a Mobile Environment
Using a
publish/subscribe middleware to support mobile computing
Publish-Subscribe Projects