Distributed Queries and Incremental Updates in Information Retrieval Systems (Thesis)

Report ID: TR-458-94
Author: Tomasic, Anthony S.
Date: 1994-06-00
Pages: 157
Download Formats: |Postscript|
Abstract:

The proliferation of the world's ``information highways'' has renewed interest in efficient document indexing techniques. This thesis explores the architecture of information retrieval systems for querying and indexing documents. Distributed queries are studied with analytical and trace-driven simulations. We focus on physical index design, inverted index caching, and database scaling in a distributed system. All three issues influence response time and throughput. Incremental updates of inverted lists are studied using a new dual-structure index data structure. This index structure separates long and short inverted lists dynamically and optimizes the retrieval, update, and storage of each type of list. To study the behavior of the index, engineering trade-offs are described that favor either update time or query performance. We explore these trade-offs quantitatively by using actual data and hardware and simulation to determine the best algorithm under a variety of criteria. Finally, implementation of our incremental update algorithms is compared to an existing information retrieval system.