COS 425, Spring 2001: Project overview

Princeton University
Computer Science Dept.

Computer Science 425
Database Systems

Andrea LaPaugh

Spring 2001

Project Overview

The goal of the course project is to have you implement and evaluate some aspect of a database management system. The particular task you will implement is query optimization. You must implement software that, given a query and a catalog telling what indexes are available for each relation, constructs a plan for computing the results of the query represented by the parse tree. It is highly recommended that you do the project in teams of 2 students, submitting one report for the team, but you are allowed to submit individual projects. The project will consist of a core part and open-ended extensions. Everyone must do the core. Whether you choose to go further, and in what way, is your choice ("you" being each pair of students doing a project together). Obviously, you will be graded both on how well you accomplish the core project and what extensions you do. Thus there is an aspect of quantity to the grade, but not at the expense of quality. Keep in mind that evaluation is an important part of this project. Implementing lots of functions without evaluating the software is not desirable.

The Core

The core of the project is to implement and evaluate the left-deep plan optimization algorithm presented in section 14.4.2 of your text (Ramakrishnan and Gehrke) but limited to queries that consist only of a collection of joins with equality conditions (equijoins) -- one equality per join. The algorithm must have access to the following data, normally found in database catalogs: the set of relations and the set of indexes available for the relations (type of index and attribute or attributes serving as search key for the index). For each relation, the size in records and in pages is given. For the core of the project, the only types of index you need to consider are B+ trees with one attribute as search key and hash functions with one attribute as search key. For each hash index, the following parameters are specified: the number of distinct search key values, the minimum and maximum present search key values, and the number of pages for the index. For each B+ tree index, the following parameters are specified: the number of distinct search key values, the minimum and maximum present search key values, the number of non-leaf levels, and the number of leaf pages for the index. (See Section 13.2.1 of the text for a discussion of these parameters.) Indices will also be either Alternative 1 or 2, and either clustered or unclustered. The input to the optimizer is the list of joins making up the query. To calculate the performance estimates for different join algorithms, you will also need the number of buffers available in main memory.

Evaluation of Core Algorithm

Not only must you implement the "core" algorithm described above, but you must evaluate your implementation. As always, the major criteria are correctness and efficiency. You are required to design inputs, both catalog information and queries, to test your design. How do you determine that your software is correctly optimizing the query? What situations stress the efficiency of your algorithm? How many joins can you optimize in a reasonable amount of time? (Reasonable, of course, depends on how many disk page I/O's you save by doing the optimization, but let us assume fairly substantially sized relations -- in the 100's to 1000's of pages. Note that even 100 pages for each of two relations gives us a join that takes more than a minute if done inefficiently.)

Presentation

You are required to submit a report that describes the implementation of your algorithm, how you tested it and the test results. Your code should be in an appendix. To give you the greatest flexibility, you can write in any language you choose. You may also use any input method you like and any data structures you like. We will supply some sample data in a simple text file format that will be easy to read in by your programi, but you are not bound to use that format for input: you may convert the samples to your own format. Furthermore, the samples are simply that, samples. Running your code on the samples does not constitute a test of your implementation.

Each group must do a project demonstration where you will be given (small) problems to solve in real time. Your input method must allow you to quickly input a small problem (e.g. 5 relations, 5 indexes, 5 joins), and your software must produce an output report that is easily interpreted. Since you need to do this real-time demonstration, you must have your software available so that you can run it for us either from a UNIX account (CIT or CS dept.) an NT platform networked to your CIT or CS account, or from a laptop that you bring to the demonstration. Note that we will ask you ahead of time how many joins your algorithm can handle in a few seconds or less and will not ask you to do a problem larger than your algorithm can handle.

Extensions

If after you have implemented and tested the core, you are ready for more, there are many extensions possible. Some are listed below. If you have thought of another that you are interested in pursuing, talk to Professor LaPaugh to determine if it is appropriate. Before deciding to pursue an extension, keep two things in mind: 1)A well-implemented and tested core is the most important part of this project. 2)The goal of this project is not to earn the course the title "death databases". The project is open-ended so that it will be interesting and challenging to students regardless of background. If you have any doubt as to whether you should pursue an extension, talk to Professor LaPaugh. Some possible extensions:

Obviously the core handles a limited set of queries. Extend the set of queries that can be handled. This should be done in small increments. For example one can allow limited select operations in addition to the joins; one can allow joins with more than one equality; one can allow indexes on more than one attribute. Clearly there are many many steps in going from the core to a full set of queries with arbitrary indexes.
Explore the possibility of heuristic algorithms that are faster than the left-deep algorithm. For example, what if one determines the order in which the joins should be done by some other heuristic rule? Design such an algorithm and compare it to the left-deep algorithm on the same test cases, both in quality of solution and in running time.

Discussion of data formats

A.S. LaPaugh Mon Apr 9 23:06:05 EDT 2001

Princeton University Computer Science Dept.

Computer Science 425 Database Systems Andrea LaPaugh