Princeton University
Computer Science Dept.

Computer Science 425
Database Systems

Andrea LaPaugh

Spring 2001

Data Format for Project

We will give you sample data for the core part of the project in the following format. Your software is not required to read this format, but if it does not, you will have to hand-enter the data. This is not really so bad since we won't be giving you very big examples. The main purpose of defining a format is to make precise what constitutes an instance.

Format for relational catalog information

The relations for an instance will be specified in a simple ascii file, one line per relation. A line contains the following information, separated by spaces:

The indexes for an instance will be specified in a second ascii file, one line per index. A line contains the following information, separated by spaces: The number of buffers also needs to be specified; this will be given as a single positive integer ascii file.

Format of query specification

Finally, the query to be optimized needs to be specified. Note that many different queries can be optimized given a fixed set of relations, indexes, and buffers. Again we use a simple ascii file. Since the core asks you to consider only queries that consist of a collection joins with one equality constraint, we need only specify the relations and single attributes per relation to specify a query. We will specify a sequence of such joins as a list, which may be interpreted with left-to-right parenthesization. (Of course your query optimizer will determine the actual order in which the joins will be executed.) Note that except for the first join, the left-hand argument to a join will be the previously defined relation and the attribute will be in terms of the attributes of that relation, i.e. a combination of all the attributes of the joins so far. IMPORTANT NOTE: Equijoins, as defined in the textbook, eliminate the duplicate copy of the equijoin attribute. However, for the sake of simplicity, we will refer to attributes without duplicate elimination. This, of course can be converted to a specification in terms of the attribute positions with duplicate join attributes eliminated. (Note that this issue of duplicate attribute elimination also affects the estimate of the size (in pages) of the result of a join because eliminating the duplicate attribute shrinks the size of each tuple. You need not take this shrinkage into account when estimating the size of the relation resulting from a join. You may take this shrinkage into account in your implementation of the core, in which case it will be considered a small "extra" since the estimation equations become slightly more complicated.)

The form of our query specification is a sequence of lines, two per join:

For example, consider the following relations:
1: relation employee with attributes (name, street, city)
2: relation company  with attributes (co_name, city)
3: relation works with attributes (name, co_name, salary)
4: relation parents with attributes (name_mother, name_father, name_child)
The query:
1 1
3 1
5
2 1
1
4 3
corresponds to (JOIN on "name = name_child" ((JOIN on "co-name" (JOIN on "name" (employee, works)), company), parents))

Format of output of query optimizer

For any given query, your query optimizer must find the lowest-cost plan using the left-deep plan optimization algorithm. Specifying the plan requires stating the order of joins (relations and equality attributes involved) and for each join, the join algorithm to be used, including any indexes to be used. The estimated cost of this plan and whether any fields are sorted in the output relation should also be given. Also, there may be plans which are not lowest cost but result in the final relation being output in sorted order for some field as a by-product of the join evaluation, i.e. without sorting on the field after the joins are evaluated. For each field that can be sorted as a by-product of an evaluation plan, the lowest cost such plan, with its cost, should be identified. These results should be output in some human readable form, since we have to read it to see the results of your implementation. Note that, as with all of our discussions of query evaluation, "cost" refers to disk I/O cost.

Formats for extensions

Since any extensions are of your choosing, you may use any format you like. Just be sure we can understand the problems being solved and the results.


A.S. LaPaugh Tue Apr 10 22:47:23 EDT 2001