Hints for Assignment 2

Sat Feb 6 13:40:24 EST 2010

You have to find all the places where the new features require a change. That includes awkgram.y for anything that changes the grammar, like repeat-until, but not for adding a new built-in function like htoi. Functions are all declared in proto.h, so you will need to add the names of any new functions to that. Lexical analysis is in lex.c; you're adding some new keywords, so they go into the table. Make sure you heed the comment at the top of the table!

For the first part, the AWK grammar needs new rules that specify the syntax of a repeat-until statement and creates the right kind of node in the parse tree. There is a new keyword that lexical analysis has to recognize and presumably a new function to be written and added to the program; that function defines the semantics. Find the code that handles do-while, then copy, paste, and edit carefully. There are two places in the grammar proper where something has to be added, plus some declarations at the front. Watch out for while vs do-while.

For the second part, htoi, there are no grammar changes but there is a new reserved word, a defined constant, and you have to add new semantics in the right place, in the same place where the other built-in functions are handled.

The third part is straightforward fiddling of the lexical analyzer. The lexer has a lookahead mechanism that you can use to determine what single character is coming next before actually reading it.

In the grammar, the components of the right hand side of a rule are referred to as $1, $2, etc. Each has a type that is set at the top of the grammar. These types keep the C compiler happy. The result of a rule is called $$. By default, $$ is set to $1, but if you want to pass some different value, you have to assign it explicitly, e.g., $$ = $3.

Some constants are defined in the grammar, including most of the language keywords like WHILE. Others, notably the names of built-in functions, are defined in awk.h. There is no good reason for this difference and probably never was, but it's now enshrined. This affects where you have to make changes.

Basically, AWK parses an awk program and creates a parse tree made up of Nodes in the interior and Cells at the leaves; Cells correspond to values, but they hold a variety of things, identified by type fields and often lied about with casts. This is certainly not a model of good software design. The values in Cells are usually set by setsval and setfval, and retrieved by getsval and getfval, but these ostensible interfaces are often bypassed in an effort to go faster (misplaced efficiency) or to do an operation that is more complicated.

During execution, AWK starts at the root of the parse tree and calls the appropriate semantic routine from run.c for each Node encountered; this is highly recursive since most Nodes have children. The function execute(Node*) is the focal point; it decides what function to call by indexing into an array of function pointers created by maketab in proctab.c and compiled into proctab. Return values from functions determine the results, including some control flow constructs like break that require early termination of loops.

Parse tree Nodes are created during parsing by functions named stat1(), stat2(), etc., distinguished by the number of children they have. Repeat-until is really the same as do-while: it has two children, the statements to be repeated and the condition that terminates the loop. Some constructs of variable length use a linked list of Nodes; this includes argument lists for functions, both built-in and user-defined. In such cases, one finds the arguments by following the list. Atan2 provides a minimal example in the built-in function category, though it's not needed here since htoi takes only one argument; the int function is a better example.

The link between the type of syntactic object (e.g., REPEAT) stored in a node and the function that is called when that node is to be executed is set by a table. The tricky bit is that the table is defined in a C program in maketab.c, and created by compiling and running maketab to make another C file (proctab.c) that is then compiled. This is all handled by the makefile, but you have to remember to update maketab.c or your function won't get called.

AWK uses an intricate and somewhat flaky scheme of temporary Cells to hold variables, constants, and the like during computation; managing those correctly is a pain, so you should be careful to follow the existing patterns precisely or you will create a memory leak or worse. These temporary cells are returned by most semantic functions. Find a built-in that is semantically close to htoi and modify its code.

Somewhat the same comments apply to string management. Although there was originally a string abstraction, it's too often violated. But roughly, getsval(Cell*) returns the string stored in the Cell, but it's not your copy, so if you want to mutate it, you have to make a copy (allocating space with tostring()) and work on that. In this problem, you should not have to do this.

All numbers in Awk are of type Awkfloat, which is a double. The functions getfval() and setsval() update the numeric value stored in a Cell.