Hints for Assignment 2

Thu Dec 28 21:08:15 EST 2017

The Awk interpreter reads an awk program, breaks it into syntactic tokens like ++ and while with lex.c, parses it according to the grammar in awkgram.y, generating a parse tree that is then interpreted by functions in run.c, using supporting functions in other source files.

Spend a bit of time figuring out the basic structure of the program. Use grep and the search functions of your editor to find out where things are defined and used. Make small changes and see what happens.
 

Adding the new comment convention requires straightforward fiddling of the lexical analyzer. The lexer has a lookahead mechanism that you can use to test what single character is coming next before actually reading it, so there's no need to back up. That will help you identify // after you see /.
 

htoi is a built-in function, so you have to figure out how existing built-in functions are handled. You might look for a math or string-handling function first, and see how those work. You have to add a new reserved word (where are reserved words defined?), a new defined constant that corresponds to that word (where are existing constants defined?), and new semantics in the same place where the other built-in functions are handled (what function is that?).
 

For repeat-until, you have to find all the places where the new statement requires a change.

Lexical analysis is in lex.c; you're adding two new keywords, so they go into the table. Make sure you heed the comment at the top of the table!

The AWK grammar needs new rules that specify the syntax of a repeat-until statement and creates the right kind of node in the parse tree. There are two places in the grammar proper where something has to be added, plus some declarations at the front. The keyword identified by the lexer, say repeat, is handled as a symbolic constant with a name like REPEAT. Watch out for while vs do-while; they involve the same keyword but have different semantics.

In the grammar, the components of the right hand side of a rule are referred to as $1, $2, etc. Each has a type that is set at the top of the grammar; the possible types are defined in the %union declaration and used later in the grammar to indicate the proper type of an element on the parsing stack. These types keep the C compiler happy. The result of a rule is called $$. By default, $$ is set to $1, but if you want to pass some different value, you have to assign it explicitly, e.g., $$ = $3. In this assignment, you shouldn't have to do anything beyond careful copying.

You need to write a new function that defines the semantics. Find the code that implements the do-while statement, then copy, paste, and edit it carefully. Functions are all declared in proto.h, so you will need to add the name of your new function to that.

AWK parses an awk program and creates a parse tree made up of Nodes in the interior and Cells at the leaves; Cells correspond to values, but they hold a variety of things, identified by type fields and often lied about with casts, which is certainly not a model of good software design. The values in Cells are usually set by setsval and setfval, and retrieved by getsval and getfval, but these ostensible interfaces are often bypassed in an effort to go faster (misplaced efficiency) or to do an operation that is more complicated.

During execution, AWK starts at the root of the parse tree and calls the appropriate semantic routine from run.c for each Node encountered; this is highly recursive since most Nodes have children. The function execute(Node*) is the focal point; it decides what function to call by indexing into an array of function pointers created by maketab in proctab.c and compiled into proctab. Return values from functions determine the results, including some control flow constructs like break that require early termination of loops.

The link between the type of syntactic object (e.g., REPEAT) stored in a node and the function that is called when that node is to be executed is set by a table. The tricky bit is that the table is defined in a C program in maketab.c, and created by compiling and running maketab to make another C file (proctab.c) that is then compiled. This is all handled by the makefile, but you have to remember to update maketab.c or your function won't get called.

Parse tree Nodes are created during parsing by functions named stat1(), stat2(), etc., distinguished by the number of children they have. Repeat-until is really the same as do-while: it has two children, the statements to be repeated and the condition that terminates the loop.

AWK uses an intricate and fragile scheme of temporary Cells to hold variables, constants, and the like during computation; managing those correctly is a pain, so you must follow the existing patterns precisely or you will create a memory leak or worse. These temporary cells are returned by most semantic functions. Find a built-in that is semantically close to htoi and modify its code.

Somewhat the same comments apply to string management. Although there was originally a string abstraction, it's too often violated. But roughly, getsval(Cell*) returns the string stored in the Cell, but it's not your copy, so if you want to mutate it, you have to make a copy (allocating space with tostring()) and work on that. In this assignment, you should not have to do this.

All numbers in Awk are of type Awkfloat, which is a double. The functions getfval() and setfval() handle numeric values for Cells.