Hints for Assignment 2

You might have to change the yacc flags in the makefile; the -S option seems to no longer be required (or legal) on the CS machines. If you compile on OIT systems like nobel, you will have to use Bison; set the flag for that.

You have to find all the places where the new features require a change. That includes awkgram.y for anything that changes the grammar, such as the new for statement. Functions that are used in more than one file are declared in proto.h, so you will need to add the names of any new functions to that; there will be one for the new for statement. Lexical analysis is in lex.c; new keywords must be put into the keyword table there. Make sure you heed the comment at the top of the table!
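One common reason a keyword table carries a warning is that it is binary-searched and therefore has to stay sorted. Whether that is what the comment in lex.c says is for you to check; the sketch below only illustrates the general idea. The Keyword struct, the token values, and the lookup function are all hypothetical and are not copied from the awk source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical keyword entry; the real table in lex.c has its own layout. */
typedef struct Keyword {
    const char *word;
    int token;
} Keyword;

/* Made-up token values; the real ones are generated from the grammar. */
enum { BY = 300, FOR, FROM, TO, WHILE };

/* Must stay in strcmp order, or bsearch will miss entries. */
static const Keyword keywords[] = {
    { "by",    BY    },
    { "for",   FOR   },
    { "from",  FROM  },
    { "to",    TO    },
    { "while", WHILE },
};

/* bsearch comparison: the key is a string, the element is a Keyword. */
static int cmp_keyword(const void *key, const void *elem)
{
    return strcmp((const char *)key, ((const Keyword *)elem)->word);
}

/* Return the token for word, or -1 if it is not a keyword. */
int lookup(const char *word)
{
    assert(word != NULL);
    const Keyword *k = bsearch(word, keywords,
                               sizeof keywords / sizeof keywords[0],
                               sizeof keywords[0], cmp_keyword);
    return k != NULL ? k->token : -1;
}
```

With a table like this, inserting a new keyword out of order compiles cleanly but makes some lookups fail silently, which is why such a comment is worth heeding.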

For the new for statement, the Awk grammar needs new rules that specify its syntax and create the right kind of node in the parse tree. Count the nodes! You will have to add two trivial functions to parse.c to handle nodes with more children, and those too have to be included in proto.h.

There are three new keywords that lexical analysis has to recognize, and a new function to be written and added to run.c; that function defines the semantics and will be parallel to the functions that handle the existing for statements.

In the grammar, the components of the right hand side of a rule are referred to as $1, $2, etc. Each has a type that is set at the top of the grammar. These types keep the C compiler happy. You will have to add types for the new kind of for itself, and for from, to and by. Since these new ones have no precedence or associativity, you can add them as %token types.

The result of a rule is called $$. By default, $$ is set to $1, but if you want to pass some different value, you have to assign it explicitly, e.g., $$ = $3. There is no need for this in this specific assignment.

In run.c, find the function that handles the for statement, then copy, paste, and edit carefully. The new function is actually a little simpler. You do not need to worry about losing memory; the tempcell mechanism is baroque and not worth the effort of dealing with it properly. Just call execute to get the Cell you need, then call getfval to get the value. The type is Awkfloat.

Some constants are defined in the grammar, including ones for the language keywords like WHILE. Others, notably the names of built-in functions, are defined in awk.h. There is no good reason for this difference and probably never was, but it's now enshrined. This affects where you have to make changes.

Awk parses an Awk program and creates a parse tree made up of Nodes in the interior and Cells at the leaves. Cells correspond to values, but they hold a variety of things, identified by type fields and often lied about with casts. This is certainly not a model of good software design. String values in Cells are usually set by setsval and retrieved by getsval; numeric values are set by setfval and retrieved by getfval. These ostensible interfaces are often bypassed in an effort to go faster (misplaced efficiency) or to do an operation that is more complicated. All numbers in Awk are of type Awkfloat, which is a double.

Parse tree nodes are created during parsing by functions named stat1(), stat2(), etc., distinguished by the number of children they have. The new for has more children than any existing statement; you have to add new functions node5 and stat5.
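As a rough sketch of what node5 and stat5 might look like, here is a self-contained version built on a simplified Node. The struct layout, the NSTAT value, and the use of calloc are assumptions made for illustration; model the real functions on the existing four-child versions in parse.c rather than on this.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for awk's Node; the real declaration in awk.h differs. */
typedef struct Node {
    int ntype;             /* kind of node, e.g. statement or expression */
    int nobj;              /* syntactic object, e.g. the token for the new for */
    struct Node *narg[5];  /* children; fixed at five only for this sketch */
} Node;

enum { NSTAT = 1 };        /* hypothetical marker for a statement node */

/* Build a node of kind a with five children. */
Node *node5(int a, Node *b, Node *c, Node *d, Node *e, Node *f)
{
    Node *x = calloc(1, sizeof *x);
    assert(x != NULL);
    x->nobj = a;
    x->narg[0] = b;
    x->narg[1] = c;
    x->narg[2] = d;
    x->narg[3] = e;
    x->narg[4] = f;
    return x;
}

/* Same as node5, but mark the result as a statement. */
Node *stat5(int a, Node *b, Node *c, Node *d, Node *e, Node *f)
{
    Node *x = node5(a, b, c, d, e, f);
    x->ntype = NSTAT;
    return x;
}
```

Both functions need prototypes in proto.h so that the grammar file can call them.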

The link between the type of syntactic object (e.g., FOR) stored in a node and the function that is called when that node is to be executed is set by a table. The tricky bit is that the table is defined in a C program in maketab.c, and created by compiling and running maketab to dynamically create another C file proctab.c that is then compiled. This is all handled by the makefile, but you have to remember to update maketab.c or your new function won't get called.

During execution, awk starts at the root of the parse tree and calls the appropriate semantic routine from run.c for each Node encountered; this is highly recursive since most Nodes have children. The function execute(Node*) is the focal point; it decides what function to call by indexing into an array of function pointers that maketab generates in proctab.c, which is then compiled into awk. Return values from functions determine the results, including some control flow constructs like break that require early termination of loops.
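The dispatch idea can be seen in miniature in the toy evaluator below. It is only a sketch: the token values, the two-child Node, the double return type, and the hand-written proctab are all assumptions made for illustration, whereas the real table is generated by maketab and the real semantic functions return Cells.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up tokens; in awk these come from the grammar. */
enum { FIRSTTOKEN = 257, ADD = FIRSTTOKEN, NUMBER };

typedef struct Node {
    int nobj;               /* which syntactic object this node is */
    double value;           /* used only by NUMBER leaves in this toy */
    struct Node *narg[2];   /* children */
} Node;

/* A semantic function gets the children and the node's type. */
typedef double (*Semantic)(Node **a, int n);

double execute(Node *u);

/* Semantic routine for ADD: evaluate both children recursively. */
static double add(Node **a, int n)
{
    (void)n;
    return execute(a[0]) + execute(a[1]);
}

/* Hand-written here; indexed by token number minus FIRSTTOKEN. */
static const Semantic proctab[] = {
    add,    /* ADD */
};

/* Leaves yield their value; interior nodes dispatch through the table. */
double execute(Node *u)
{
    assert(u != NULL);
    if (u->nobj == NUMBER)
        return u->value;
    return proctab[u->nobj - FIRSTTOKEN](u->narg, u->nobj);
}
```

If a token has no entry in the table, nothing in this scheme will ever call its function, which is the toy analogue of forgetting to update maketab.c.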

In run.c, semantic functions are called with a type and an array corresponding to the nodes for that type, uniformly called a[0], a[1], etc. The new for will have five of those. You should create a new function analogous to forstat to handle the new for statement.

Awk uses an intricate and somewhat flaky scheme of temporary Cells to hold variables, constants, and the like during computation; managing those correctly is a pain, so you should be careful to follow the existing patterns precisely or you will create a memory leak or worse. These temporary cells are returned by most semantic functions. Use the code for the existing for as a model, but create a new function rather than trying to fit your new code into the existing forstat. Your new code should actually be simpler, because you can set the loop limits at the beginning and then just forget about them.
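To illustrate what "set the loop limits at the beginning and then forget about them" can mean, here is a toy loop outside awk entirely. Everything in it is an assumption: the inclusive upper limit, the handling of a negative step, and the callback standing in for the loop body. In the real function the three values would come from calling execute and getfval on the appropriate a[] entries, and break, continue, and the loop variable's Cell would also need handling, which this sketch ignores.

```c
#include <assert.h>

typedef double Awkfloat;    /* the source says Awkfloat is a double */

/* Hypothetical stand-in for executing the loop body once. */
typedef void (*Body)(Awkfloat i, void *ctx);

/* Example body: add the loop variable into a running total. */
void add_to_sum(Awkfloat i, void *ctx)
{
    *(Awkfloat *)ctx += i;
}

/*
 * Run body for i = from, from+by, ... up to and including to.
 * The limits are plain values captured once by the caller;
 * nothing is re-evaluated on each iteration.
 * Returns the number of iterations performed.
 */
int counted_loop(Awkfloat from, Awkfloat to, Awkfloat by, Body body, void *ctx)
{
    int trips = 0;
    assert(by != 0);        /* a zero step would never terminate */
    for (Awkfloat i = from; by > 0 ? i <= to : i >= to; i += by) {
        body(i, ctx);
        trips++;
    }
    return trips;
}
```

Contrast this with the existing for, where the condition and increment are expressions that have to be executed again on every pass.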

Somewhat the same comments apply to string management. Although there was originally a string abstraction, it's too often violated. But roughly, getsval(Cell*) returns the string stored in the Cell, but it's not your copy, so if you want to mutate it, you have to make a copy (allocating space with tostring()) and work on that. In this assignment, you should not have to do this.

 

The new comment convention requires only straightforward fiddling of the lexical analyzer. The lexer has a lookahead mechanism that you can use to determine what single character is coming next before actually reading it. Don't forget the comment at the front of the table.
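The lookahead idea, shown with plain stdio rather than the lexer's own input routines, is roughly the following. This is a sketch under that assumption; in lex.c you should use whatever lookahead function already exists instead of writing a new one.

```c
#include <assert.h>
#include <stdio.h>

/* Return the next character of fp without consuming it, or EOF. */
int peek(FILE *fp)
{
    assert(fp != NULL);
    int c = getc(fp);
    if (c != EOF)
        ungetc(c, fp);      /* one character of pushback is guaranteed */
    return c;
}
```

After reading the first character of a possible comment, a call like this lets the lexer decide whether the next character completes a comment marker or whether the first character should be treated as an ordinary operator.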

 

For the third part, preventing reading a directory, it's easiest to write a new function, only 5 or 6 lines long, that tests whether a file is a directory, then change all the calls to fopen to go through your new function. You need to include a new header file to access the stat information, and you have to figure out the properties of the stat system call.
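One possible shape for this, using the POSIX stat call, is sketched below. The names is_directory and open_file are hypothetical, and returning NULL for a directory is just one choice; decide for yourself what awk should report in that case.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Return 1 if path names a directory, 0 otherwise. */
int is_directory(const char *path)
{
    struct stat sb;
    assert(path != NULL);
    if (stat(path, &sb) != 0)
        return 0;           /* cannot stat it; let fopen report the problem */
    return S_ISDIR(sb.st_mode) ? 1 : 0;
}

/* Like fopen, but refuse to open a directory. */
FILE *open_file(const char *path, const char *mode)
{
    if (is_directory(path))
        return NULL;
    return fopen(path, mode);
}
```

Because the wrapper takes the same arguments as fopen and returns the same type, each existing call can be switched over by changing only the function name.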