Princeton University
COS 217: Introduction to Programming Systems

Assignment 5: An IA-32 Assembler

Purpose

The purpose of this assignment is to help you understand IA-32 assembly language, IA-32 machine language, and the assembly process.

Background

An IA-32 assembler reads a source code (.s) file containing IA-32 assembly language code and writes an object (.o) file containing IA-32 machine language code. The object file is expressed in the UNIX Executable and Linking Format (ELF). An ELF file consists of sections. Three of the most important sections are the data section, the bss section, and the text section.

An assembler makes two passes. During pass 1 the assembler traverses the assembly language code to generate the sections that comprise the machine language code. It also generates a symbol table section; each binding of the symbol table relates a label to an offset within a particular section. Finally, it generates a relocation record section; each relocation record indicates an area of the generated code that must be patched. Then during pass 2 the assembler traverses the relocation records to patch the machine language code in the data and text sections.

Your Task

Your task in this assignment is to use the C programming language to create an assembler that can process a subset of IA-32 assembly language. More precisely...

You are given code that reads an assembly language source code file and creates an in-memory representation of the assembly language program. Your job is to create code that accepts the in-memory representation, performs two passes (as described above), and thereby produces an in-memory representation of the program's data, bss, text, symbol table, and relocation record sections. Finally, you are given code that accepts those in-memory sections and writes the object file.

The Assembly Language Subset

Your assembler should accept the subset of IA-32 assembly language that is described in A Subset of IA-32 Assembly Language for the Assembler Assignment. Note in particular these features of the subset:

Only data, bss, and text sections are allowed. The rodata section is excluded.
Assembly-time expressions must be simple decimal literals. For example, these instructions contain assembly-time expressions (highlighted) that are excluded:
```
.skip   4 * 100
.word   (18 + 5) * 3
movl    'a', %eax
movl    0x20, %ecx
```
Various operand types are allowed within movl and movb mnemonics.
Within movl and movb mnemonics, the only type of memory operands allowed are "base-pointer addressing" operands.
Only register operands are allowed in the pushl, popl, addl, subl, imull, idivl, and cmpl mnemonics.
Despite its small size, the subset is reasonably complete. Many tasks can be expressed using the subset, although often not as easily as with the complete assembly language.

The Modules

[Some of the modules are defined as abstract data types (ADTs), as described previously in our course. Other modules are defined as "abstract objects." An abstract object is a simpler alternative to an ADT. An ADT is appropriate when your program needs to create many objects of the same type; an abstract object is appropriate only when your program uses exactly one object of its type. See Sections 19.1 and 19.2 of our King textbook for more information about abstract objects.]

You are given modules that accomplish the task of reading the assembly language program and creating an in-memory representation:

Lexer (abstract object). The Lexer object is a lexical analyzer. It reads the sequence of characters that comprises an assembly language source code program, and converts it into a sequence of tokens.
Parser (abstract object). The Parser object is a syntactic analyzer. It accepts the sequence of tokens generated by the Lexer object, and verifies that the sequence is a syntactically valid assembly language program. While doing so it calls functions of other modules to translate the assembly language source code into an in-memory representation.

You also are given modules that comprise the in-memory representation:

Program (abstract object). The Program object is the in-memory representation of the assembly language program. It contains Instr objects.
Instr (abstract data type). An Instr object is an in-memory representation of an assembly language instruction. An Instr object has a name (e.g. addl, .globl), a type (directive, mnemonic, or label), and Operand objects.
Operand (abstract data type). An Operand object is an in-memory representation of an assembly language operand. An Operand object is either a number operand (e.g. 321), a label operand (e.g. somelabel, main), a string operand (e.g. "somestring", ".text"), an immediate number operand (e.g. $321), an immediate label operand (e.g. $somelabel), a register operand (e.g. %eax), or a memory reference operand (e.g. 321(%eax)).

Finally, you are given these modules:

LabelInfo (abstract data type). A LabelInfo object contains information about a label: its section, its offset within its section, its type (local or global), and its label sequence number. The SymTable object that your pass1 code creates should contain values that are LabelInfo objects.
Section (abstract data type). A Section object contains an in-memory representation of a section. Your pass1 and pass 2 code should create three Section objects: one for the data section, one for the bss section, and one for the text section. A Section object contains the code that comprises the section. The text and data Section objects also may contain RelRecord objects.
RelRecord (abstract data type). A RelRecord object is a relocation record through which your assembler can indicate "patches" that its pass2 or the linker must perform. A RelRecord object contains the offset of the patch, and a label sequence number that identifies a label, and the type of the patch. The type is either a relative patch (appropriate for patching a jump or call instruction), or an absolute patch (appropriate for patching movl instructions that contain an immediate label operand).
ElfGen (abstract object). The ElfGen object accepts the SymTable object and the three Section objects, and writes the object file.
Assembler. The Assembler module contains a main function that orchestrates the assembly process by calling functions of the other modules.

You also should use your SymTable ADT from Assignment 3. You may use either the linked list or the hash table implementation of your SymTable ADT. If your SymTable ADT is not working, we will provide you with working symtable.h and symtable.o files at your request.

You should create modules named Pass1 and Pass2, as described below.

The Pass1 Module

Pass1 should be an abstract object. The Assembler object will give the Pass1 object an empty SymTable object, and three empty Section objects: one for the data section, one for the bss section, and one for the text section.

The Pass1 module should:

Fetch data from the Program object, the Instr objects that it contains, and Operand objects that they contain.
Create bindings in the given SymTable object. The Pass1 object should create one binding for each label that is used in the program. Each key should be a label. Each value should be a LabelInfo object.
Fill in the text, data, and bss Section objects with machine language code.
Create RelRecord objects associated with the text and data Section objects, thus indicating areas of those sections that must be patched.

Your Pass1 object may assume that the Parser object checks the given assembly language program for syntax errors. Your Pass1 object should detect and report only one semantic error: a duplicate definition of a label.

Your Pass1 object should contain no memory leaks. Your Pass1 should free all memory that it has dynamically allocated. More precisely, the given modules take care of freeing memory consumed by the Program, Instr, Operand, Section, RelRecord, SymTable, and LabelInfo objects. Your Pass1 object should free any other memory that it dynamically allocates. It is likely that there will be no such memory.

Testing the Pass1 Module

You can test your Pass1 module using:

the given sampleassembler program,
the given assembly language programs (01testdata.s, 02testdatabss.s, 03testglobl.s, 04testaddsub.s, 05testmuldiv.s, 06testpushpop.s, 07testmovl.s, 08testmovb.s, 09testcallret.s, 10testjump.s, 11testunusual.s, 12testduplabel.s),
the standard UNIX readelf command,
the standard UNIX objdump command,

and this procedure:

Temporarily comment-out all calls to Pass2 functions in the assembler.c file.
Execute the command "sampleassembler -o one.o 01testdata.s", thus producing the file one.o.
Execute the command "assembler -o two.o 01testdata.s", thus producing the file two.o.
Execute the command "readelf -s one.o" to display the symbol table of one.o to standard output.
Execute the command "readelf -s two.o" to display the symbol table of two.o to standard output.
Compare the symbol tables. they should be identical.
Execute the command "objdump -s --section=.data one.o" to display the contents of the data section of one.o to standard output.
Execute the command "objdump -s --section=.data two.o" to display the contents of the data section of two.o to standard output.
Compare the data sections. They should be identical.
Execute the command "objdump -d one.o" to display the contents of the text section of one.o to standard output.
Execute the command "objdump -d two.o" to display the contents of the text section of two.o to standard output.
Compare the text sections. Then should be identical.
Execute the command "readelf -r one.o" to display the relocation records of one.o to standard output.
Execute the command "readelf -r two.o" to display the relocation records of two.o to standard output.
Compare the relocation records. They should be identical.
Repeat steps 2 to 15 for each example assembly language program.

You also should test your Pass1 module by using it to assemble the 16hello.s and 17powerfunction.s assembly language programs, thus producing 16hello.o and 17powerfunction.o. Then use gcc to link those file, thus producing executable files. Finally, execute those files. They should execute properly.

You will find the given bash shell script objdiff useful. Given an assembly language program, it assembles the program using sampleassembler and using your assembler, and then uses the standard UNIX diff command to compare the symbol table, data section, text section, and relocation records of the two resulting object files. Thus it implements steps 2-15 as described above.

You will also find the given bash shell script grade5 useful. It executes objdiff for each example assembly language program. It also uses your assembler to assemble 16hello.s and 17powerfunction.s, uses gcc to link them, and executes them.

The Pass2 Module

Pass2 should be an abstract object. The Assembler object will give the Pass2 object the filled SymTable object (with its LabelInfo objects), the filled text Section object (with its RelRecord objects), and the filled data Section object (with its RelRecord objects). The Pass2 object should traverse each Section's RelRecord objects. For each RelRecord object:

If Pass2 has enough information to patch the code as specified by the RelRecord object, then it should do so, and should remove the RelRecord object from the Section object.
If Pass2 does not have enough information to patch the code as specified by the RelRecord object, then it should leave the RelRecord object intact -- thus informing the linker that it must perform the patch.

Your Pass2 object should contain no memory leaks. Your Pass2 should free all memory that it has dynamically allocated. More precisely, the given modules take care of freeing memory consumed by the Section, RelRecord, SymTable, and LabelInfo objects. Your Pass2 object should free any other memory that it dynamically allocates.

Testing the Pass2 Module

You can test your Pass2 module using the procedure described above, adding the 13testpass2call.s, 14testpass2jump.s, and 15testpass2movl.s files.

Logistics

You should develop on hats. Use xemacs to create source code. Use gdb to debug.

The directory /u/cos217/Assignment5 contains files that are pertinent to the assignment:

sampleassembler. sampleassembler is an executable file of a working assembler.
The source code files for all given modules.
A partial makefile that you can use as the basis for defining your complete makefile. Specifically, you must add the dependency rules for your Pass1 and Pass2 modules, and for any additional modules that you create. (It is likely that you will create no additional modules.)
The assembly language programs that you can use to test your assembler.
The objdiff bash shell script.
The grade5 bash shell script.

You should submit:

The source code files for your Pass1 and Pass2 modules (pass1.h, pass1.c, pass2.h, pass2.c).
The source code files for your SymTable module (symtable.h, symtable{list|hash}.c).
The source code files for any additional modules that you have created.
A makefile. The first dependency rule should build your entire program, compiling with the -Wall, -ansi, and -pedantic options.
A readme file.

Your readme file should contain:

Your name.
A description of whatever help (if any) you received from others while doing the assignment, and the names of any individuals with whom you collaborated, as prescribed by the course "Policies" web page.
(Optionally) An indication of how much time you spent doing the assignment.
(Optionally) Your assessment of the assignment.
(Optionally) Any information that will help us to grade your work in the most favorable light. In particular you should describe all known bugs.

You should not submit the given source code files.

Submit your work electronically via the command:

/u/cos217/bin/i686/submit 5 pass1.h pass1.c pass2.h pass2.c 
   symtable.h symtable{list|hash}.c othersourcecodefilesthatyoucreated makefile readme

Grading

As always, we will grade your work on functionality and design, and will consider understandability to be an important aspect of good design. Defining each function so it does a single small well-defined task is critical to understandability, as are function-level comments in both .h and .c files. To encourage good coding practices, we will take off points based on warning messages during compilation.

Princeton University COS 217: Introduction to Programming Systems