Princeton University
COS 217: Introduction to Programming Systems
Assignment 5: An IA-32 Assembler
Purpose
The purpose of this assignment is to help you understand IA-32 assembly
language, IA-32 machine language, and the assembly process.
Background
An IA-32 assembler reads a source code (.s) file containing IA-32 assembly
language code and writes an object (.o) file containing IA-32 machine language
code.
The object file is expressed in the UNIX Executable and Linking Format (ELF). An
ELF file consists of sections. Three of the most important sections are the data
section, the bss section, and the text section.
An assembler makes two passes. During pass 1 the
assembler traverses the assembly language code to generate the sections that comprise the machine
language code. It also generates a symbol table section; each binding of the symbol table relates a label to
an offset within a particular section. Finally, it
generates a relocation record section; each relocation record indicates
an area of the generated code that must be
patched. Then during pass 2 the assembler traverses the relocation records to patch the machine
language code in the data and text sections.
Your Task
Your task in this assignment is to use the C programming language to create
an assembler that can process a subset of IA-32 assembly language. More precisely...
You are given code that reads an assembly language source code file and creates an in-memory
representation of the assembly language program. Your job is to create
code that accepts the in-memory representation, performs two passes (as
described above), and thereby produces an in-memory representation of the
program's data, bss, text, symbol table, and relocation record sections. Finally, you are given code that accepts
those in-memory sections and writes the object file.
The Assembly Language Subset
Your assembler should accept the subset of IA-32 assembly
language that is described in A Subset of IA-32 Assembly
Language for the Assembler Assignment.
Note in particular these features of the subset:
- Only data, bss, and text sections are allowed. The rodata
section is excluded.
- Assembly-time expressions must be simple decimal literals. For
example, these instructions contain assembly-time expressions (highlighted)
that are excluded:
.skip 4 * 100
.word (18 + 5) * 3
movl 'a', %eax
movl 0x20, %ecx
- Various operand types are allowed within movl and movb mnemonics.
- Within movl and movb mnemonics, the only type of memory operands allowed
are "base-pointer addressing" operands.
- Only register operands are allowed in the pushl, popl, addl, subl, imull,
idivl, and cmpl mnemonics.
- Despite its small size, the subset is reasonably complete. Many tasks can
be expressed using the subset, although often not as easily as with the
complete assembly language.
The Modules
[Some of the modules are defined as abstract data types (ADTs), as described
previously in our course. Other modules are defined as "abstract objects."
An abstract object is a simpler alternative to an ADT. An ADT is appropriate
when your program needs to create many objects of the same type; an abstract
object is appropriate only when your program uses exactly one
object of its type. See Sections 19.1 and 19.2 of our King textbook for more
information about abstract objects.]
You are given modules that accomplish the task of reading the assembly
language program and creating an in-memory representation:
- Lexer (abstract object). The Lexer object is a lexical analyzer. It reads the sequence of
characters that comprises an assembly language source code program, and converts it into a
sequence of tokens.
- Parser (abstract object). The Parser object is a syntactic analyzer. It accepts the sequence of
tokens generated by the Lexer object, and verifies that the sequence is a
syntactically valid assembly language program. While doing so
it calls functions of other modules to translate the assembly language
source code into an in-memory representation.
You also are given modules that comprise the in-memory representation:
- Program (abstract object). The Program object is the in-memory representation
of the assembly language program. It contains Instr objects.
- Instr (abstract data type). An Instr object is an in-memory
representation of an assembly language instruction. An Instr object has a
name (e.g. addl, .globl), a type (directive, mnemonic, or label), and
Operand
objects.
- Operand (abstract data type). An Operand object is an in-memory
representation of an assembly language operand. An Operand object is either a
number operand (e.g. 321), a label operand (e.g. somelabel, main), a string
operand (e.g. "somestring", ".text"), an immediate number operand (e.g. $321),
an immediate label operand (e.g. $somelabel), a register operand (e.g. %eax),
or a memory reference operand (e.g. 321(%eax)).
Finally, you are given these modules:
- LabelInfo (abstract data type). A LabelInfo object contains
information about a label: its section, its offset within its section, its
type (local or global), and its label sequence number. The SymTable object
that your pass1 code creates should contain values that are LabelInfo
objects.
- Section (abstract data type). A Section object contains an
in-memory representation of a section. Your pass1 and pass 2 code should
create three Section objects: one for the data section, one for the bss
section, and one for the text section. A Section object contains the code that comprises the section. The text
and data
Section objects also may contain RelRecord objects.
- RelRecord (abstract data type). A RelRecord object is a relocation
record through which your assembler can indicate
"patches" that its pass2 or the linker must perform. A RelRecord object
contains the offset of the patch, and a label sequence number that
identifies a label, and the type of the patch. The type is either a
relative patch (appropriate for patching a jump or call instruction), or
an absolute patch (appropriate for patching movl instructions that
contain an immediate label operand).
- ElfGen (abstract object). The ElfGen object accepts the SymTable
object and the three
Section objects, and writes the object file.
- Assembler. The Assembler module contains a main function that
orchestrates the assembly process by calling functions of the other modules.
You also should use your SymTable ADT from Assignment 3. You may use
either the linked list or the hash table implementation of your SymTable ADT. If your SymTable ADT is not working, we will provide you with working symtable.h
and symtable.o files at your request.
You should create modules named Pass1 and Pass2, as described
below.
The Pass1 Module
Pass1 should be an abstract object. The Assembler object will give the Pass1
object an empty SymTable object, and three empty Section objects: one for the data
section, one for the bss section, and one for the text section.
The Pass1 module should:
- Fetch data
from the Program object, the Instr objects that it contains, and Operand objects
that they contain.
- Create bindings in the given SymTable object. The Pass1 object should
create one binding for each label that is used in the program.
Each key should be a label. Each value should be a LabelInfo object.
- Fill in the text, data, and bss Section objects with machine language
code.
- Create RelRecord objects associated with the text and data Section
objects, thus indicating areas of those sections that must be patched.
Your Pass1 object may assume that the Parser object checks the given assembly language program
for syntax errors. Your Pass1 object should detect and report only one semantic
error: a duplicate definition of a label.
Your Pass1 object should contain no memory leaks. Your Pass1 should free all memory that
it has dynamically allocated. More precisely, the
given modules take care of freeing memory consumed by the Program, Instr,
Operand, Section, RelRecord, SymTable, and LabelInfo objects. Your Pass1 object
should free any
other memory that it dynamically allocates. It is likely that there will be no
such memory.
Testing the Pass1 Module
You can test your Pass1 module using:
- the given sampleassembler program,
- the given assembly language
programs (01testdata.s, 02testdatabss.s, 03testglobl.s, 04testaddsub.s,
05testmuldiv.s, 06testpushpop.s, 07testmovl.s, 08testmovb.s, 09testcallret.s,
10testjump.s, 11testunusual.s, 12testduplabel.s),
- the standard UNIX readelf command,
- the standard UNIX objdump command,
and this procedure:
- Temporarily comment-out all calls to Pass2 functions in the assembler.c
file.
- Execute the command "sampleassembler -o one.o 01testdata.s", thus producing the file one.o.
- Execute the command "assembler -o two.o 01testdata.s", thus producing the file two.o.
- Execute the command "readelf -s one.o" to display the symbol
table of one.o to standard output.
- Execute the command "readelf -s two.o" to display the symbol
table of two.o to standard output.
- Compare the symbol tables. they should be identical.
- Execute the command "objdump -s --section=.data one.o" to display the
contents of the data section of one.o to standard output.
- Execute the command "objdump -s --section=.data two.o" to display the
contents of the data section of two.o to standard output.
- Compare the data sections. They should be identical.
- Execute the command "objdump -d one.o" to display the
contents of the text section of one.o to standard output.
- Execute the command "objdump -d two.o" to display the
contents of the text section of two.o to standard output.
- Compare the text sections. Then should be identical.
- Execute the command "readelf -r one.o" to display the relocation
records of one.o to standard output.
- Execute the command "readelf -r two.o" to display the relocation
records of two.o to standard output.
- Compare the relocation records. They should be identical.
- Repeat steps 2 to 15 for each example assembly language program.
You also should test your Pass1 module by using it to assemble the 16hello.s and 17powerfunction.s
assembly language programs, thus producing 16hello.o and 17powerfunction.o. Then use gcc to link those
file, thus producing
executable files. Finally, execute those files. They should execute properly.You will find the given
bash shell script objdiff useful. Given an assembly
language program, it assembles the program using sampleassembler and using your assembler,
and then uses the standard UNIX
diff command to compare the symbol table, data section, text section, and
relocation records of the two resulting object files. Thus it implements steps
2-15 as described above.
You will also find the given bash shell script grade5 useful. It executes
objdiff for each example assembly language program. It also uses your assembler
to assemble 16hello.s and 17powerfunction.s, uses gcc to link them, and
executes them.
The Pass2 Module
Pass2 should be an abstract object. The Assembler object will give the Pass2
object the filled SymTable object (with its LabelInfo objects), the filled text Section object
(with its RelRecord objects), and the filled data Section object (with its RelRecord objects). The Pass2 object should traverse each Section's RelRecord
objects. For each RelRecord object:
- If Pass2 has enough information to patch the code as specified by the
RelRecord object, then it should do so, and should remove the RelRecord object
from the Section object.
- If Pass2 does not have enough information to patch the code as specified
by the RelRecord object, then it should leave the RelRecord object intact --
thus informing the linker that it must perform the patch.
Your Pass2 object should contain no memory leaks. Your Pass2 should free all
memory that it has dynamically allocated. More precisely, the given modules take
care of freeing memory consumed by the Section, RelRecord, SymTable, and
LabelInfo objects. Your Pass2 object should free any other memory that it
dynamically allocates.
Testing the Pass2 Module
You can test your Pass2 module using the procedure described above, adding
the 13testpass2call.s, 14testpass2jump.s, and 15testpass2movl.s files.
Logistics
You should develop on hats. Use xemacs to create
source code. Use gdb to debug.
The directory /u/cos217/Assignment5 contains files that are pertinent to the
assignment:
- sampleassembler. sampleassembler is an executable file of a working assembler.
- The source code files for all given modules.
- A partial makefile that you can use as the basis for defining your
complete makefile. Specifically, you must add the dependency rules for your Pass1 and Pass2
modules, and for any additional modules that you create. (It is likely
that you will create no additional modules.)
- The assembly language programs that you can use to test your assembler.
- The objdiff bash shell script.
- The grade5 bash shell script.
You should submit:
- The source code files for your Pass1 and Pass2 modules (pass1.h, pass1.c,
pass2.h, pass2.c).
- The source code files for your SymTable module (symtable.h, symtable{list|hash}.c).
- The source code files for any additional modules that you have created.
- A makefile. The first dependency rule should build your entire program,
compiling with the -Wall, -ansi, and -pedantic options.
- A readme file.
Your readme file should contain:
- Your name.
- A description of whatever help (if any) you received from others while
doing the assignment, and the names of any individuals with whom you
collaborated, as prescribed by the course "Policies" web page.
- (Optionally) An indication of how much time you spent doing the
assignment.
- (Optionally) Your assessment of the assignment.
- (Optionally) Any information that will help us to grade your work in the
most favorable light. In particular you should describe all known bugs.
You should not submit the given source code files.
Submit your work electronically via the command:
/u/cos217/bin/i686/submit 5 pass1.h pass1.c pass2.h pass2.c
symtable.h symtable{list|hash}.c othersourcecodefilesthatyoucreated makefile readme
Grading
As always, we will grade your work on functionality and design, and will consider
understandability to be an important aspect of good design. Defining each
function so it does a single small well-defined task is critical to
understandability, as are function-level
comments in both .h and .c files. To encourage good coding practices,
we will take off points based on warning messages during compilation.