Princeton University
COS 217: Introduction to Programming Systems

Assignment 1: A "De-Comment" Program


Purpose

The purpose of this assignment is to help you learn or review (1) the fundamentals of the C programming language, (2) the details of the "de-commenting" task of the C preprocessor, and (3) how to use the Linux operating system and GNU software, especially bash, emacs, and gcc217.

Students from past semesters reported taking, on average, 8 hours to complete this assignment.


Rules

This assignment is an individual assignment, not a teams-of-two assignment.

Make sure you study the course Policies web page before doing this assignment or any of the COS 217 assignments. In particular, note that you may use a variety of "human" sources of information while doing assignments, including the course staff members, the lab teaching assistants, and other current students via Piazza.

Each assignment has a challenge part. While doing the challenge part of an assignment, you are bound to observe the course policies regarding assignment conduct as given in the course Policies web page, plus one additional policy: you may not use any "human" sources of information. That is, you may not consult with the course's staff members, the lab teaching assistants, other current students via Piazza, or any other people while working on the challenge part of an assignment, except for clarification of requirements.

The challenge part is designed to be a kind of relief valve: if you find that an assignment is taking too much time, then you might consider not doing the assignment's challenge part — thereby accepting a (typically small) penalty on your grade for that assignment.

Usually the challenge part is, as its name implies, more challenging than the rest of the assignment. Sometimes it involves material from the course's readings that hasn't (yet) been covered in lectures or precepts.

For this assignment, avoiding the use of global variables (as described below) is the challenge part. That part is worth 2 percent of this assignment. So if you don't do the challenge part and all other parts of your assignment solution are perfect and submitted on time, then your grade for the assignment will be 98 percent. In subsequent assignments the challenge part will be worth more.


Background

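The C preprocessor is an important part of the C programming system. Given a C source code file, the C preprocessor performs three jobs: it merges "physical" lines of source code into "logical" lines (joining any line that ends with a backslash to the line that follows it), it removes comments from ("de-comments") the source code, and it handles preprocessor directives (#define, #include, etc.) that reside in the source code.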

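The second of those jobs, the de-comment job, is more substantial than you might think. For example, when de-commenting a program the C preprocessor must be sensitive to line numbers (so that later error messages still refer to the correct lines) and to the boundaries of string and character literals (within which text that looks like a comment is not a comment).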


The Task

Your task is to compose a C program named decomment that performs a subset of the de-comment job of the C preprocessor, as defined below.


Functionality

Your program must be a Linux filter. A filter is a program that reads characters from the standard input stream, and writes characters to the standard output stream and possibly to the standard error stream. Specifically, your program must (1) read text, presumably a C program, from the standard input stream, (2) write that same text to the standard output stream with each comment removed, as prescribed below, and (3) write error and warning messages as appropriate to the standard error stream. A typical execution of your program from the shell might look like this:

./decomment < somefile.c > somefileWithoutComments.c 2> errorsAndWarnings

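In the following examples a space character is shown as "s" and a newline character as "n". These markers are not visually distinguished from ordinary letters here, so read each example with care: the final "n" of every stream shown is a newline, whereas the "n" in "mno" and the "s" in "stu" are ordinary letters.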

Your program must replace each single-line comment with a space. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc/*def*/ghin                  abcsghin
abc/*def*/sghin                 abcssghin
abcs/*def*/ghin                 abcssghin

Your program must define "comment" as in the C90 standard. In particular, your program must consider text of the form (/*...*/) to be a comment. It must not consider text of the form (//...) to be a comment. Example:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc//defn                       abc//defn

Your program must allow a comment to span multiple lines. That is, your program must allow a comment to contain newline characters. Your program must replace each multi-line comment with a space, followed by newline characters as necessary to preserve the original line numbering. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc/*defnghi*/jklnmnon          abcsnjklnmnon
abc/*defnghinjkl*/mnonpqrn      abcsnnmnonpqrn

Your program must not recognize nested comments. Example:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc/*def/*ghi*/jkl*/mnon        abcsjkl*/mnon

Your program must handle C string literals. In particular, your program must not consider text of the form (/*...*/) that occurs within a string literal ("...") to be a comment. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc"def/*ghi*/jkl"mnon          abc"def/*ghi*/jkl"mnon
abc/*def"ghi"jkl*/mnon          abcsmnon
abc/*def"ghijkl*/mnon           abcsmnon

Similarly, your program must handle C character literals. In particular, your program must not consider text of the form (/*...*/) that occurs within a character literal ('...') to be a comment. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc'def/*ghi*/jkl'mnon          abc'def/*ghi*/jkl'mnon
abc/*def'ghi'jkl*/mnon          abcsmnon
abc/*def'ghijkl*/mnon           abcsmnon

Note that the C compiler would consider the first of those examples to be erroneous (multiple characters in a character literal). But many C preprocessors would not, and your program must not.

Your program must handle escaped characters within string literals. That is, when your program reads a backslash (\) while processing a string literal, your program must consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program must consider text of the form ("...\"...") to be a valid string literal which happens to contain the double quote character. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc"def\"ghi"jkln               abc"def\"ghi"jkln
abc"def\'ghi"jkln               abc"def\'ghi"jkln

Similarly, your program must handle escaped characters within character literals. That is, when your program reads a backslash (\) while processing a character literal, your program must consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program must consider text of the form ('...\'...') to be a valid character literal which happens to contain the quote character. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc'def\'ghi'jkln               abc'def\'ghi'jkln
abc'def\"ghi'jkln               abc'def\"ghi'jkln

Note that the C compiler would consider both of those examples to be erroneous (multiple characters in a character literal). But many C preprocessors would not, and your program must not.

Your program must handle newline characters in C string literals without generating errors or warnings. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc"defnghi"jkln                abc"defnghi"jkln
abc"defnghinjkl"mno/*pqr*/stun  abc"defnghinjkl"mnosstun

Note that a C compiler would consider those examples to be erroneous (newline character in a string literal). But many C preprocessors would not, and your program must not.

Similarly, your program must handle newline characters in C character literals without generating errors or warnings. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc'defnghi'jkln                abc'defnghi'jkln
abc'defnghinjkl'mno/*pqr*/stun  abc'defnghinjkl'mnosstun

Note that a C compiler would consider those examples to be erroneous (multiple characters in a character literal, newline character in a character literal). But many C preprocessors would not, and your program must not.

Your program must handle unterminated string and character literals without generating errors or warnings. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc"def/*ghi*/jkln              abc"def/*ghi*/jkln
abc'def/*ghi*/jkln              abc'def/*ghi*/jkln

Note that a C compiler would consider those examples to be erroneous (unterminated string literal, unterminated character literal, multiple characters in a character literal). But many C preprocessors would not, and your program must not.

Your program must detect an unterminated comment. If your program detects end-of-file before a comment is terminated, it must write the message "Error: line X: unterminated comment" to the standard error stream. "X" must be the number of the line on which the unterminated comment begins. Examples:

Standard Input Stream           Standard Output Stream    Standard Error Stream
abc/*defnghin                   abcsnn                    Error:slines1:sunterminatedscommentn
abcdefnghi/*n                   abcdefnghisn              Error:slines2:sunterminatedscommentn
abc/*def/ghinjkln               abcsnn                    Error:slines1:sunterminatedscommentn
abc/*def*ghinjkln               abcsnn                    Error:slines1:sunterminatedscommentn
abc/*defnghi*n                  abcsnn                    Error:slines1:sunterminatedscommentn
abc/*defnghi/n                  abcsnn                    Error:slines1:sunterminatedscommentn

Your program (more precisely, its main function) must return EXIT_FAILURE if it is unsuccessful, that is, if it detects an unterminated comment and so is unable to remove comments properly. Otherwise it must return EXIT_SUCCESS or, equivalently, 0.

Your program must work for standard input lines of any length.

Your program may assume that the backslash-newline character sequence does not occur in the standard input stream. That is, your program may assume that logical lines are identical to physical lines in the standard input stream. So your de-comment program need not perform the first of the three jobs described above in the "Background" section.


The Procedure

Create your program on the CourseLab cluster using bash, emacs, and gcc217.


Step 1: Create and Populate a Project Directory

The CourseLab /u/cos217/Assignment1 directory contains files that you will find useful: sampledecomment, testdecomment, testdecommentdiff, and many files whose names end with .txt. Subsequent steps describe those files.

Create a project directory (maybe named something like /u/YOURLOGINID/decommentproj). Then copy all files from the /u/cos217/Assignment1 directory to your project directory.


Step 2: Design a DFA

Design a deterministic finite state automaton (DFA, alias FSA) that expresses the required de-commenting logic. The DFA concept is described in lectures, and in the Wikipedia Deterministic finite automaton page.

Design your DFA so it "accepts" a given sequence of characters if the sequence contains no unterminated comments. That is, when given a sequence of characters that does not contain an unterminated comment, your DFA must end in an "accepting" state. Conversely, design your DFA so it "rejects" a given sequence of characters if the sequence contains an unterminated comment. That is, when given a sequence of characters that contains an unterminated comment, your DFA must end in a "rejecting" state.

Express your DFA using the traditional "labeled ovals and labeled arrows" notation. Let each oval represent a state. Give each state a descriptive name, and indicate whether it is an "accepting" state or a "rejecting" state. Let each arrow represent a transition from one state to another. Label each arrow with the single character, or class of characters, that causes the transition to occur. We encourage (but do not require) you also to label each arrow with action(s) that must occur (for example, "print the character") when the corresponding transition occurs.

Express as much of the de-commenting logic as you can within your DFA. The more de-commenting logic you express in your DFA, the better your grade on the DFA will be.

To properly report unterminated comments, your program must contain logic to keep track of the current line number of the standard input stream. You need not show that logic in your DFA.

Convert your "labeled ovals and labeled arrows" DFA to a textual representation, placing the result in your project directory in a file named dfa. The document TextualDFAs contains examples. Make sure you indicate explicitly which state is the DFA's start state, and whether each state is an accepting state or a rejecting state.

The name of the file must be dfa, not dfa.txt, not DFA, not Dfa, etc.

Step 3: Create Source Code

Use emacs to create source code in your project directory in a file named decomment.c. The decomment.c program must implement your DFA.

If your DFA exits while in an "accepting" state, then your program's exit status must be EXIT_SUCCESS or 0. If your DFA exits while in a "rejecting" state, then your program's exit status must be EXIT_FAILURE. In other words, if your program detects no unterminated comments, then its exit status must be EXIT_SUCCESS or 0; if your program detects an unterminated comment, then its exit status must be EXIT_FAILURE.

Your program must not consist of one large main function. Instead your program must consist of multiple small functions, each of which performs a single well-defined task. In this program you must create one function to implement each state of your DFA, as described in lectures.

Generally, a (large) C program must consist of multiple source code files. For this assignment, you need not split your source code into multiple files. Instead, place all source code in a single source code file. Subsequent assignments will ask you to compose programs consisting of multiple source code files.

We suggest that your program use the standard C getchar function to read characters from the standard input stream.


Step 4: Build (Part 1)

Use the gcc217 command to build your program. At this point issue the "shortcut" gcc217 command to preprocess, compile, assemble, and link your program all at once.


Step 5: Execute

Execute your program multiple times on various input files that test all statements in your program.

As noted previously, we have provided files in CourseLab directory /u/cos217/Assignment1 that you will find helpful:

You also must test your program against its own source code using a command sequence such as this:

./sampledecomment < decomment.c > output1 2> errors1
./decomment < decomment.c > output2 2> errors2
diff -y output1 output2
diff -y errors1 errors2
rm output1 errors1 output2 errors2

Repeat Steps 2, 3, 4, and 5 until your program handles all test files and its own source code properly.


Step 6: Build (Part 2)

Use the gcc217 command to build your program "the long way" by issuing distinct gcc217 commands to preprocess, compile, assemble, and link your program. Examine the intermediate results by issuing these commands:

emacs decomment.i
emacs decomment.s
emacs decomment.o
emacs decomment

Step 7: Create a readme File

Use emacs to edit your copy of the given readme file by answering each question that is expressed therein.

One of the sections of your readme file requires you to list the authorized sources of information that you used to complete the assignment. Another section requires you to list the unauthorized sources of information that you used to complete the assignment. Your grader will not grade your submission unless you have completed those sections. To complete the "authorized sources" section of your readme file, copy the list of authorized sources given in the "Policies" web page to that section, and edit it as appropriate.

Descriptions of your code must not be in the readme file. Instead they must be integrated into your code as comments.

Your readme file must be a plain text file. Don't create your readme file using Microsoft Word or any other word processor. The name of the file must be readme, not readme.txt, not README, not Readme, etc.


Step 8: Provide Feedback

Provide the instructors with your feedback on the assignment. To do that, issue this command:

FeedbackCOS217.py 1

and answer the questions that it asks. That command stores its questions and your answers in a file named feedback in your working directory.


Step 9: Submit

Submit your work. Submit your dfa file, your decomment.c file, the files that gcc217 generated from it, your readme file, and your feedback file electronically by issuing these commands on CourseLab:

submit 1 dfa
submit 1 decomment.c
submit 1 decomment.i decomment.s decomment.o decomment
submit 1 readme feedback

As described in the Assignment 0 specification, you can issue submitandbackup X file1 file2 ... commands instead of submit X file1 file2 ... commands.

We can accept your files only if you submit them by executing submit (or submitandbackup) commands on CourseLab. In particular, we cannot accept your files via e-mail. We cannot accept your DFA in any form other than as a file containing plain text.


Program Style

In part, good program style is defined by the rules given in The Practice of Programming (Kernighan and Pike), as summarized by the Rules of Programming Style document. For this assignment we will pay particular attention to rules 1-24.

These more course-specific rules also apply:


Grading

Minimal requirements to receive credit for decomment.c:

To receive credit for the challenge part of the assignment, your program must write proper line numbers within its "unterminated comment" error messages, and must do so without using global variables. The next section of this assignment specification elaborates.

We will grade your work on two kinds of quality:

To encourage good coding practices, we will deduct points if gcc217 generates warning messages.


Avoiding Global Variables

As noted above, when the standard input stream contains an unterminated comment, your decomment program must write an error message. The error message must contain the number of the line at which the unterminated comment begins. The challenge part of the assignment is to implement that logic without using global variables. Here's an elaboration...

Suppose your program's main function calls some state handling function, and that main wishes to pass some values to the state handling function. The main function can do that using ordinary parameters.

But suppose the state handling function wishes to pass some values back to main. The state handling function can use its return value to pass the first value (for example, the next DFA state) back to main. But how can it pass additional values (for example, a line number) back to main?

One approach is to use global variables, where a global variable is one which is defined outside of all functions. The state handling function could assign the additional values (for example, a line number) to global variables; main then could fetch those values by accessing the global variables.

However, as prescribed by Kernighan and Pike style rule 25, and for reasons that we will discuss later in the course, generally you should avoid using global variables. Instead all communication of values into and out of a function should occur via the function's parameters and its return value.

Indeed in your decomment program you must avoid using global variables. You can do that using either of these two approaches:

Some notes:


This assignment was written by Robert M. Dondero, Jr.
with input from COS 217 preceptors and students