Princeton University
COS 217: Introduction to Programming Systems

Assignment 1: A "De-Comment" Program

Purpose

The purpose of this assignment is to help you learn or review (1) the fundamentals of the C programming language, (2) the details of the "de-commenting" task of the C preprocessor, and (3) how to use the GNU/UNIX programming tools, especially bash, xemacs, and gcc.

Background

The C preprocessor is an important part of the C programming system. Given a C source code file, the C preprocessor performs three jobs:

Merge "physical" lines of source code into "logical" lines. That is, when the preprocessor detects a backslash character immediately followed by a newline character, it discards both of those characters.
Remove comments from ("de-comment") the source code.
Handle preprocessor directives (#define, #include, etc.) that reside in the source code.

The "de-comment" job is substantial. For example, note that the C preprocessor must be sensitive to:

The fact that a comment is a token delimiter. After removing a comment, the C preprocessor must make sure that a whitespace character is in its place.
Line numbers. After removing a comment, the C preprocessor sometimes must insert blank lines in its place to preserve the original line numbering.
String and character literal boundaries. The preprocessor must not consider the character sequence (/*...*/) to be a comment if it occurs inside a string literal ("...") or character literal ('...').

Your Task

Your task is to compose a C program named "decomment" that performs a subset of the de-comment job of the C preprocessor, as defined below.

Your program should be structured as a UNIX "filter." That is, your program should read characters from standard input, and write characters to standard output and possibly to standard error. Specifically, your program should (1) read text, presumably a C program, from standard input, (2) write that same text to standard output with each comment replaced by a space, and (3) write error and warning messages as appropriate to standard error. A typical command-line execution of your program might look like this:

decomment < somefile.c > somefilewithoutcomments.c 2> errorandwarningmessages
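
The filter structure described above can be sketched as follows. This is a minimal illustration, not the assignment solution: the function name copyStream and its parameter names are hypothetical, and the function copies its input unchanged instead of removing comments. It takes FILE pointers so that it can be exercised on any stream; a filter would pass stdin and stdout.

```c
#include <stdio.h>

/* Copy every character from psIn to psOut unchanged. Return the
   number of characters copied. The names are hypothetical; a real
   filter would pass stdin and stdout. */
long copyStream(FILE *psIn, FILE *psOut)
{
   int iChar;        /* int, not char, so that EOF is representable */
   long lCount = 0;

   while ((iChar = getc(psIn)) != EOF) {
      putc(iChar, psOut);
      lCount++;
   }
   return lCount;
}
```

Calling copyStream(stdin, stdout) from main would yield a filter that passes its input through untouched. Error and warning messages would go to stderr via fprintf, so that redirecting standard output with ">" does not capture them.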

Functionality

In the following examples a space is shown as "_s" and a newline character as "_n". Your program should:

Replace each completed comment with a single space. Examples:

Standard Input              Standard Output             Standard Error

abc/*def*/ghi_n             abc_sghi_n

abc/*def*/_sghi_n           abc_ssghi_n

abc_s/*def*/ghi_n           abc_ssghi_n

Define "comment" as in the C89 standard. In particular, your program should consider text of the form (/* ... */) to be a comment. It should not consider text of the form (// ... ) to be a comment. Example:

Standard Input              Standard Output             Standard Error

abc//def_n                  abc//def_n

Allow a comment to span multiple lines. That is, your program should allow a comment to contain newline characters. Your program should add blank lines as necessary to preserve the original line numbering. Examples:

Standard Input              Standard Output             Standard Error

abc/*def_n                  abc_n
ghi*/jkl_n                  _sjkl_n
mno_n                       mno_n

abc/*def_n                  abc_n
ghijkl_n                    _n
mno*/pqr_n                  _spqr_n
stu_n                       stu_n

Not recognize nested comments. Example:

Standard Input              Standard Output             Standard Error

abc/*def/*ghi*/jkl*/mno_n   abc_sjkl*/mno_n

Handle C string literals. In particular, your program should not consider text of the form (/* ... */) that occurs within a string literal ("...") to be a comment. Examples:

Standard Input              Standard Output             Standard Error

abc"def/*ghi*/jkl"mno_n     abc"def/*ghi*/jkl"mno_n

abc/*def"ghi"jkl*/mno_n     abc_smno_n

abc/*def"ghijkl*/mno_n      abc_smno_n

abc"def'ghi'jkl"mno_n       abc"def'ghi'jkl"mno_n

Handle C character literals. In particular, your program should not consider text of the form (/* ... */) that occurs within a character literal ('...') to be a comment. Examples:

Standard Input              Standard Output             Standard Error

abc'def/*ghi*/jkl'mno_n     abc'def/*ghi*/jkl'mno_n

abc/*def'ghi'jkl*/mno_n     abc_smno_n

abc/*def'ghijkl*/mno_n      abc_smno_n

abc'def"ghi"jkl'mno_n       abc'def"ghi"jkl'mno_n

Note that the C compiler would consider some of those examples to be erroneous. But the C preprocessor would not, and your program should not.

Handle escaped characters within string literals. That is, when your program reads a backslash (\) while processing a string literal, your program should consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program should consider text of the form ("...\" ...") to be a valid string literal which happens to contain the double quote character. Examples:

Standard Input              Standard Output             Standard Error

abc"def\"ghi"jkl_n          abc"def\"ghi"jkl_n

abc"def\'ghi"jkl_n          abc"def\'ghi"jkl_n
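
The escape rule above can be pictured as a three-state fragment of an FSA. The sketch below is only an illustration under assumed names (StringState, stepString, and the state names are all hypothetical); it tracks state without writing any output and ignores comments and character literals.

```c
/* Hypothetical state names for the string-literal part of an FSA. */
enum StringState { OUTSIDE, IN_STRING, ESCAPE_SEEN };

/* Return the state that follows eState when character iChar is read.
   This function only tracks state; it writes nothing. */
enum StringState stepString(enum StringState eState, int iChar)
{
   switch (eState) {
      case OUTSIDE:
         return (iChar == '"') ? IN_STRING : OUTSIDE;
      case IN_STRING:
         if (iChar == '\\')
            return ESCAPE_SEEN;
         return (iChar == '"') ? OUTSIDE : IN_STRING;
      case ESCAPE_SEEN:
         /* Whatever follows a backslash is ordinary text. */
         return IN_STRING;
   }
   return eState;
}
```

Because the ESCAPE_SEEN state returns to IN_STRING regardless of the character read, an escaped double quote cannot terminate the literal.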

Handle escaped characters within character literals. That is, when your program reads a backslash (\) while processing a character literal, your program should consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program should consider text of the form ('...\' ...') to be a valid character literal which happens to contain the quote character. Examples:

Standard Input              Standard Output             Standard Error

abc'def\'ghi'jkl_n          abc'def\'ghi'jkl_n

abc'def\"ghi'jkl_n          abc'def\"ghi'jkl_n

Note that the C compiler would consider both of those examples to be erroneous. But the C preprocessor would not, and your program should not.

Allow multi-line C string literals, but generate warning messages when they occur. Specifically, your program should write the message "Warning: line X: newline in string literal" when a newline character occurs within a string literal. "X" should be the number of the line which contains the offending newline character. Examples:

Standard Input              Standard Output             Standard Error

abc"def_n                   abc"def_n                   Warning:_sline_s1:_snewline_sin_sstring_sliteral_n
ghi"jkl_n                   ghi"jkl_n

abc"def_n                   abc"def_n                   Warning:_sline_s1:_snewline_sin_sstring_sliteral_n
ghi_n                       ghi_n                       Warning:_sline_s2:_snewline_sin_sstring_sliteral_n
jkl"mno_n                   jkl"mno_n
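
One way to produce the warning with the right line number is to report before advancing the line counter. The sketch below assumes a hypothetical helper, noteCharInString, that is called for each character read while inside a string literal; it is an illustration, not a prescribed design.

```c
#include <stdio.h>

/* Hypothetical helper. iChar was read inside a string literal on
   line iLine. If iChar is a newline, write the warning to standard
   error. Return the line number of the next character. */
int noteCharInString(int iChar, int iLine)
{
   if (iChar == '\n') {
      fprintf(stderr,
         "Warning: line %d: newline in string literal\n", iLine);
      return iLine + 1;
   }
   return iLine;
}
```

Since the message is written before the counter is incremented, "X" is the line that contains the offending newline, as the examples above require.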

Allow multi-line C character literals, but generate warning messages when they occur. Specifically, your program should write the message "Warning: line X: newline in character literal" when a newline character occurs within a character literal. "X" should be the number of the line which contains the offending newline character. Examples:

Standard Input              Standard Output             Standard Error

abc'def_n                   abc'def_n                   Warning:_sline_s1:_snewline_sin_scharacter_sliteral_n
ghi'jkl_n                   ghi'jkl_n

abc'def_n                   abc'def_n                   Warning:_sline_s1:_snewline_sin_scharacter_sliteral_n
ghi_n                       ghi_n                       Warning:_sline_s2:_snewline_sin_scharacter_sliteral_n
jkl'mno_n                   jkl'mno_n

Note that the C compiler would consider both of those examples to be erroneous. But the C preprocessor would not, and your program should not.

Detect an unterminated string literal. If your program detects end-of-file before a string literal is terminated, it should write the message "Error: line X: unterminated string literal" to standard error. "X" should be the number of the line on which the unterminated string literal begins. Examples:

Standard Input              Standard Output             Standard Error

abc"def_n                   abc"def_n                   Warning:_sline_s1:_snewline_sin_sstring_sliteral_n
ghi_n                       ghi_n                       Warning:_sline_s2:_snewline_sin_sstring_sliteral_n
jkl_n                       jkl_n                       Warning:_sline_s3:_snewline_sin_sstring_sliteral_n
                                                        Error:_sline_s1:_sunterminated_sstring_sliteral_n

Detect an unterminated character literal. If your program detects end-of-file before a character literal is terminated, it should write the message "Error: line X: unterminated character literal" to standard error. "X" should be the number of the line on which the unterminated character literal begins. Examples:

Standard Input              Standard Output             Standard Error

abc'def_n                   abc'def_n                   Warning:_sline_s1:_snewline_sin_scharacter_sliteral_n
ghi_n                       ghi_n                       Warning:_sline_s2:_snewline_sin_scharacter_sliteral_n
jkl_n                       jkl_n                       Warning:_sline_s3:_snewline_sin_scharacter_sliteral_n
                                                        Error:_sline_s1:_sunterminated_scharacter_sliteral_n

Note that the C compiler would consider that example to be erroneous. But the C preprocessor would not, and your program should not.

Detect an unterminated comment. If your program detects end-of-file before a comment is terminated, it should write the message "Error: line X: unterminated comment" to standard error. "X" should be the number of the line on which the unterminated comment begins. Examples:

Standard Input              Standard Output             Standard Error

abc/*def_n                  abc_n                       Error:_sline_s1:_sunterminated_scomment_n
ghi_n                       _n

abcdef_n                    abcdef_n                    Error:_sline_s2:_sunterminated_scomment_n
ghi/*_n                     ghi_n

abc/*def/ghi_n              abc_n                       Error:_sline_s1:_sunterminated_scomment_n
jkl_n                       _n

abc/*def*ghi_n              abc_n                       Error:_sline_s1:_sunterminated_scomment_n
jkl_n                       _n

abc/*def_n                  abc_n                       Error:_sline_s1:_sunterminated_scomment_n
ghi*_n                      _n

abc/*def_n                  abc_n                       Error:_sline_s1:_sunterminated_scomment_n
ghi/_n                      _n
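
Reporting the line on which a comment begins means keeping two numbers: the current line and the line where the open comment started. The sketch below illustrates only that bookkeeping. It is deliberately not an FSA: it scans an in-memory string with one character of lookahead, and it ignores string and character literals. The name findUnterminatedComment is hypothetical.

```c
#include <stddef.h>

/* Return the number of the line on which an unterminated comment
   begins in pcText, or 0 if every comment is terminated.
   Simplification: string and character literals are ignored. */
int findUnterminatedComment(const char *pcText)
{
   int iLine = 1;
   int iStartLine = 0;   /* 0 means "not inside a comment" */
   size_t i;

   for (i = 0; pcText[i] != '\0'; i++) {
      if (pcText[i] == '\n')
         iLine++;
      if (iStartLine == 0) {
         if (pcText[i] == '/' && pcText[i + 1] == '*') {
            iStartLine = iLine;   /* remember where it began */
            i++;                  /* skip the '*' */
         }
      } else if (pcText[i] == '*' && pcText[i + 1] == '/') {
         iStartLine = 0;
         i++;                     /* skip the '/' */
      }
   }
   return iStartLine;
}
```

For the second example above, findUnterminatedComment("abcdef\nghi/*\n") returns 2, the line on which the comment opens, even though end-of-file is reached on a later line.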

Your program should work for standard input lines of any length.

Your program may assume that the backslash-newline character sequence does not occur in standard input. That is, your program may assume that "logical" lines are identical to "physical" lines in standard input.

Design

We strongly suggest that you design your program as a deterministic finite state automaton (FSA), as described in lectures.

Your program should not consist of one large main function. Instead your program should consist of multiple small functions, each of which performs a single well-defined task. For example, you might create one function to implement each state of your FSA.
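
As an illustration of the one-function-per-state idea, here is a partial sketch covering only comment removal. All names (State, handleNormal, step, and so on) are hypothetical, and the sketch omits string literals, character literals, line counting, and error messages, so it is not a complete solution.

```c
#include <stdio.h>

/* Hypothetical state names. A complete solution needs further states
   for string literals, character literals, and escapes. */
enum State { NORMAL, SLASH_SEEN, IN_COMMENT, STAR_SEEN };

/* NORMAL: echo iChar unless it might begin a comment. */
enum State handleNormal(int iChar)
{
   if (iChar == '/')
      return SLASH_SEEN;
   putchar(iChar);
   return NORMAL;
}

/* SLASH_SEEN: a '/' was read but has not been written yet. */
enum State handleSlashSeen(int iChar)
{
   if (iChar == '*')
      return IN_COMMENT;
   putchar('/');
   if (iChar == '/')
      return SLASH_SEEN;
   putchar(iChar);
   return NORMAL;
}

/* IN_COMMENT: discard iChar, but echo newlines to keep numbering. */
enum State handleInComment(int iChar)
{
   if (iChar == '\n')
      putchar('\n');
   return (iChar == '*') ? STAR_SEEN : IN_COMMENT;
}

/* STAR_SEEN: a '*' was read inside a comment. */
enum State handleStarSeen(int iChar)
{
   if (iChar == '/') {
      putchar(' ');   /* a completed comment becomes one space */
      return NORMAL;
   }
   return handleInComment(iChar);
}

/* Dispatch iChar to the handler for eState; return the next state. */
enum State step(enum State eState, int iChar)
{
   switch (eState) {
      case NORMAL:     return handleNormal(iChar);
      case SLASH_SEEN: return handleSlashSeen(iChar);
      case IN_COMMENT: return handleInComment(iChar);
      case STAR_SEEN:  return handleStarSeen(iChar);
   }
   return eState;
}
```

A main function would call step once for each character read until end-of-file, then inspect the final state: in SLASH_SEEN the pending '/' still has to be written, and in IN_COMMENT or STAR_SEEN the comment is unterminated.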

Generally, all communication of data into and out of a function should occur via the function's parameters and its return value, and not via global variables. You should use ordinary "call-by-value" parameters to communicate data from a calling function to your function. You should use your function's return value to communicate data from your function back to its calling function. You should use "call-by-reference" parameters to communicate additional data from your function back to its calling function, or as bi-directional channels of communication. However, call-by-reference involves using pointer variables, which we have not discussed yet. So for this assignment you may use global variables instead of call-by-reference parameters. (But we encourage you to use call-by-reference parameters.)

In short, you should use ordinary call-by-value function parameters and function return values in your program as appropriate. But you need not use call-by-reference parameters; instead you may use global variables. In subsequent assignments you should use global variables sparingly, and only when there is no reasonable alternative.
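
To make the distinction concrete, the sketch below shows the same small task, keeping a line counter up to date, written in both styles. The function names are hypothetical and serve only as an illustration.

```c
/* Call-by-value: the caller passes the current line number and
   receives the updated one as the return value. */
int nextLineNumber(int iLine, int iChar)
{
   return (iChar == '\n') ? iLine + 1 : iLine;
}

/* Call-by-reference: the caller passes the address of its counter,
   and the function updates the counter through the pointer. */
void advanceLineNumber(int *piLine, int iChar)
{
   if (iChar == '\n')
      (*piLine)++;
}
```

A caller would write iLine = nextLineNumber(iLine, iChar) in the first style and advanceLineNumber(&iLine, iChar) in the second. The first style leaves the return value occupied by the line number; the second frees the return value for other data, such as the next FSA state.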

Generally, a (large) C program should consist of multiple source code files. For this assignment, you need not split your source code into multiple files. Instead you may place all source code in a single source code file. Subsequent assignments will ask you to write programs consisting of multiple source code files.

Please limit line lengths in your source code to 78 characters. Doing so allows us to print your work in two columns, thus saving paper.

We suggest that your program read characters from standard input using the standard C getchar() function.

Logistics

You should create your program on hats using bash, xemacs and gcc.

Step 1: Create Source Code

Use xemacs to create source code in a file named decomment.c.

Step 2: Preprocess, Compile, Assemble, and Link

Use the gcc command with the -Wall, -ansi, and -pedantic options to preprocess, compile, assemble, and link your program.

Step 3: Execute

Execute your program multiple times on various input files that test all logical paths through your code.

We have provided several files in hats directory /u/cos217/Assignment1. You should copy those files to your project directory, and use them to help you test your decomment program.

sampledecomment is an executable version of a correct assignment solution. Your program should write exactly (character for character) the same data to standard output and standard error as does sampledecomment. You should test your program using commands similar to these:

sampledecomment < somefile.c > output1 2> errors1
decomment < somefile.c > output2 2> errors2
diff output1 output2
diff errors1 errors2
rm output1 errors1 output2 errors2

The UNIX diff command finds differences between two given files. The executions of the diff command shown above should produce no output. If the command "diff output1 output2" produces output, then sampledecomment and your program have written different characters to standard output. Similarly, if the command "diff errors1 errors2" produces output, then sampledecomment and your program have written different characters to standard error.

Several .txt files (that is, files whose names end with ".txt") can serve as input files to your program.

grade1 and grade1diff are bash scripts that automate the testing process. Comments at the beginning of those files describe how to use them. After copying the scripts to your project directory, you may need to execute the commands "chmod 700 grade1" and "chmod 700 grade1diff" to make them executable.

Step 4: Create a readme File

Use xemacs to create a text file named "readme" that contains:

Your name and the assignment number.
A description of whatever help (if any) you received from others while doing the assignment, and the names of any individuals with whom you collaborated, as prescribed by the course Policies web page.
(Optionally) An indication of how much time you spent doing the assignment.
(Optionally) Your assessment of the assignment: Did it help you to learn? What did it help you to learn? Do you have any suggestions for improvement? Etc.
(Optionally) Any information that will help us to grade your work in the most favorable light. In particular you should describe all known bugs.

Descriptions of your code should not be in the readme file. Instead they should be integrated into your code as comments.

Step 5: Submit

Submit your work electronically on hats via the command:

/u/cos217/bin/i686/submit 1 decomment.c readme

If the directory /u/cos217/bin/i686 is in your PATH environment variable, then you can abbreviate that command as:

submit 1 decomment.c readme

If you are using the bash shell and have copied files .bashrc and .bash_profile from the /u/cos217 directory to your HOME directory, then directory /u/cos217/bin/i686 indeed is in your PATH environment variable. You can examine your PATH environment variable by executing the command "printenv PATH".

Grading

We will grade your work on functionality and design. We will consider understandability to be an important aspect of good design. See the next section for guidelines concerning program understandability. To encourage good coding practices, we will build using "gcc -Wall -ansi -pedantic" and take off points based on warning messages.

Program Understandability

An understandable program:

Uses a consistent and appropriate indentation scheme. All statements that are nested within a compound, if, switch, while, for, or do...while statement should be indented. Please use spaces instead of tabs to indent. Please use at least a 3-space indentation scheme. Note that the xemacs editor can apply a consistent indentation scheme to your program automatically.
Contains descriptive identifiers. The names of variables, constants, structures, types, and functions should indicate their purpose. Remember that C accepts identifiers of any length; under C89, at least the first 31 characters of an internal identifier are significant. We encourage you to prefix each variable name with characters that indicate its type. For example, the prefix "c" might indicate that the variable is of type char, "i" might indicate int, "pc" might mean pointer to char, "ui" might mean unsigned int, etc.
Contains carefully worded comments. Each source code file should begin with a comment that includes your name, the number of the assignment, and the name of the file. Each function -- especially the main function -- should begin with a comment that describes what the computer does when it executes that function. It should do so by explicitly referring to the function's parameters and return value. The comment also should state what, if anything, the computer reads from standard input or any other stream, and what, if anything, the computer writes to standard output, standard error, or any other stream while executing the function. Finally, the function's comment should state which global variables the computer uses and affects when executing the function.
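
The guidelines above might be applied to a single function as follows. This is only one possible reading of them; the function writeSpaces is hypothetical and is not part of the assignment.

```c
#include <stdio.h>

/* Write iCount space characters to standard output, where iCount is
   the function's only parameter. Return the number of spaces
   written, which is 0 if iCount is zero or negative. Read nothing
   from any stream, and write nothing to standard error. Use and
   affect no global variables. */
int writeSpaces(int iCount)
{
   int iWritten = 0;   /* the "i" prefix marks an int variable */

   while (iWritten < iCount) {
      putchar(' ');
      iWritten++;
   }
   return iWritten;
}
```

The comment names the parameter and the return value explicitly, states what is read and written, and mentions global variables, while the body uses a 3-space indentation scheme and type-prefixed identifiers.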
