The purpose of this assignment is to help you learn or review (1) the fundamentals of the C programming language, (2) the details of the "de-commenting" task of the C preprocessor, and (3) how to use the GNU/UNIX programming tools, especially bash, xemacs, and gcc.
The C preprocessor is an important part of the C programming system. Given a C source code file, the C preprocessor performs three jobs:
The de-comment job is substantial. For example, the C preprocessor must be sensitive to:
Your task is to compose a C program named "decomment" that performs a subset of the de-comment job of the C preprocessor, as defined below.
Your program should be a UNIX "filter." That is, your program should read characters from the standard input stream, and write characters to the standard output stream and possibly to the standard error stream. Specifically, your program should (1) read text, presumably a C program, from the standard input stream, (2) write that same text to the standard output stream with each comment replaced by a space, and (3) write error and warning messages as appropriate to the standard error stream. A typical execution of your program from the shell might look like this:
decomment < somefile.c > somefilewithoutcomments.c 2> errorandwarningmessages
In the following examples a space character is shown as "s" and a newline character as "n".
Your program should replace each comment with a space. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc/*def*/ghin | abcsghin | |
abc/*def*/sghin | abcssghin | |
abcs/*def*/ghin | abcssghin | |
Your program should define "comment" as in the C90 standard. In particular, your program should consider text of the form (/* ... */) to be a comment. It should not consider text of the form (// ... ) to be a comment. Example:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc//defn | abc//defn | |
Your program should allow a comment to span multiple lines. That is, your program should allow a comment to contain newline characters. Your program should add blank lines as necessary to preserve the original line numbering. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc/*defnghi*/jklnmnon | abcsnjklnmnon | |
abc/*defnghinjkl*/mnonpqrn | abcsnnmnonpqrn | |
Your program should not recognize nested comments. Example:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc/*def/*ghi*/jkl*/mnon | abcsjkl*/mnon | |
Your program should handle C string literals. In particular, your program should not consider text of the form (/* ... */) that occurs within a string literal ("...") to be a comment. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc"def/*ghi*/jkl"mnon | abc"def/*ghi*/jkl"mnon | |
abc/*def"ghi"jkl*/mnon | abcsmnon | |
abc/*def"ghijkl*/mnon | abcsmnon | |
Similarly, your program should handle C character literals. In particular, your program should not consider text of the form (/* ... */) that occurs within a character literal ('...') to be a comment. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc'def/*ghi*/jkl'mnon | abc'def/*ghi*/jkl'mnon | |
abc/*def'ghi'jkl*/mnon | abcsmnon | |
abc/*def'ghijkl*/mnon | abcsmnon | |
Note that the C compiler would consider the first of those examples to be erroneous (multiple characters in a character literal). But many C preprocessors would not, and your program should not.
Your program should handle escaped characters within string literals. That is, when your program reads a backslash (\) while processing a string literal, your program should consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program should consider text of the form ("...\" ...") to be a valid string literal which happens to contain the double quote character. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc"def\"ghi"jkln | abc"def\"ghi"jkln | |
abc"def\'ghi"jkln | abc"def\'ghi"jkln | |
Similarly, your program should handle escaped characters within character literals. That is, when your program reads a backslash (\) while processing a character literal, your program should consider the next character to be an ordinary character that is devoid of any special meaning. In particular, your program should consider text of the form ('...\' ...') to be a valid character literal which happens to contain the quote character. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc'def\'ghi'jkln | abc'def\'ghi'jkln | |
abc'def\"ghi'jkln | abc'def\"ghi'jkln | |
Note that the C compiler would consider both of those examples to be erroneous (multiple characters in a character literal). But the C preprocessor would not, and your program should not.
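The escape rule for string and character literals can be sketched as a two-state machine. This is a minimal illustration, not a required design: the function name findClosingQuote, the state names, and the choice to scan an in-memory string rather than standard input are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* States used while scanning the inside of a literal. */
enum LitState { IN_LITERAL, AFTER_BACKSLASH };

/* text points just past the opening quote of a literal whose delimiter
   is quote (a double quote or a single quote). Return the index of the
   closing delimiter, or -1 if the literal is unterminated.
   Hypothetical helper, for illustration only. */
int findClosingQuote(const char *text, char quote)
{
    enum LitState state = IN_LITERAL;
    int i;

    assert(text != NULL);

    for (i = 0; text[i] != '\0'; i++) {
        switch (state) {
        case IN_LITERAL:
            if (text[i] == '\\')
                state = AFTER_BACKSLASH;
            else if (text[i] == quote)
                return i;
            break;
        case AFTER_BACKSLASH:
            /* The escaped character has no special meaning. */
            state = IN_LITERAL;
            break;
        }
    }
    return -1;
}
```

For the text def\"ghi"jkl and the delimiter ", the function returns 8: the escaped quote at index 4 is skipped, and the quote after ghi closes the literal.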
Your program should handle newline characters in C string literals without generating errors or warnings. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc"defnghi"jkln | abc"defnghi"jkln | |
abc"defnghinjkl"mnon | abc"defnghinjkl"mnon | |
Note that a C compiler would consider those examples to be erroneous (newline character in a string literal). But many C preprocessors would not, and your program should not.
Similarly, your program should handle newline characters in C character literals without generating errors or warnings. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc'defnghi'jkln | abc'defnghi'jkln | |
abc'defnghinjkl'mnon | abc'defnghinjkl'mnon | |
Note that a C compiler would consider those examples to be erroneous (multiple characters in a character literal, newline character in a character literal). But many C preprocessors would not, and your program should not.
Your program should handle unterminated string and character literals without generating errors or warnings. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc"def/*ghi*/jkln | abc"def/*ghi*/jkln | |
abc'def/*ghi*/jkln | abc'def/*ghi*/jkln | |
Note that a C compiler would consider those examples to be erroneous (unterminated string literal, unterminated character literal, multiple characters in a character literal). But many C preprocessors would not, and your program should not.
Your program should detect an unterminated comment. If your program detects end-of-file before a comment is terminated, it should write the message "Error: line X: unterminated comment" to the standard error stream. "X" should be the number of the line on which the unterminated comment begins. Examples:
Standard Input Stream | Standard Output Stream | Standard Error Stream |
abc/*defnghin | abcsnn | Error:slines1:sunterminatedscommentn |
abcdefnghi/*n | abcdefnghisn | Error:slines2:sunterminatedscommentn |
abc/*def/ghinjkln | abcsnn | Error:slines1:sunterminatedscommentn |
abc/*def*ghinjkln | abcsnn | Error:slines1:sunterminatedscommentn |
abc/*defnghi*n | abcsnn | Error:slines1:sunterminatedscommentn |
abc/*defnghi/n | abcsnn | Error:slines1:sunterminatedscommentn |
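One way to obtain "X" is to count newline characters while scanning and to remember the current line number at the moment a comment opens. The sketch below illustrates only that bookkeeping, under several assumptions: it scans an in-memory string with one character of lookahead instead of running a DFA over standard input, it ignores string and character literals, and the name unterminatedCommentLine is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the 1-based number of the line on which an unterminated
   comment begins, or 0 if every comment in text is terminated.
   Literals are ignored in this simplified sketch. */
int unterminatedCommentLine(const char *text)
{
    int line = 1;        /* number of the line being scanned */
    int startLine = 0;   /* line where the open comment began; 0 if none */
    size_t i = 0;

    assert(text != NULL);

    while (text[i] != '\0') {
        if (startLine == 0 && text[i] == '/' && text[i + 1] == '*') {
            startLine = line;   /* remember where the comment opened */
            i += 2;
        } else if (startLine != 0 && text[i] == '*' && text[i + 1] == '/') {
            startLine = 0;      /* comment closed */
            i += 2;
        } else {
            if (text[i] == '\n')
                line++;
            i++;
        }
    }
    return startLine;
}
```

Note that the line counter keeps advancing inside a comment, but the reported number is the saved startLine, which matches the second example above (the comment opens on line 2).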
Your program should work for standard input lines of any length.
You may assume that the final line of the standard input stream ends with the newline character, as files created with xemacs typically do.
Your program may assume that the backslash-newline character sequence does not occur in the standard input stream. That is, your program may assume that logical lines are identical to physical lines in the standard input stream.
Your program (more precisely, its main() function) should return EXIT_FAILURE if it was unsuccessful, that is, if it detects an unterminated comment and so was unable to remove comments properly. Otherwise it should return 0.
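As a rough sketch of how the error message and the return value fit together, the helper below could be called at end-of-file and its result returned from main(). The name exitStatus and the convention that 0 means "no unterminated comment" are assumptions made for this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* unterminatedLine is the line on which an unterminated comment began,
   or 0 if there was none. Write the error message if needed and return
   the value that main() should return. Hypothetical helper. */
int exitStatus(int unterminatedLine)
{
    assert(unterminatedLine >= 0);

    if (unterminatedLine != 0) {
        fprintf(stderr, "Error: line %d: unterminated comment\n",
                unterminatedLine);
        return EXIT_FAILURE;
    }
    return 0;
}
```

The message goes to stderr rather than stdout, so redirecting standard output to a file still leaves the de-commented text free of diagnostics.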
Design your program as a deterministic finite state automaton (DFA, alias FSA). The DFA concept is described in lectures, and in Section 7.3 of the book Introduction to CS (Sedgewick and Wayne). That book section is available through the web at http://www.cs.princeton.edu/introcs/73fsa/.
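A DFA maps naturally onto C as an enumerated type for the states and a switch statement that selects the transition for the current state. The sketch below is a deliberately incomplete illustration: it knows only about comments (no string or character literals, no error reporting), it works on an in-memory string so that it is easy to try out, and it uses a pointer parameter for the output position. The names State, step, and decommentString are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* States of a simplified DFA that recognizes comments only. */
enum State { NORMAL, SAW_SLASH, IN_COMMENT, SAW_STAR };

/* Handle character c in state s. Append any output characters to out
   at index *pos, advance *pos, and return the next state. */
static enum State step(enum State s, int c, char *out, size_t *pos)
{
    switch (s) {
    case NORMAL:
        if (c == '/') return SAW_SLASH;        /* maybe a comment start */
        out[(*pos)++] = (char)c;
        return NORMAL;
    case SAW_SLASH:
        if (c == '*') { out[(*pos)++] = ' '; return IN_COMMENT; }
        out[(*pos)++] = '/';                   /* the slash was ordinary */
        if (c == '/') return SAW_SLASH;
        out[(*pos)++] = (char)c;
        return NORMAL;
    case IN_COMMENT:
        if (c == '*') return SAW_STAR;         /* maybe a comment end */
        if (c == '\n') out[(*pos)++] = '\n';   /* keep line numbering */
        return IN_COMMENT;
    case SAW_STAR:
        if (c == '/') return NORMAL;           /* comment closed */
        if (c == '*') return SAW_STAR;
        if (c == '\n') out[(*pos)++] = '\n';
        return IN_COMMENT;
    }
    return s;
}

/* Run the DFA over the string in and store the result in out, which
   must have room for at least strlen(in) + 1 bytes. Return the state
   in which the DFA stopped. */
enum State decommentString(const char *in, char *out)
{
    enum State s = NORMAL;
    size_t pos = 0;

    assert(in != NULL);
    assert(out != NULL);

    for (; *in != '\0'; in++)
        s = step(s, (unsigned char)*in, out, &pos);
    if (s == SAW_SLASH)
        out[pos++] = '/';                      /* flush a pending slash */
    out[pos] = '\0';
    return s;
}
```

With the input "abc/*def*/ghi\n" the sketch produces "abc ghi\n" and stops in NORMAL. If the input ends inside a comment, it stops in IN_COMMENT or SAW_STAR, which is how a caller could tell that the comment was unterminated. A complete solution would add states for string and character literals and would read with getchar() instead of from a string.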
Generally, a (large) C program should consist of multiple source code files. For this assignment, you need not split your source code into multiple files. Instead you may place all source code in a single source code file. Subsequent assignments will ask you to write programs consisting of multiple source code files.
We suggest that your program use the standard C getchar() function to read characters from the standard input stream.
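A common pitfall when reading characters is storing the result in a char: the value must be held in an int so that EOF can be distinguished from every valid character. The sketch below shows the usual loop shape. Because getchar() is equivalent to getc(stdin), the example takes the streams as parameters so that it can be tried on any file; the name copyChars is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Copy every character from in to out and return the number of
   characters copied. Hypothetical helper, for illustration only. */
long copyChars(FILE *in, FILE *out)
{
    int c;            /* int, not char, so that EOF stays distinct */
    long count = 0;

    assert(in != NULL);
    assert(out != NULL);

    while ((c = getc(in)) != EOF) {
        putc(c, out);
        count++;
    }
    return count;
}
```

Calling copyChars(stdin, stdout) from main() gives a filter that passes its input through unchanged. In a program that calls getchar() directly, the loop condition becomes (c = getchar()) != EOF.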
You should create your program on hats using bash, xemacs and gcc.
Express your DFA using the traditional "ovals and labeled arrows" notation. More precisely, use the same notation as is used in the examples from Section 7.3 of the Sedgewick and Wayne book. Capture as much of the program's logic as you can within your DFA. The more logic you express in your DFA, the better your grade on the DFA will be.
Use xemacs to create source code in a file named decomment.c that implements your DFA.
Use the gcc command with the -Wall, -ansi, and -pedantic options to preprocess, compile, assemble, and link your program. Perform each step individually, and examine the intermediate results to the extent possible.
Execute your program multiple times on various input files that test all logical paths through your code.
We have provided several files in hats directory /u/cos217/Assignment1:
(1) sampledecomment is an executable version of a correct assignment solution. Your program should write exactly (character for character) the same data to the standard output stream and the standard error stream as sampledecomment does. You should test your program using commands similar to these:
sampledecomment < somefile.c > output1 2> errors1
decomment < somefile.c > output2 2> errors2
diff -c output1 output2
diff -c errors1 errors2
rm output1 errors1 output2 errors2
The UNIX diff command finds differences between two given files. The executions of the diff command shown above should produce no output. If the command "diff -c output1 output2" produces output, then sampledecomment and your program have written different characters to the standard output stream. Similarly, if the command "diff -c errors1 errors2" produces output, then sampledecomment and your program have written different characters to the standard error stream.
(2) Several .txt files (that is, files whose names end with ".txt") can serve as input files to your program.
(3) testdecomment and testdecommentdiff are bash scripts that automate the testing process. Comments at the beginning of those files describe how to use them. After copying the scripts to your project directory, you may need to execute the commands "chmod 700 testdecomment" and "chmod 700 testdecommentdiff" to give them "executable" permissions.
Copy those files to your project directory, and use them to help you test your decomment program.
You also might test your decomment program against its own source code using a command sequence such as this:
sampledecomment < decomment.c > output1 2> errors1
decomment < decomment.c > output2 2> errors2
diff -c output1 output2
diff -c errors1 errors2
rm output1 errors1 output2 errors2
or this:
decomment < decomment.c > decomment2.c
gcc -E decomment2.c > one.i
rm decomment2.c
cp decomment.c decomment2.c
gcc -E decomment2.c > two.i
rm decomment2.c
diff -c one.i two.i
rm one.i two.i
Use xemacs to create a text file named "readme" (not "readme.txt", or "README", or "Readme", etc.) that contains:
Descriptions of your code should not be in the readme file. Instead they should be integrated into your code as comments.
Your readme file should be a plain text file. Don't create your readme file using Microsoft Word or any other word processor.
Submit your work. Submit your decomment.c file, the files that gcc generated from it, and your readme file electronically by issuing this command on hats:
submit 1 decomment.c decomment.i decomment.s decomment.o decomment readme
Also submit your DFA. You can do that using either of these two options:
If you use option 2, then name the text file "dfa" (not "dfa.txt", "DFA", etc.) and submit it by issuing this command on hats:
submit 1 dfa
We will grade your work on two kinds of quality: quality from the user's point of view, and quality from the programmer's point of view. To encourage good coding practices, we will build using "gcc -Wall -ansi -pedantic" and take off points based on warning messages.
From the user's point of view, a program has quality if it behaves as it should. The correct behavior of the decomment program is defined by the previous sections of this assignment specification, and by the behavior of the given sampledecomment program.
From the programmer's point of view, a program has quality if it is well styled and thereby easy to maintain. In part, style is defined by the rules given in The Practice of Programming (Kernighan and Pike), as summarized by the Rules of Programming Style document. For this assignment we will pay particular attention to rules 1-24. These additional rules apply:
As prescribed by Kernighan and Pike style rule 25, generally you should avoid using global variables. Instead all communication of data into and out of a function should occur via the function's parameters and its return value. You should use ordinary "call-by-value" parameters to communicate data from a calling function to your function. You should use your function's return value to communicate data from your function back to its calling function. You should use "call-by-reference" parameters to communicate additional data from your function back to its calling function, or as bi-directional channels of communication.
However, call-by-reference involves using pointer variables, which we have not discussed yet. So for this assignment you may use global variables instead of call-by-reference parameters. (But we encourage you to use call-by-reference parameters.)
In short, you should use ordinary call-by-value function parameters and function return values in your program as appropriate. But you need not use call-by-reference parameters; instead you may use global variables. In subsequent assignments you should use global variables only when there is a good reason to do so.
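The three ways of communicating data discussed above can be compared side by side. The sketch below updates a line counter in each style; the function and variable names are hypothetical and chosen only for this comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Style 1: call-by-value parameters in, return value out.
   The caller writes: line = nextLine(line, c); */
int nextLine(int line, int c)
{
    return (c == '\n') ? line + 1 : line;
}

/* Style 2: call-by-reference. The function updates the caller's
   variable through a pointer. The caller writes: countLine(&line, c); */
void countLine(int *line, int c)
{
    assert(line != NULL);
    if (c == '\n')
        (*line)++;
}

/* Style 3: a global variable, permitted for this assignment in place
   of call-by-reference parameters. */
static int lineCount = 1;

void countLineGlobal(int c)
{
    if (c == '\n')
        lineCount++;
}
```

Style 1 suffices whenever a function has a single result to hand back. Styles 2 and 3 become relevant only when a function must update more than one piece of data, for example both the DFA state and a line counter.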